Metafilter Corpus - Frequency Tables

This is a collection of frequency tables calculated from Metafilter comment text. The tables cover various subcorpora, broken down by date range and by the subsite or subsites each table reflects. Four Metafilter subsites are represented in these files: Metafilter (mefi), Ask Metafilter (askme), Metatalk (meta), and Metafilter Music (music). The "all sites" files incorporate comment data from all four.

The files are tab-separated ASCII plain text; each zip file expands the file or files within it into the local directory.

The files contain brief header information, followed by successive rows of three tab-separated values: raw word count, calculated frequency per million words, and the word token itself. Filenames indicate the subsite or subsites covered, the date on which the calculation begins, and the date on which it stops; the stop date itself is not included. For example, "askme--2006-01-01--2007-01-01.txt.zip" contains frequency data for all words used in comments on Ask Metafilter between January 1st, 2006 and December 31st, 2006, inclusive.
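As a rough illustration of the layout described above, here is a minimal Python sketch for reading one of these files. The function names (parse_filename, parse_rows, per_million) are made up for this example and are not part of the corpus. The exact header layout is not specified here, so the sketch simply skips any line that does not look like a three-field data row. The per-million helper assumes the usual definition (count divided by total words, times one million), which may differ in rounding from the values stored in the files.

```python
import re
from datetime import date, timedelta

# Assumed pattern, based on the example "askme--2006-01-01--2007-01-01.txt.zip".
# Only names with two full dates are handled here.
FILENAME_RE = re.compile(
    r"^(?P<site>[a-z]+)--(?P<start>\d{4}-\d{2}-\d{2})"
    r"--(?P<end>\d{4}-\d{2}-\d{2})\.txt(?:\.zip)?$"
)


def parse_filename(name):
    """Return (site, first_day, last_day) with last_day inclusive."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError("unrecognized filename: %r" % name)
    first_day = date.fromisoformat(match["start"])
    # The second date in the name is the day the calculation stops,
    # so the last day actually covered is the day before it.
    stop_day = date.fromisoformat(match["end"])
    return match["site"], first_day, stop_day - timedelta(days=1)


def parse_rows(lines):
    """Yield (count, per_million, token) for each data row.

    Header lines are skipped by assumption: anything that is not exactly
    three tab-separated fields with numeric first two fields is ignored.
    """
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 3:
            continue
        count_text, per_million_text, token = fields
        try:
            yield int(count_text), float(per_million_text), token
        except ValueError:
            continue


def per_million(count, total_words):
    """Conventional frequency per million words (assumed definition)."""
    return count / total_words * 1_000_000
```

Because parse_rows accepts any iterable of text lines, it works equally well on an open file handle or on lines read out of a zip archive.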

For more information about the process used to produce these files or for clarification of some of the file-specific notes below, please see the Metafilter Wiki page about the Metafilter Corpus project. You may also be interested in the data available at the weekly-updated Metafilter Infodump.

If you have specific questions about these files or would like to see something additional made available, contact cortex.

Complete tables

These files contain frequency information for all comments made from the launch of Metafilter in mid-1999 through 11:59 pm, Dec 31st, 2012. Note that subsites other than mefi were launched at later dates: meta in early 2000, askme in late 2003, music in the middle of 2006. Approximate word counts and file sizes are provided for reference; exact word-count info is provided in the header of each file.

Yearly

As above, but each file contains a frequency table limited to a single year. These files are somewhat smaller than the corresponding Complete files; the largest are about 3MB.

Monthly

Each of these zip files contains twelve frequency files, one for each month of the year stated. These are comparable in size to the Complete files, topping out at 7MB.

Daily

Each of these zip files contains 365 (or 366 for leap years) individual daily files for the corresponding year. These zips are the largest in the collection, with allsites--2012 about 32MB total.
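
Since the Monthly and Daily zips hold many files each, it can be convenient to read the members directly from the archive rather than expanding them first. The following is a minimal Python sketch using the standard zipfile module; the function name iter_zip_members is made up for this example, and the member names inside each zip are not assumed to follow any particular pattern beyond being plain text files.

```python
import io
import zipfile


def iter_zip_members(zip_path):
    """Yield (member_name, list_of_text_lines) for each file in a zip.

    zip_path may be a filesystem path or an open binary file object.
    Members are visited in sorted name order.
    """
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if name.endswith("/"):
                continue  # skip directory entries, if any
            with archive.open(name) as raw:
                # The files are described as ASCII; replace stray bytes
                # rather than failing on them.
                text = io.TextIOWrapper(raw, encoding="ascii", errors="replace")
                yield name, list(text)
```

Each list of lines can then be split on tabs to recover the count, frequency, and token columns.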