
Various corpora for handy access.

MIT License


Personally curated corpora (currently news headlines).


News headlines collected from the RSS feeds of various outlets, starting around 05/2017. The corpus may not contain every article published over a given period, and many outlets were dropped from collection as time went on.

Headlines are provided in two directory/file structures: Categorized and Dated. Both formats share a root directory labeled with the date the corpus was created.

  • Categorized contains a single file for each news outlet.
  • Dated contains a directory for each news outlet, with one file for each day on which that outlet had headlines.
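
The two layouts could be read along the following lines. This is only a minimal sketch: it assumes each file is UTF-8 plain text with one headline per line, and that outlet and day names can be taken from the file and directory names. Neither assumption is confirmed here, and the function names (`read_headlines`, `load_categorized`, `load_dated`) are hypothetical helpers, not part of this repository.

```python
from pathlib import Path


def read_headlines(path):
    """Return the non-empty lines of one file.

    Assumption: one headline per line, UTF-8 encoded.
    """
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def load_categorized(categorized_dir):
    """Categorized layout: one file per outlet.

    Returns {outlet name (file stem): [headlines]}.
    """
    return {
        f.stem: read_headlines(f)
        for f in sorted(Path(categorized_dir).iterdir())
        if f.is_file()
    }


def load_dated(dated_dir):
    """Dated layout: one directory per outlet, one file per day.

    Returns {outlet name (directory name): {day (file stem): [headlines]}}.
    """
    return {
        outlet.name: {
            day.stem: read_headlines(day)
            for day in sorted(outlet.iterdir())
            if day.is_file()
        }
        for outlet in sorted(Path(dated_dir).iterdir())
        if outlet.is_dir()
    }
```

Pass each function the path of the corresponding directory under the dated root. If the files turn out to use another format (for example CSV or JSON), only `read_headlines` would need to change.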