A list of words from the SUBTLEX movie subtitles corpus, sorted by frequency.

subtlex-word-frequencies

An array of over 200,000 words, sorted from most to least frequent in spoken English. Each entry is an object holding a word and its count.

The word counts are derived from SUBTLEX-US, a corpus of subtitles from English-language movies. The corpus contains over 6,000,000 words. See the data/ directory for more information about the corpus.

Installation

Download Node.js from nodejs.org and install it if you haven't already.

npm install subtlex-word-frequencies --save

Usage

const words = require('subtlex-word-frequencies')

// Total number of entries in the list
console.log(words.length)
// 200182

// The three most frequent words
words.slice(0, 3)
/*
[ { word: 'you', count: 1848036 },
  { word: 'i', count: 1480046 },
  { word: 'the', count: 1472467 } ]
*/

// The five most frequent words containing "chick"
words
  .filter(function (entry) { return /chick/.test(entry.word) })
  .slice(0, 5)
/*
[ { word: 'chicken', count: 3148 },
  { word: 'chick', count: 1282 },
  { word: 'chicks', count: 724 },
  { word: 'chickens', count: 516 },
  { word: 'chickenshit', count: 81 } ]
*/
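
Filtering scans the whole array on every call. If you need to look up many individual words, one option is to index the array once. The sketch below is not part of this package: buildIndex is a hypothetical helper, and it assumes the { word, count } entry shape and the descending sort order shown above. It runs on a small sample copied from the output above so it works without installing anything.

```javascript
// Hypothetical helper (not part of subtlex-word-frequencies).
// Builds a Map from word to { rank, count } so lookups do not rescan the array.
// Assumes entries look like { word, count } and are sorted by descending count.
function buildIndex (entries) {
  const index = new Map()
  entries.forEach(function (entry, i) {
    // Rank is 1-based: the most frequent word has rank 1
    index.set(entry.word, { rank: i + 1, count: entry.count })
  })
  return index
}

// Sample data copied from the usage output above. In real use, pass
// require('subtlex-word-frequencies') instead.
const sample = [
  { word: 'you', count: 1848036 },
  { word: 'i', count: 1480046 },
  { word: 'the', count: 1472467 }
]

const index = buildIndex(sample)

console.log(index.get('the'))
// { rank: 3, count: 1472467 }

console.log(index.has('chicken'))
// false (not in the three-word sample)
```

Building the index costs one pass over the array, after which each lookup is a single Map access.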

Tests

npm install
npm test

Dependencies

None

Dev Dependencies

  • lodash: The modern build of lodash modular utilities.
  • split2: split a Text Stream into a Line Stream, using Stream 3
  • through2: A tiny wrapper around Node streams2 Transform to avoid explicit subclassing noise
  • extract: Extract specific properties from objects and generate new one.
  • standard: JavaScript Standard Style

License

ISC

Generated by package-json-to-readme