node-lda

LDA topic modeling for node.js

Primary language: JavaScript. License: Apache License 2.0 (Apache-2.0).

LDA

Latent Dirichlet allocation (LDA) topic modeling in JavaScript for Node.js. LDA is a machine learning algorithm that extracts topics and their related keywords from a collection of documents.

In LDA, a document may contain several different topics, each with its own related terms. The algorithm uses a probabilistic model to find the number of topics you specify and to extract the keywords related to each one. For example, a document may contain topics that could be classified as beach-related and weather-related. The beach topic may contain related words such as sand, ocean, and water. Similarly, the weather topic may contain related words such as sun, temperature, and clouds.

See http://en.wikipedia.org/wiki/Latent_Dirichlet_allocation
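
To make the idea of several topics per document concrete, the sketch below represents each topic as a distribution over words and a document as a mixture of topics, then computes the probability of a word appearing in that document. The numbers and the `wordProbability` helper are made up for illustration; they are not produced by, or part of, this library.

```javascript
// Illustrative (made-up) topics: each maps a word to its probability within the topic.
const topics = {
  beach: { sand: 0.4, ocean: 0.3, water: 0.3 },
  weather: { sun: 0.5, temperature: 0.3, clouds: 0.2 },
};

// Illustrative document: half beach-related, half weather-related.
const documentMixture = { beach: 0.5, weather: 0.5 };

// Hypothetical helper: P(word | document) is the sum over topics of
// P(topic | document) * P(word | topic).
function wordProbability(word, mixture, topicTable) {
  let total = 0;
  for (const [topicName, weight] of Object.entries(mixture)) {
    // A word that a topic does not list contributes nothing for that topic.
    const wordGivenTopic = topicTable[topicName][word] ?? 0;
    total += weight * wordGivenTopic;
  }
  return total;
}

console.log(wordProbability("sand", documentMixture, topics));
console.log(wordProbability("sun", documentMixture, topics));
```

LDA works in the opposite direction: given only the documents, it estimates distributions of this kind.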

npm install lda

Usage

const lda = require("lda");

// Example document.
const text = "Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones.";

// Extract sentences.
const documents = text.match( /[^\.!\?]+[\.!\?]+/g );

// Run LDA to get terms for 2 topics (5 terms each).
const result = lda(documents, 2, 5);

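The above example produces the following result with two topics (topic 1 is "cat-related", topic 2 is "dog-related"). The values in parentheses are probabilities between 0 and 1 as returned by the library, even though the traversal code in the Output section prints them with a trailing % sign: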

Topic 1
cats (0.21%)
dogs (0.19%)
small (0.1%)
mice (0.1%)
chase (0.1%)

Topic 2
dogs (0.21%)
cats (0.19%)
big (0.11%)
eat (0.1%)
bones (0.1%)
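
In the usage example, each sentence is passed to lda as a separate document. The splitting step can be sketched on its own as follows. The `splitSentences` helper is hypothetical (not part of the library); it uses a regular expression equivalent to the one in the usage example, trims each sentence, and returns an empty array when `match` finds nothing and returns `null`.

```javascript
// Hypothetical helper: split text into sentences with a regular expression
// equivalent to the one in the usage example.
function splitSentences(text) {
  // String.prototype.match returns null when there is no match at all.
  const matches = text.match(/[^.!?]+[.!?]+/g) || [];
  // Matches after the first keep their leading space, so trim each one.
  return matches.map((sentence) => sentence.trim());
}

const sentences = splitSentences(
  "Cats are small. Dogs are big. Cats like to chase mice. Dogs like to eat bones."
);
console.log(sentences.length); // 4
console.log(sentences[1]); // "Dogs are big."
```

Guarding against `null` matters for input that contains no sentence-ending punctuation.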

Output

LDA returns an array of topics, each containing an array of terms. The result has the following format:

[
  [ { term: "dogs", probability: 0.2 },
    { term: "cats", probability: 0.2 },
    { term: "small", probability: 0.1 },
    { term: "mice", probability: 0.1 },
    { term: "chase", probability: 0.1 }
  ],
  [ { term: "dogs", probability: 0.2 },
    { term: "cats", probability: 0.2 },
    { term: "bones", probability: 0.11 },
    { term: "eat", probability: 0.1 },
    { term: "big", probability: 0.099 }
  ]
]

The result can be traversed as follows:

const result = lda(documents, 2, 5);

// For each topic.
result.forEach((topic, i) => {
  console.log(`Topic ${i + 1}`);

  // For each term in the topic.
  for (const { term, probability } of topic) {
    console.log(`${term} (${probability}%)`);
  }

  console.log("");
});
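
The probabilities in the result are fractions between 0 and 1, so a value such as 0.2 corresponds to 20%. A minimal sketch of a formatter that prints true percentages follows. `formatTopics` is a hypothetical helper, and `sampleResult` is hand-written data in the documented shape rather than real output.

```javascript
// Hand-written sample in the documented result shape (not real library output).
const sampleResult = [
  [
    { term: "cats", probability: 0.2 },
    { term: "small", probability: 0.1 },
  ],
  [
    { term: "dogs", probability: 0.2 },
    { term: "big", probability: 0.099 },
  ],
];

// Hypothetical helper: turn each topic into lines such as "cats (20.0%)".
function formatTopics(result) {
  return result.map((topic, index) => {
    const lines = topic.map(
      ({ term, probability }) => `${term} (${(probability * 100).toFixed(1)}%)`
    );
    return [`Topic ${index + 1}`, ...lines].join("\n");
  });
}

console.log(formatTopics(sampleResult).join("\n\n"));
```

Returning strings instead of logging directly keeps the formatting easy to reuse, for example when writing the topics to a file.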

Additional Languages

LDA uses stop-words to ignore common terms in the text (for example: this, that, it, we). By default, the English stop-words list is used. To use other or additional languages, pass an array of language ids as the fourth argument, as follows:

// Use English (this is the default).
let result = lda(documents, 2, 5, ["en"]);

// Use German.
result = lda(documents, 2, 5, ["de"]);

// Use English + German.
result = lda(documents, 2, 5, ["en", "de"]);

To add a new language-specific stop-words list, create a file /lda/lib/stopwords_XX.js, where XX is the id for the language. For example, a French stop-words list could be named "stopwords_fr.js". The contents of the file should follow the format of an existing stop-words list. The format is as follows:

exports.stop_words = [
    "cette",
    "que",
    "une",
    "il"
];
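
To illustrate what a stop-words list is for, the sketch below merges two lists in the format above and drops matching words from a sentence. The `removeStopWords` helper, the tokenization, and the tiny word lists are assumptions for illustration; the library's internal filtering may work differently.

```javascript
// Tiny illustrative lists in the same shape as a stopwords_XX.js file.
const english = { stop_words: ["this", "that", "it", "we"] };
const french = { stop_words: ["cette", "que", "une", "il"] };

// Hypothetical helper: lower-case the text, split it into words, and drop
// any word that appears in one of the given stop-words lists.
function removeStopWords(text, lists) {
  const stopWords = new Set(lists.flatMap((list) => list.stop_words));
  return text
    .toLowerCase()
    .split(/[^\p{L}]+/u) // Split on anything that is not a letter.
    .filter((word) => word.length > 0 && !stopWords.has(word));
}

const kept = removeStopWords("We think that cette plage est belle.", [english, french]);
console.log(kept);
```

Using a `Set` keeps each lookup cheap even when several language lists are combined.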

Setting a Random Seed

A specific random seed can be used to reproduce the same terms and probabilities on subsequent runs. The seed is passed as the seventh argument; the example below leaves the fourth, fifth, and sixth arguments as null:

// Use the random seed 123.
const result = lda(documents, 2, 5, null, null, null, 123);
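
Why a fixed seed makes runs repeatable can be sketched with a small seeded generator. The linear congruential generator below (using the widely published Numerical Recipes constants) only illustrates the principle; it is an assumption for this example and is unrelated to whatever generator the library uses internally.

```javascript
// Illustrative seeded pseudo-random generator (a linear congruential generator).
// It is NOT the generator used by the library.
function createSeededRandom(seed) {
  let state = seed >>> 0; // Keep the state as an unsigned 32-bit integer.
  return function next() {
    // state * 1664525 stays below 2^53, so the arithmetic is exact.
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // Value in [0, 1).
  };
}

const first = createSeededRandom(123);
const second = createSeededRandom(123);

// Two generators created with the same seed produce the same sequence.
const runA = [first(), first(), first()];
const runB = [second(), second(), second()];
console.log(runA.every((value, i) => value === runB[i])); // true
```

Any algorithm that draws all of its random numbers from such a generator will make the same choices on every run with the same seed.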

Author

Kory Becker http://www.primaryobjects.com

Based on the original JavaScript implementation: https://github.com/awaisathar/lda.js