A probabilistic word chain extraction tool that ranks word n-grams by how likely they are to form (a sort of) collocation.
The tool (see the sketch after this list):
- reads a set of texts from a given directory structure (each subdirectory containing a single file);
- keeps only letters, digits and single quotes, replacing everything else with whitespace;
- splits the remaining text into "words" on whitespace;
- removes stopwords listed in a wordlist file (expected in the running directory under the name "english.list", with one stopword per line);
- removes double quotes (either '' or ");
- removes genitives ("'s" forms).
After the above tokenization and cleaning process, the system estimates a ratio for each n-gram within a length span (default: 2 to 5). The ratio expresses the probability of the n-gram occurring in the input corpus relative to the probability of seeing it if the text had been generated by a random process.
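As one concrete reading of that ratio, the sketch below takes the random baseline to be the product of the individual word (unigram) probabilities, i.e. an independence assumption; the estimator actually used by the tool may differ. The function name, the flattened token list, and the smoothing-free counts are assumptions for the example.

```python
from collections import Counter

def ngram_ratios(tokens, min_n=2, max_n=5):
    # tokens: a flat list of cleaned words from the corpus.
    total = len(tokens)
    unigram = Counter(tokens)
    ratios = {}
    for n in range(min_n, max_n + 1):
        ngrams = Counter(tuple(tokens[i:i + n]) for i in range(total - n + 1))
        num_positions = total - n + 1
        for gram, count in ngrams.items():
            observed = count / num_positions
            # Random baseline: product of the unigram probabilities of the words.
            expected = 1.0
            for w in gram:
                expected *= unigram[w] / total
            ratios[gram] = observed / expected
    return ratios
```

Sorting the resulting dictionary by value then ranks the candidate collocations, with the highest ratios marking n-grams that occur far more often than the random baseline would predict.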
This implementation is close - but not identical - to the description of Symbols and Non-symbols in my thesis and in the following paper:
Giannakopoulos, George, Vangelis Karkaletsis, George Vouros, and Panagiotis Stamatopoulos. 2008. Summarization System Evaluation Revisited: N-Gram Graphs. ACM Trans. Speech Lang. Process. 5(3): 1–39.
This work is licensed under the Apache License, Version 2.0.