A probabilistic word chain extraction tool that ranks word n-grams by how likely they are to form (a sort of) collocation.
The tool (see the sketch after this list):
- reads a set of texts from a given directory structure (each subdirectory containing a single file);
- keeps only letters, digits and single quotes, replacing everything else with whitespace;
- splits the remaining text into "words" on whitespace;
- removes stopwords listed in a wordlist file (expected in the running directory under the name "english.list", with one stopword per line);
- removes double quotes (either '' or ");
- removes genitives ("'s" forms).
After the above tokenization and cleaning process, the system estimates a ratio for each n-gram within a length span (default: 2 to 5). The ratio expresses the probability of the n-gram occurring in the input corpus relative to the probability of seeing it if the text had been generated by a random process.
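As one concrete reading of that ratio, the sketch below takes the random baseline to be the product of the individual word (unigram) probabilities, i.e. an independence assumption; the estimator actually used by the tool may differ. The function name, the flattened token list, and the smoothing-free counts are assumptions for the example.

```python
from collections import Counter

def ngram_ratios(tokens, min_n=2, max_n=5):
    # tokens: a flat list of cleaned words from the corpus.
    total = len(tokens)
    unigram = Counter(tokens)
    ratios = {}
    for n in range(min_n, max_n + 1):
        ngrams = Counter(tuple(tokens[i:i + n]) for i in range(total - n + 1))
        num_positions = total - n + 1
        for gram, count in ngrams.items():
            observed = count / num_positions
            # Random baseline: product of the unigram probabilities of the words.
            expected = 1.0
            for w in gram:
                expected *= unigram[w] / total
            ratios[gram] = observed / expected
    return ratios
```

Sorting the resulting dictionary by value then ranks the candidate collocations, with the highest ratios marking n-grams that occur far more often than the random baseline would predict.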
This implementation is close - but not identical - to the description of Symbols and Non-symbols in my thesis and in the following paper:
Giannakopoulos, George, Vangelis Karkaletsis, George Vouros, and Panagiotis Stamatopoulos. 2008. Summarization System Evaluation Revisited: N-Gram Graphs. ACM Trans. Speech Lang. Process. 5(3): 1–39.
This work is licensed under the Apache License, Version 2.0.