NGramTool (http://www.nlplab.cn/zhangle/ngram.html) =================================================== This package contains a set of tools for manipulating N-gram statistics from large raw corpus. The tools provided here distinguish themselves from other N-Gram extraction programs in that: 1. Unicode code base. All text are represented as Unicode (UCS16) internally for easily handling of oriental languages such as Chinese, Korean or Japanese. 2. Both Word N-gram and Character N-gram can be extracted. This feature is especially useful for Oriental language processing. Since in those languages (like Chinese) there is no explicit word boundary between words, and the basic processing units are usually single characters. 3. Statistical Substring Reduction (SSR) algorithms (see below). Statistical Substring Reduction is a powerful procedure to remove redundant N-grams from an N-gram set using N-gram statistics information. For example, if both "People's republic of China" and "People's republic" occur 10 times in the corpus, the latter will be removed, for it is the substring of the former. The detail can be found in the following paper: "Statistical Substring Reduction in Linear Time" Lv Xue-qiang, Zhang Le and Hu Junfeng IJCNLP-04, Hai Nan island, P.R.China. 2004 Support Platforms ================= Tested plasforms include: GNU/Linux, FreeBSD, NetBSD, MingW32, Cygwin. NOTICE: If you use these programs on win32 platforms please specify the file name as the last argument of the command line option. All options given after the file name will be ignored! Therefore use `foo -a2 file' instead of `foo file -a2'. Currently the following programs are provided: text2ngram ========================================================================= Extract N-Gram from raw corpus using an improved version of Nagao 1994 algorithm. Optionally the corpus can be parsed and saved to disk (in Unicode) and extract N-grams later. Type text2ngram -h for a short help. Here are some examples: 1. text2ngram -n3 file Extract all word trigrams from a file (words are separated by blank and tab). The output is sent to stdout. 2. text2ngram -n3 -m10 -f5 file Extract word 3 to 10-gram whose frequency >= 5 from a text file and output word N-grams to stdout. 3. text2ngram -c -n3 -m10 -f5 -F gbk -T gbk file The same as above except that this time we extract *character* N-grams (-c option) from a Chinese file encoded in GBK. The output is also GBK. If you do not specify -F gbk or -T gbk option, the input/output encodings are assumed to be UTF-8. 4. text2ngram -o corpus file Parse text file and generate a parsed corpus (corpus.ptable and corpus.ltable) for later word N-gram extraction (using `extractngram' utility). For more on what happens please refer to (Nagao 1994) 5. text2ngram -F gbk -c -o corpus file The same as 4 but prepare for *character* N-gram extraction from a file encoded in GBK (Chinese). extractngram ========================================================================= Extract N-gram from parsed table file generated by `text2ngarm' program. Example: 1. extractngram -n3 -m5 -i corpus will extract word 3-gram, 4-gram and 5-gram from corpus.ngram corpus.ptable and corput.ltable previously saved with `text2ngarm -o corpus' and write results to stdout. if -m5 is omitted, only 3-gram is extracted from ngram table. 2. extractngram -c -T gbk -n3 -m5 -i corpus The same as above example except that this time we extract *character* N-grams (-c option) and output N-grams in GBK encoding. The corpus.* files must be saved with `text2ngarm -c -o corpus' before. strreduction ========================================================================= Implement four Statistical Substring Reduction (SSR) algorithms. The first algorithm has a O(n^2) complexity and is useful-less in real world applications. It is included for completeness. The latter three SSR algorithms all have an ideal time complexity of O(n), so can be used on very large N-gram set. example: 1. strreduction -a2 < input > output Perform word level SSR on the input stream using algorithm 2 (can be 1-4), and output the result to output. The input format is one "word frequency" pair per line. 2. strreduction -a2 -c -F gbk -T gbk < input > output The same as the above example but we use character level SSR this time on Chinese input stream encoded in GBK (-F gbk). Output is GBK too (-T gbk). The input format is one "word frequency" pair per line. 3. strreduction -a2 -c -F gbk -T gbk -s -t -f 3 < input > output The same as the above example but we use a reduction threshold of 3 instead of the default value of 1 (-f 3). The output is also sorted (-s). Finally the SSR processing time is printed on stderr (-t). For a detail introduction on SSR operation please refer to: "Statistical Substring Reduction in Linear Time" Lv Xue-qiang, Zhang Le and Hu Junfeng IJCNLP-04, Hai Nan island, P.R.China. 2004 For building and installation instructions please see the INSTALL file. Author: Zhang Le <ejoy@xinhuanet.com>