Use Apache Hadoop to count the n-grams in a collection of files.
Stephen W. Thomas <sthomas@cs.queensu.ca>
This class uses Hadoop to count the number of occurrences of n-grams in a given set of text files.
An n-gram is a sequence of n adjacent words in a text file. For example, if a file contains the text "one two three", there are three 1-grams ("one", "two", "three"), two 2-grams ("one_two", "two_three"), and one 3-gram ("one_two_three").
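The definition above can be illustrated with a small standalone sketch. It is not the code used by this class: the class name NGramSketch and the method ngrams are hypothetical, and splitting on whitespace is an assumption about how words are delimited. The underscore separator follows the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of the countngrams application.
class NGramSketch {
    // Returns every n-gram of size n found in text, in order of appearance.
    // Words are assumed to be separated by whitespace, and the words of each
    // n-gram are joined with underscores, as in the example above.
    static List<String> ngrams(String text, int n) {
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || n < 1) {
            return result;
        }
        String[] words = trimmed.split("\\s+");
        // Slide a window of n words across the text.
        for (int i = 0; i + n <= words.length; i++) {
            result.add(String.join("_", Arrays.copyOfRange(words, i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + "-grams: " + ngrams("one two three", n));
        }
    }
}
```

Running main prints the three 1-grams, two 2-grams, and one 3-gram listed in the example.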
This class has been tested against Hadoop 1.0.3.
N-grams of sizes 1, 2, 3, 4, and 5 are supported. The third command-line parameter specifies which sizes to include (see below for details).
countngrams input_dir output_dir 1,2,3,4,5?
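As a rough sketch of how a size list such as "1,3,5" could be interpreted, the following standalone Java snippet parses a comma-separated argument and rejects sizes outside the supported range of 1 to 5. The names SizeListSketch and parseSizes are hypothetical, and the actual parsing and error handling in this class may differ.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only; not part of the countngrams application.
class SizeListSketch {
    // Parses a comma-separated list of n-gram sizes, e.g. "1,2,3".
    // Assumes only sizes 1 through 5 are valid, as stated above.
    static Set<Integer> parseSizes(String arg) {
        Set<Integer> sizes = new TreeSet<>();
        for (String part : arg.split(",")) {
            // Integer.parseInt throws NumberFormatException on non-numeric input.
            int n = Integer.parseInt(part.trim());
            if (n < 1 || n > 5) {
                throw new IllegalArgumentException("Unsupported n-gram size: " + n);
            }
            sizes.add(n);
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(parseSizes("1,3,5"));
    }
}
```

A sorted set is used here so that duplicate sizes collapse and the sizes come out in ascending order; that is a design choice of the sketch, not a documented behavior of the application.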
See the HOWTO.txt for instructions to compile and run the application.
This application depends on Hadoop 1.0.3 or variants thereof.
Copyright (C) 2012 by Stephen W. Thomas <sthomas@cs.queensu.ca>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.