# CorpusProcessing-Hadoop-Multir

This is a collection of Hadoop jobs I used to process a Corpus of ~1M documents for input to Multir training.

### /parsing

This component works by taking in a set of tokenized sentences in the input format below.

Input Format:

```
SentenceID \t Token1 Token2 Token3 ......\n
1 \t This is an example sentence . \n
2 \t This is another example sentence . \n
```

It runs the Charniak-Johnson parser (https://github.com/BLLIP/bllip-parser) on a Hadoop cluster via Hadoop Streaming (http://hadoop.apache.org/docs/r1.1.2/streaming.html).

To use the Charniak-Johnson parser with Hadoop Streaming, you must package the parser directory into a jar and put that jar on HDFS. Then, when you run the hadoop-streaming application, specify the jar's HDFS location with the -archives option; Hadoop will extract the jar's contents into a directory with the same name as the original jar.

  • /parsing/cjmapper.py is the mapper that calls the Charniak-Johnson executable and handles the I/O.
  • /parsing/run.sh describes how to write the hadoop-streaming command to run the application (an illustrative invocation is sketched below).
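
run.sh is the authoritative reference for the exact command; purely as an illustrative sketch (the streaming jar location, parser jar name, HDFS paths, and number of reducers below are placeholders, not the values actually used), the invocation could look something like this:

```
hadoop jar /path/to/hadoop-streaming.jar \
    -D mapred.reduce.tasks=0 \
    -files /local/path/to/cjmapper.py \
    -archives hdfs:///path/to/cjparser.jar#cjparser \
    -mapper "python cjmapper.py" \
    -input /hdfs/path/to/tokenized/sentences \
    -output /hdfs/path/to/parses
```

The -archives entry is what makes the parser directory (packaged into a jar as described above) available to each mapper; the optional #name suffix controls the name of the directory it is unpacked into.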

### /pre-parse-processing

The pre-parse processing takes in the following information for each sentence:

  • Document Name
  • Tokens
  • Token Offsets
  • Sentence string
  • Sentence offsets

The input is in the following format:

```
SentID \t DocumentName \t String of Tokens \t String Of Token Offsets \t Sentence Offsets \t Sentence String
1 \t AFPXXX.txt \t Today 's example of a token string \t 0:5 5:7 8:15 16:18 19:20 21:26 27:33 \t 100 133 \t Today's example of a token string
```

It outputs the following information for each sentence (a sketch of the CoreNLP calls that produce these tags follows the list):

  • Part Of Speech Tags
  • Named Entity Recognition Tags
  • Lemmatized tokens
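
The tagging is done with Stanford CoreNLP (the jars listed below); the actual job logic lives in src/hadoop/PreParseProcessor.java. Purely as a simplified sketch of the kind of CoreNLP calls involved (the class name and hard-coded sentence are illustrative, not the real mapper code), tagging a single sentence could look like this:

```
import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

// Illustrative sketch only: pulls POS, NER, and lemma tags out of CoreNLP for one
// sentence. The real PreParseProcessor wraps this kind of logic in a Hadoop job.
public class SentenceTaggerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Annotators needed to produce POS tags, lemmas, and NER tags.
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation doc = new Annotation("Today 's example of a token string");
        pipeline.annotate(doc);

        for (CoreMap sentence : doc.get(CoreAnnotations.SentencesAnnotation.class)) {
            for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                String word = token.get(CoreAnnotations.TextAnnotation.class);
                String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                String ner = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
                String lemma = token.get(CoreAnnotations.LemmaAnnotation.class);
                System.out.println(word + "\t" + pos + "\t" + ner + "\t" + lemma);
            }
        }
    }
}
```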

To run this module you should have the following jars:

  • hadoop-core-0.20.2-cdh3u1.jar
  • stanford-corenlp-1.3.5.jar
  • stanford-corenlp-1.3.5-models.jar
  • joda-time-2.1.jar

From the /pre-parse-processing directory, follow these instructions to run the Hadoop job:

```
mkdir classFolder

javac -source 1.6 -target 1.6 -classpath /path/to/stanford-corenlp-1.3.5.jar:/path/to/org.apache.hadoop/hadoop-core/jars/hadoop-core-0.20.2-cdh3u1.jar:/path/to/stanford-corenlp-1.3.5-models.jar:/path/to/joda-time-2.1.jar -d classFolder src/hadoop/PreParseProcessor.java
jar -cvf preparseprocessor.jar -C classFolder/ .
hadoop jar preparseprocessor.jar hadoop.PreParseProcessor -libjars /path/to/stanford-corenlp-1.3.5.jar,/path/to/org.apache.hadoop/hadoop-core/jars/hadoop-core-0.20.2-cdh3u1.jar,/path/to/stanford-corenlp-1.3.5-models.jar,/path/to/joda-time-2.1.jar /hdfs/path/to/input /hdfs/path/to/output 100 12
```

The last two arguments are the number of map tasks (100) and reduce tasks (12).

### /post-parse-processing

The post-parse processing takes in the following information for each sentence:

  • Parse String

The input is in the following format:

```
SentenceID \t parseString
0 \t (S1 (NP (NP (NNP BEIJING)) (, ,) (NP (NNP Oct.) (CD 28)) (PRN (-LRB- -LRB-) (NP (NNP Xinhua)) (-RRB- -RRB-))))
```

It outputs the following information for each sentence (a sketch of the dependency conversion follows the list):

  • Dependency Parse
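
The conversion uses the Stanford parser classes shipped in stanford-corenlp (listed below); the actual job logic lives in src/hadoop/PostParseProcessor.java. Purely as a simplified sketch of that conversion for a single parse string (the class name and hard-coded input are illustrative, not the real job code), it could look like this:

```
import java.util.Collection;

import edu.stanford.nlp.trees.GrammaticalStructure;
import edu.stanford.nlp.trees.GrammaticalStructureFactory;
import edu.stanford.nlp.trees.PennTreebankLanguagePack;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreebankLanguagePack;
import edu.stanford.nlp.trees.TypedDependency;

// Illustrative sketch only: converts one bracketed parse string into Stanford typed
// dependencies. The real PostParseProcessor wraps this kind of logic in a Hadoop job.
public class DependencyConverterSketch {
    public static void main(String[] args) {
        String parseString = "(S1 (NP (NP (NNP BEIJING)) (, ,) (NP (NNP Oct.) (CD 28))"
                + " (PRN (-LRB- -LRB-) (NP (NNP Xinhua)) (-RRB- -RRB-))))";

        // Read the bracketed parse back into a Tree object.
        Tree tree = Tree.valueOf(parseString);

        // Convert the constituency tree into typed dependencies
        // (this sketch uses the collapsed, CC-processed variant).
        TreebankLanguagePack tlp = new PennTreebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(tree);
        Collection<TypedDependency> dependencies = gs.typedDependenciesCCprocessed();

        for (TypedDependency dep : dependencies) {
            System.out.println(dep);
        }
    }
}
```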

To run this module you should have the following jars:

  • hadoop-core-0.20.2-cdh3u1.jar
  • stanford-corenlp-1.3.5.jar

From the /post-parse-processing directory, follow these instructions to run the Hadoop job:

```
mkdir classFolder

javac -source 1.6 -target 1.6 -classpath /path/to/stanford-corenlp-1.3.5.jar:/path/to/org.apache.hadoop/hadoop-core/jars/hadoop-core-0.20.2-cdh3u1.jar -d classFolder src/hadoop/PostParseProcessor.java
jar -cvf postparseprocessor.jar -C classFolder/ .
hadoop jar postparseprocessor.jar hadoop.PostParseProcessor -libjars /path/to/stanford-corenlp-1.3.5.jar,/path/to/org.apache.hadoop/hadoop-core/jars/hadoop-core-0.20.2-cdh3u1.jar /hdfs/path/to/input /hdfs/path/to/output 300
```

The final argument is the number of map tasks (300).