This tool extracts RDF triples from the Code of Federal Regulations. It was developed for the Cornell Legal Information Institute.
Run ant jar to compile the project and generate VocabularyExtraction.jar in the dist directory.
If you are using Eclipse, you can also import the project directly, then add everything in the lib directory to the build path.
java -Xms3072M -Xmx3072M -Dcornell.datasets.dir=/path/to/datasets/ -jar VocabularyExtraction.jar /path/to/xml/directory /path/to/rdf/output
If you would like to use the CoreNLP parser, append -useStanfordParser to the command above.
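As a rough sketch of how the run command and the optional parser flag fit together, the helper below prints the full command line so it can be inspected before running. The function name cfr_extract_cmd and its argument order are hypothetical and not part of this project; only the java invocation and the -useStanfordParser flag come from the instructions above.

```shell
# Hypothetical helper (not shipped with the project): prints the java
# command for one extraction run without executing it.
# Arguments: datasets dir, XML input dir, RDF output dir, optional parser flag.
# Assumes the paths contain no whitespace, since the output is a plain string.
cfr_extract_cmd() {
  datasets_dir=$1
  xml_dir=$2
  rdf_out=$3
  parser_flag=${4:-}   # pass -useStanfordParser to select the CoreNLP parser

  printf 'java -Xms3072M -Xmx3072M -Dcornell.datasets.dir=%s -jar VocabularyExtraction.jar %s %s' \
    "$datasets_dir" "$xml_dir" "$rdf_out"
  if [ -n "$parser_flag" ]; then
    printf ' %s' "$parser_flag"
  fi
  printf '\n'
}
```

Calling cfr_extract_cmd /path/to/datasets/ /path/to/xml/directory /path/to/rdf/output -useStanfordParser prints the command with the flag appended; leaving out the fourth argument prints the plain command.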
For more comprehensive information, please refer to the JavaDoc for the Runner class.
This program can also be run as a Hadoop job. Note that you will almost certainly want to use the OpenNLP parser; the CoreNLP parser requires 3 GB of memory on each node, and you will probably run out.
- Export a runnable JAR file (instructions and Ant script coming soon)
- Upload the JAR and input files to Elastic MapReduce, or run locally like so:
export HADOOP_OPTS="-XX:+UseParallelGC -mx8g"
hadoop fs -copyFromLocal /path/to/stanford-corenlp-1.3.4-models.jar /tmp/cfr/preprocessor/models.jar
hadoop fs -copyFromLocal /path/to/agencies.txt /tmp/cfr/preprocessor/agencies.txt
hadoop jar preprocessor.jar /path/to/input/files /path/to/output/files -agencies /tmp/cfr/preprocessor/agencies.txt -resolvePronouns
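The local Hadoop steps above could be wrapped in a small script along these lines. This is a sketch under assumptions: the function names run and stage_and_run and the DRY_RUN switch are hypothetical, while the hadoop commands, HDFS paths, and flags are copied from the steps above.

```shell
# Hypothetical wrapper (not part of the project) around the local Hadoop steps.
# Set DRY_RUN=1 to print each command instead of executing it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

stage_and_run() {
  models_jar=$1   # local path to stanford-corenlp-1.3.4-models.jar
  agencies=$2     # local path to agencies.txt
  input=$3        # input files for the job
  output=$4       # output location for the job
  hdfs_dir=/tmp/cfr/preprocessor

  export HADOOP_OPTS="-XX:+UseParallelGC -mx8g"

  # Stage the model JAR and the agency list, stopping if either copy fails.
  run hadoop fs -copyFromLocal "$models_jar" "$hdfs_dir/models.jar" || return 1
  run hadoop fs -copyFromLocal "$agencies" "$hdfs_dir/agencies.txt" || return 1

  # Launch the job with the same flags as the manual command.
  run hadoop jar preprocessor.jar "$input" "$output" \
    -agencies "$hdfs_dir/agencies.txt" -resolvePronouns
}
```

Running it once with DRY_RUN=1 shows the three commands that would be executed, which is a cheap way to check the paths before touching HDFS.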