Text Mining for Hidden Relations and Trending
Directory Description
data/robot_robotic/patents.csv -> table for the parsed patents' meta-data data/robot_robotic/raw -> the original parsed text files data/robot_robotic/tm_xml -> the topic modeling output in XML formatdata/documentation/proposal -> the proposal tex and pdf files data/documentation/report -> see final2.pdf for the finalized version
data/src -> the source code we wrote for this project, which is explained below
Crawler
/src/crawler -> Java code used to scrape down the patents data > java URLGenerator > robot_robotic_url.txt // generate the patent url for parsing > java Crawler robot_robotic_url.txt // generates the raw text data as seen in /data/robot_robotic/raw
OLAP
/src/OLAP -> Java code used to slice the original raw patent text by each year > java Slicer /data/robot_robotic/ // generate sub directory by year as seen in this folder
Mallet
/src/mallet -> Java code used to generate mallet commands for LDA > java MalletRunner > toRun.txt // outputs the necessary commands to run LDA for all years
Parsing Mallet XML output
/src/preprocessing -> Java code used to parse the XML output from the Topic Modeling package of MALLET > java XMLParser /data/robot_robotic/tm_xml // generate csv file for each year, with row as year-topic and column as the word vector
Thematic Particle Clustering (TPC)
/src/TPC -> Matlab code used to run all the experiments we've run for TPC Simply uncomment the corresponding commands in the code and run particleClustering() in Matlab
Topic Convergence Graph (TCG)
Included files: tm_results: is a directory with all the topics from 1981 to 2013 MyDoc.java: wrapper class for documents Tuple.java: wrapper class for two data pieces at a time Topic.java wrapper class for holding word distributions XMLParser: Main class where most of the work was done. Reads in documents. Make topics. Parses topics from year to year. Prints out paths topics took to STDIO.Tree Convergence Graph Code Base The java code for TCG just spits out data that was used in the included TCG_graph.xlsx
To run the java code run the main method of XMLParser.java with 1 command line argument: /PATH/TO/tm_results tm_results is included with this turnin This code was created in eclipse, as such I do not have the linux command line to run these file.