Code for vmw-thu-hackathon2019, problem 5. The original data are stored in /patent_doc.

Run crawler.py to fetch data from PatFT and store it in /data, model.py to train the models, and model_selection.py to select the LDA model. model_selection.py should be run after the data have been collected and the texts have been split by model.py. citation_graph.py generates the citation graphs.

Trained models are stored in /models, and visualization results can be found in /visualization.
Analysis of patent structures in the AI field
Fan Xiaoxiong, Li Yuze, Ma Xiaoyang, You Yuyang
Analyze the given dataset of AI patents and find citations and dependencies among the patents. Citations can be retrieved from the official USPTO website, but dependencies must be inferred by analyzing topics and measuring similarities between patents. Visualize the results.
The original data are stored in /patent_doc as PDF files whose file names are patent numbers. We obtain patent information by crawling the Patent Full-Text Database (PatFT) and store it in /data as JSON files, as sketched below.
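For reference, here is a minimal sketch of the crawling step. The endpoint URL, the query parameter name, and the parsing are placeholders and assumptions, not necessarily what crawler.py actually uses:

```python
import json
import os

import requests

# Placeholder for the PatFT full-text query URL; the exact endpoint and
# query parameters used by crawler.py are not reproduced here.
PATFT_URL = "https://patft.uspto.gov/netacgi/nph-Parser"

def fetch_patent(patent_number: str) -> dict:
    """Fetch one patent record from PatFT (field extraction elided)."""
    resp = requests.get(PATFT_URL, params={"patent_number": patent_number},
                        timeout=30)
    resp.raise_for_status()
    # crawler.py extracts title, abstract, claims, and cited patents from
    # the returned page; here we just keep the raw text.
    return {"number": patent_number, "raw": resp.text}

def main() -> None:
    os.makedirs("data", exist_ok=True)
    # Patent numbers come from the PDF file names in /patent_doc.
    numbers = [os.path.splitext(name)[0] for name in os.listdir("patent_doc")]
    for number in numbers:
        record = fetch_patent(number)
        with open(os.path.join("data", number + ".json"), "w") as fh:
            json.dump(record, fh)

if __name__ == "__main__":
    main()
```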
The whole project can be separated into four parts: data collection, data preprocessing, model training and selection, and visualization.
- Data Collection: crawl the necessary information from PatFT.
- Data Preprocessing: split the texts with regular expressions, remove stopwords, generate bigram/trigram models, and transform the corpus into TF-IDF form (see the sketch after this list).
- Model Training and Selection: use gensim to train LDA and HDP models and select by coherence score; we then use a joint model of both.
- Visualization: use pyLDAvis and networkx to generate graphs.
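Below is a minimal sketch of the preprocessing and modelling steps, assuming a small in-memory `docs` list as a stand-in for the texts built from /data; the stopword list and all hyperparameters (`min_count`, `threshold`, `num_topics`) are illustrative, not the values used by model.py and model_selection.py:

```python
import re

from gensim import corpora, models
from gensim.matutils import hellinger

# Stand-in documents; the project builds these from the JSON files in /data.
docs = [
    "a neural network method for image recognition",
    "an image recognition system using neural network models",
    "a database index structure for fast lookup",
]
stopwords = {"a", "an", "the", "for", "of", "and", "using"}  # toy list

# Regex tokenization and stopword removal.
texts = [
    [tok for tok in re.findall(r"[a-z]+", doc.lower()) if tok not in stopwords]
    for doc in docs
]

# Bigram detection (applying Phrases a second time would yield trigrams).
bigram = models.Phrases(texts, min_count=1, threshold=1.0)
texts = [bigram[t] for t in texts]

# Bag-of-words corpus and TF-IDF transform.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = [tfidf[bow] for bow in corpus]

# Train LDA and HDP, then compare them by c_v coherence score.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, random_state=0)
hdp = models.HdpModel(corpus, id2word=dictionary)
for name, model in (("LDA", lda), ("HDP", hdp)):
    cm = models.CoherenceModel(model=model, texts=texts,
                               dictionary=dictionary, coherence="c_v")
    print(name, "coherence:", cm.get_coherence())

# Dependencies between patents can then be scored by the similarity of
# their topic mixtures, e.g. 1 - Hellinger distance.
dist_a = lda.get_document_topics(corpus[0], minimum_probability=0.0)
dist_b = lda.get_document_topics(corpus[1], minimum_probability=0.0)
print("similarity:", 1.0 - hellinger(dist_a, dist_b))
```

Gensim's CoherenceModel accepts both LdaModel and HdpModel, which makes coherence-based selection across the two model families straightforward.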
For more details, see Report.docx.
Visualized results are stored in /visualization.
- Citation graphs: these show the citations and relationships (prior art) between the given patents.
- Similarity graphs: similarities are encoded by edge color; darker edges indicate higher similarity.
- Topic visualization: a visualization of the trained LDA model (see the sketch after this list).
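A minimal sketch of the graph drawing with networkx, assuming toy citation pairs and similarity scores; the patent numbers here are hypothetical, and citation_graph.py derives the real ones from the crawled data:

```python
import matplotlib.pyplot as plt
import networkx as nx

# Hypothetical citation pairs (citing -> cited) and similarity scores;
# the project derives these from the crawled JSON in /data.
citations = [("US10000001", "US9000001"), ("US10000002", "US9000001")]
similarities = {("US10000001", "US10000002"): 0.9,
                ("US10000002", "US9000001"): 0.4}

G = nx.DiGraph()
G.add_edges_from(citations)
for (u, v), s in similarities.items():
    G.add_edge(u, v, weight=s)

pos = nx.spring_layout(G, seed=0)
# Darker edges mark higher similarity; edges without a score get a light default.
weights = [G[u][v].get("weight", 0.2) for u, v in G.edges()]
nx.draw_networkx(G, pos, edge_color=weights, edge_cmap=plt.cm.Blues,
                 edge_vmin=0.0, edge_vmax=1.0, node_size=500, font_size=6)
plt.axis("off")
plt.savefig("citation_graph_example.png")
```

The topic visualization can be produced with pyLDAvis, assuming the `lda`, `corpus`, and `dictionary` objects from the modelling sketch above:

```python
import pyLDAvis
import pyLDAvis.gensim_models  # pyLDAvis >= 3.x; older versions use pyLDAvis.gensim

vis = pyLDAvis.gensim_models.prepare(lda, corpus, dictionary)
with open("visualization/lda.html", "w") as fh:
    pyLDAvis.save_html(vis, fh)
```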
The two models achieve coherence scores of 0.55 and 0.71.