This repository contains the source code and data used in the KDD 2019 paper "Mining Algorithm Roadmap in Scientific Publications". An online demo is available at http://fts.cs.ucsb.edu/roadmap.
Python 3.6
pytorch 1.0
sqlite3
pdffigures, https://github.com/allenai/pdffigures
Other Python requirements: pip install -r requirements.txt
The code in the preprocess folder prepares the data needed to construct the Algorithm Roadmap. The experiments use the NIPS, ACL and VLDB datasets.
cd preprocess/
Set the correct data path, then use the following script to run pdftotext over all PDF files in batch.
python convert2txt_multithread.py ${dataset}
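As a rough illustration of what a multithreaded batch conversion can look like, the sketch below runs the pdftotext command-line tool over every PDF in a folder using a thread pool. It is not the code in convert2txt_multithread.py; the function names, the directory layout, and the default of four workers are all assumptions made for the example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def pdftotext(pdf_path, txt_path):
    """Convert one PDF to plain text by calling the pdftotext command-line tool."""
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)


def convert_all(pdf_dir, txt_dir, convert=pdftotext, workers=4):
    """Convert every *.pdf in pdf_dir into a .txt file with the same stem in txt_dir.

    Returns a dict mapping each PDF file name to True (converted) or False (failed).
    The directory layout and worker count are illustrative assumptions.
    """
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    txt_dir.mkdir(parents=True, exist_ok=True)
    jobs = [(pdf, txt_dir / (pdf.stem + ".txt")) for pdf in sorted(pdf_dir.glob("*.pdf"))]

    def run(job):
        pdf, txt = job
        try:
            convert(pdf, txt)
            return pdf.name, True
        except (subprocess.CalledProcessError, OSError):
            # One unreadable PDF should not abort the whole batch.
            return pdf.name, False

    # Threads are sufficient here because the work happens in the external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(run, jobs))
```

Returning a per-file status makes it easy to list the PDFs that failed and retry or skip them before the cleaning step.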
Clean the converted txt files, e.g. by removing non-ASCII characters.
python textcleaner.py ${dataset}
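A minimal sketch of this kind of cleaning is shown below. It is an illustration rather than the logic of textcleaner.py: the function name clean_text, the Unicode normalization step, and the whitespace handling are assumptions.

```python
import re
import unicodedata


def clean_text(text):
    """Return an ASCII-only version of text with normalized whitespace."""
    # Decompose accented letters and ligatures first so their ASCII base
    # characters survive the next step (common in text extracted from PDFs).
    text = unicodedata.normalize("NFKD", text)
    # Drop every character that cannot be represented in ASCII.
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of spaces and tabs, then trim each line.
    text = re.sub(r"[ \t]+", " ", text)
    return "\n".join(line.strip() for line in text.splitlines())
```

Normalizing before dropping characters keeps words such as "naive" readable instead of losing the accented letter entirely.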
Use pdffigures to extract tables from the PDF files; the tables are used later to extract weak supervision.
python pdf2figure.py tablejson ${dataset}
Store the corpus in a sqlite3 database for later instance collection.
python create_db.py ${dataset}
python concat_docs_in_one_file.py ${dataset}
python pretrain_wordvec.py ${dataset}
python build_vocab.py ${dataset}
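To make the corpus storage step more concrete, here is a hedged sketch of putting documents in a sqlite3 database and querying them. The table name docs, its columns, and both helper functions are hypothetical and do not reflect the schema created by create_db.py.

```python
import sqlite3


def create_corpus_db(path, docs):
    """Create a documents table and fill it from (doc_id, body) pairs."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (doc_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    # INSERT OR REPLACE keeps the script re-runnable on the same database file.
    conn.executemany("INSERT OR REPLACE INTO docs (doc_id, body) VALUES (?, ?)", docs)
    conn.commit()
    return conn


def docs_mentioning(conn, term):
    """Return ids of documents whose body contains term (case-sensitive substring)."""
    rows = conn.execute(
        "SELECT doc_id FROM docs WHERE instr(body, ?) > 0 ORDER BY doc_id", (term,)
    )
    return [doc_id for (doc_id,) in rows]
```

For example, `docs_mentioning(conn, "LSTM")` would list the papers from which instances mentioning that acronym could be collected.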
Parse the table JSON files to extract weak supervision.
python parse_table.py ${dataset}
Save co-occurring acronyms, which are used as a baseline method in the experiments.
python save_coocurred_abbvs.py ${dataset}
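The idea of counting co-occurring acronyms can be sketched as follows. This is an assumption-laden illustration, not the repository's implementation: the regular expression used to detect acronyms, the sentence-level co-occurrence window, and the function name are all chosen for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: an acronym is a standalone token of two or more uppercase letters.
ACRONYM = re.compile(r"\b[A-Z]{2,}\b")


def cooccurring_acronyms(sentences):
    """Count how often each unordered pair of acronyms appears in the same sentence."""
    counts = Counter()
    for sentence in sentences:
        # Deduplicate and sort so each pair has one canonical (a, b) form.
        found = sorted(set(ACRONYM.findall(sentence)))
        counts.update(combinations(found, 2))
    return counts
```

Pairs with high counts could then be ranked as candidate related algorithms, which is one simple way a co-occurrence baseline might work.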
Truncate long paragraphs and long sentences, then label, pad, and split the data for the final relation extraction step. Instances are prepared for acronym pairs; ${feature_type} is either standard (single-sentence instances) or paragraph (cross-sentence instances).
python parse_abbv_in_sents.py ${dataset}
python prepare_final_data.py ${dataset} ${feature_type}
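The truncation, padding, and splitting steps described above can be illustrated with a short sketch. The maximum length, the padding id of 0, the 80/10/10 split ratios, and the function names are assumptions for the example and are not taken from prepare_final_data.py.

```python
import random


def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut a sequence to max_len tokens, or right-pad it with pad_id up to max_len."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def split_instances(instances, train=0.8, dev=0.1, seed=0):
    """Shuffle instances reproducibly and split them into train/dev/test lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    # Whatever remains after the train and dev portions becomes the test set.
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )
```

Fixing the seed keeps the split identical across runs, so training and evaluation see the same partitions.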
The pre-processed NIPS, ACL and VLDB datasets (in their state before running prepare_final_data.py) can be downloaded here:
https://drive.google.com/open?id=1R00G5xP141SO5oCt9zL8-COsX5G2H8dy
Train our CANTOR relation extraction model.
python train_cantor.py ${dataset} cantor ${feature_type}
Evaluate the model and save its predictions.
python model_eval.py ${dataset} cantor ${feature_type}
Construct the algorithm roadmap for a query.
python construct_roadmap.py
If you find this repository useful, please cite the following:
H. Zha, W. Chen, K. Li, X. Yan, Mining Algorithm Roadmap in Scientific Publications
@inproceedings{zha2019mining,
title={Mining Algorithm Roadmap in Scientific Publications},
author={Hanwen Zha and Wenhu Chen and Keqian Li and Xifeng Yan},
booktitle={Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery \& Data Mining, {KDD}},
year={2019}
}