Using Anaconda:
conda env create -f environment.yml
source activate ml4h_papers
jupyter notebook
- Get the metadata for all papers:
NeurIPS_Proceedings_Metadata.ipynb
- Download the PDFs:
NeurIPS_Proceedings_dl_pdfs.ipynb
- Extract raw text from the PDFs:
python pdf2text.py ./data/ML4H2018/pdf/ ./data/ML4H2018/txt/
python pdf2text.py ./data/NeurIPS2018/pdf/ ./data/NeurIPS2018/txt/
- Pre-process raw text
  - Run text pre-processing:
    NeurIPS_Proceedings_preprocess.ipynb
  - OR run the create_r_dataset.R script to generate .rdata files
- Optional: run the example topic model and visualization script ldavis.R.
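The text-extraction step above takes an input directory of PDFs and an output directory for text files. The source of pdf2text.py is not shown here, so the following is only a minimal sketch of how such a step could work, assuming it calls the pdftotext command-line tool once per file. The function names `plan_conversions` and `convert_all` are hypothetical and are not taken from the repository.

```python
import subprocess
from pathlib import Path


def plan_conversions(pdf_dir, txt_dir):
    """Pair every PDF in pdf_dir with the .txt path it would be written to."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    # Sort so that the processing order is stable between runs.
    return [(pdf, txt_dir / (pdf.stem + ".txt"))
            for pdf in sorted(pdf_dir.glob("*.pdf"))]


def convert_all(pdf_dir, txt_dir):
    """Run pdftotext on each PDF (assumes pdftotext is on the PATH)."""
    Path(txt_dir).mkdir(parents=True, exist_ok=True)
    for pdf, txt in plan_conversions(pdf_dir, txt_dir):
        # pdftotext takes the input PDF and the output text file as arguments.
        # check=True raises CalledProcessError if the conversion fails.
        subprocess.run(["pdftotext", str(pdf), str(txt)], check=True)
```

Under these assumptions, `convert_all("./data/ML4H2018/pdf/", "./data/ML4H2018/txt/")` would mirror the first pdf2text.py command listed above.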
Requirements:
- Python >= 3.7
- requests
- beautifulsoup4
- pandas
- json
- tm
- LDAvis
- lda
- pdftotext version 0.64.0
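Before running the notebooks, it can help to confirm that the Python-side requirements above are available. The sketch below is not part of the repository, and the function name `missing_requirements` is hypothetical. It assumes that tm, LDAvis and lda are R packages used by the R scripts, so it does not check them, and it only checks that pdftotext is on the PATH, not its version.

```python
import importlib.util
import shutil
import sys

# Import names for the Python packages listed above. Note that
# beautifulsoup4 is imported as bs4, and json ships with Python.
PYTHON_MODULES = ["requests", "bs4", "pandas", "json"]


def missing_requirements(modules=PYTHON_MODULES, tools=("pdftotext",)):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if sys.version_info < (3, 7):
        problems.append("Python >= 3.7 is required")
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing Python module: {name}")
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        if shutil.which(tool) is None:
            problems.append(f"missing command-line tool: {tool}")
    return problems
```

Calling `missing_requirements()` inside the activated environment returns a list of human-readable problems to fix before starting.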