cjbayesian/ml4h_paper_2019

Data pipeline for analysis of the ML4H 2018 workshop at NeurIPS.

ML4H 2018 NeurIPS paper analysis

Pipeline:

Using Anaconda, create and activate the environment, then start Jupyter:

conda env create -f environment.yml
source activate ml4h_papers
jupyter notebook

Get the metadata for all papers: run NeurIPS_Proceedings_Metadata.ipynb
Download the PDFs: run NeurIPS_Proceedings_dl_pdfs.ipynb
Extract raw text from the PDFs:

python pdf2text.py ./data/ML4H2018/pdf/ ./data/ML4H2018/txt/
python pdf2text.py ./data/NeurIPS2018/pdf/ ./data/NeurIPS2018/txt/

Pre-process the raw text, using either of the following:
- Run the text pre-processing notebook NeurIPS_Proceedings_preprocess.ipynb, OR
- Run the create_r_dataset.R script to generate .rdata files
Optional: run the example topic model and visualization script, ldavis.R.
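
The text-extraction step above amounts to a loop over a directory of PDFs. The following is a minimal sketch of that idea, not the repository's pdf2text.py, whose contents are not shown here. It assumes the pdftotext command-line tool is on the PATH and is called with an input PDF and an output text file as positional arguments. The function names plan_conversions and convert_all are hypothetical.

```python
import subprocess
from pathlib import Path


def plan_conversions(pdf_dir, txt_dir):
    """Pair every PDF in pdf_dir with the .txt path it should be written to."""
    pdf_dir, txt_dir = Path(pdf_dir), Path(txt_dir)
    # Sort so the processing order is stable from run to run.
    return [(pdf, txt_dir / (pdf.stem + ".txt")) for pdf in sorted(pdf_dir.glob("*.pdf"))]


def convert_all(pdf_dir, txt_dir):
    """Run pdftotext once per PDF and return the names of files that failed."""
    Path(txt_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf, txt in plan_conversions(pdf_dir, txt_dir):
        # Assumption: pdftotext is installed and takes <input.pdf> <output.txt>.
        result = subprocess.run(["pdftotext", str(pdf), str(txt)])
        if result.returncode != 0:
            failed.append(pdf.name)
    return failed
```

Under these assumptions, calling convert_all("./data/ML4H2018/pdf/", "./data/ML4H2018/txt/") would cover the same directories as the first pdf2text.py command. Returning the failed file names, rather than stopping at the first error, keeps one unreadable PDF from blocking the rest of the batch.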

Requires

Python

Python >= 3.7
requests
beautifulsoup4
pandas
json (part of the Python standard library; no separate install needed)

R

tm
LDAvis
lda

Other

pdftotext version 0.64.0
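
Because the extracted text can differ between pdftotext releases, it may help to confirm which version is installed before running the pipeline. The sketch below is one way to do that. It assumes that running pdftotext -v prints a banner of the form shown above ("pdftotext version 0.64.0"). Both function names are hypothetical and are not part of this repository.

```python
import re
import subprocess


def parse_pdftotext_version(banner):
    """Extract the version from a 'pdftotext version X.Y.Z' banner as a tuple of ints."""
    match = re.search(r"pdftotext version (\d+(?:\.\d+)*)", banner)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def installed_pdftotext_version():
    """Ask the pdftotext binary on PATH for its version; None if it is not installed."""
    try:
        result = subprocess.run(["pdftotext", "-v"], capture_output=True, text=True)
    except FileNotFoundError:
        return None
    # Search both streams, since the banner may go to stderr rather than stdout.
    return parse_pdftotext_version(result.stdout + result.stderr)
```

Comparing the result of installed_pdftotext_version() against (0, 64, 0) shows whether the local install matches the version listed here.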