cjbayesian/ml4h_paper_2019

Raw text pre-processing

cjbayesian opened this issue · 6 comments

  • Remove references
  • Strip special characters
  • Remove stop words
  • Tokenize

Perhaps use built-ins from nltk?
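The four steps above can be sketched in plain Python; the regex, stopword list, and function name here are illustrative stand-ins (nltk's `stopwords` corpus and `word_tokenize` would be drop-in replacements for the hand-rolled pieces):

```python
import re

# Tiny illustrative stopword list; nltk.corpus.stopwords.words("english")
# would supply a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "are", "et", "al"}

# Matches bracketed citation markers like [12] or (Smith et al., 2018);
# real reference sections would need a more careful pattern.
REFERENCE_RE = re.compile(r"\[\d+\]|\([A-Z][A-Za-z]*(?: et al\.)?,? \d{4}\)")

def preprocess(text):
    """Apply the four cleaning steps: references, special chars, stop words, tokenize."""
    text = REFERENCE_RE.sub(" ", text)        # 1. remove references
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # 2. strip special characters
    tokens = text.lower().split()             # 3. tokenize (whitespace; nltk.word_tokenize is smarter)
    return [t for t in tokens if t not in STOP_WORDS]  # 4. drop stop words

preprocess("Deep learning improves triage [12] (Smith et al., 2018).")
# ['deep', 'learning', 'improves', 'triage']
```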

Maybe better defaults with spaCy, but there probably won't be a substantive difference for what we're doing.

@tnaumann Good thought; I went ahead and added it to the conda environment. 7962ec3

I don't have much experience with it, but from the docs it seems pretty straightforward to build a processing pipeline. @beamandrew & @michaelchughes do you have a toolset in mind for the topic modeling?

I would like the fitted model to be compatible with LDAvis because it's a fun, interactive way to explore topic models, and it would be easy to put it up as a Shiny app to accompany the paper:

https://github.com/cpsievert/LDAvis

So any of the packages listed on that page would be fine, though I've personally had good experience with the lda package.
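Whichever fitter we pick, LDAvis only needs five inputs for `createJSON` — phi (topics × terms), theta (docs × topics), doc.length, vocab, and term.frequency — so compatibility mostly means exporting those. A toy sketch of that assembly (the matrices here are made-up placeholders, not a fitted model):

```python
# Toy stand-ins for a fitted topic model; a real run would take phi and
# theta from the fitter's output. Shapes: phi is topics x terms,
# theta is docs x topics, and each row sums to 1.
vocab = ["patient", "model", "sepsis", "network"]
doc_term_counts = [
    [4, 0, 2, 0],  # doc 1 term counts over vocab
    [0, 3, 0, 5],  # doc 2
]

phi = [
    [0.60, 0.10, 0.25, 0.05],  # topic 1 distribution over vocab
    [0.05, 0.35, 0.10, 0.50],  # topic 2
]
theta = [
    [0.9, 0.1],  # doc 1 distribution over topics
    [0.2, 0.8],  # doc 2
]

# The remaining LDAvis inputs fall straight out of the raw counts.
doc_length = [sum(row) for row in doc_term_counts]
term_frequency = [sum(col) for col in zip(*doc_term_counts)]

ldavis_inputs = {
    "phi": phi,
    "theta": theta,
    "doc.length": doc_length,
    "vocab": vocab,
    "term.frequency": term_frequency,
}
```

Serializing those five objects (e.g. to .rdata or JSON) is then the whole handoff to the visualization side.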

I know we tend to be Pythonistas in ML, but what about using the tidytext package for cleaning?

https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html

Yep, happy to have the topic modelers use their tool of choice. In that case, maybe the acquire and extract phases end at the creation of the well-formatted and organized .txt files and their associated metadata?
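For that handoff, one hypothetical layout (the file and column names are illustrative, nothing here is settled) would pair one cleaned .txt per paper with a single metadata CSV keying each file to its bibliographic fields:

```python
import csv
import os
import tempfile

# Illustrative handoff format: one cleaned .txt per paper plus a
# metadata.csv linking each text file to its title and year.
papers = [
    {"id": "paper_001", "title": "Deep learning for triage", "year": "2018",
     "text": "cleaned body text of paper one"},
    {"id": "paper_002", "title": "Sepsis prediction models", "year": "2019",
     "text": "cleaned body text of paper two"},
]

out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "metadata.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "year", "txt_file"])
    writer.writeheader()
    for p in papers:
        txt_name = p["id"] + ".txt"
        # Write the cleaned text alongside the metadata row that points to it.
        with open(os.path.join(out_dir, txt_name), "w") as txt:
            txt.write(p["text"])
        writer.writerow({"id": p["id"], "title": p["title"],
                         "year": p["year"], "txt_file": txt_name})

print(sorted(os.listdir(out_dir)))
```

Either side (tidytext in R, or nltk/spaCy in Python) can consume that directory without caring what produced it.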

Added a script to build datasets compatible with LDAvis. See https://github.com/cjbayesian/ml4h_paper_2019/blob/1dfef5e392a94a021effeff726baad8f0a78f9d8/create_r_dataset.R
I also uploaded these .rdata files to our drive folder.