cjbayesian/ml4h_paper_2019

Raw text pre-processing

cjbayesian opened this issue · 6 comments

  • Remove references
  • Strip special characters
  • Remove stop words
  • Tokenize

Perhaps use built-ins from nltk?
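The four steps above can be sketched in plain Python; the regex, stopword list, and function name here are illustrative stand-ins (nltk's `stopwords` corpus and `word_tokenize` would be drop-in replacements for the hand-rolled pieces):

```python
import re

# Tiny illustrative stopword list; nltk.corpus.stopwords.words("english")
# would supply a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "is", "are", "et", "al"}

# Matches bracketed citation markers like [12] or (Smith et al., 2018);
# real reference sections would need a more careful pattern.
REFERENCE_RE = re.compile(r"\[\d+\]|\([A-Z][A-Za-z]*(?: et al\.)?,? \d{4}\)")

def preprocess(text):
    """Apply the four cleaning steps: references, special chars, stop words, tokenize."""
    text = REFERENCE_RE.sub(" ", text)        # 1. remove references
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # 2. strip special characters
    tokens = text.lower().split()             # 3. tokenize (whitespace; nltk.word_tokenize is smarter)
    return [t for t in tokens if t not in STOP_WORDS]  # 4. drop stop words

preprocess("Deep learning improves triage [12] (Smith et al., 2018).")
# ['deep', 'learning', 'improves', 'triage']
```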

Maybe better defaults with spaCy, but there probably won't be a substantive difference for what we're doing.

@tnaumann Good thought; I went ahead and added it to the conda environment. 7962ec3

I don't have much experience with it, but from the docs it seems pretty straightforward to build a processing pipeline. @beamandrew & @michaelchughes do you have a toolset in mind for the topic modeling?

I would like the fitted model to be compatible with LDAvis because it's a fun, interactive way to explore topic models, and it would be easy to put it up as a Shiny app to accompany the paper:

https://github.com/cpsievert/LDAvis

So any of the packages listed on that page would be fine, though I've personally had good experience with the lda package.
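Whichever fitter we pick, LDAvis only needs five inputs for `createJSON` — phi (topics × terms), theta (docs × topics), doc.length, vocab, and term.frequency — so compatibility mostly means exporting those. A toy sketch of that assembly (the matrices here are made-up placeholders, not a fitted model):

```python
# Toy stand-ins for a fitted topic model; a real run would take phi and
# theta from the fitter's output. Shapes: phi is topics x terms,
# theta is docs x topics, and each row sums to 1.
vocab = ["patient", "model", "sepsis", "network"]
doc_term_counts = [
    [4, 0, 2, 0],  # doc 1 term counts over vocab
    [0, 3, 0, 5],  # doc 2
]

phi = [
    [0.60, 0.10, 0.25, 0.05],  # topic 1 distribution over vocab
    [0.05, 0.35, 0.10, 0.50],  # topic 2
]
theta = [
    [0.9, 0.1],  # doc 1 distribution over topics
    [0.2, 0.8],  # doc 2
]

# The remaining LDAvis inputs fall straight out of the raw counts.
doc_length = [sum(row) for row in doc_term_counts]
term_frequency = [sum(col) for col in zip(*doc_term_counts)]

ldavis_inputs = {
    "phi": phi,
    "theta": theta,
    "doc.length": doc_length,
    "vocab": vocab,
    "term.frequency": term_frequency,
}
```

Serializing those five objects (e.g. to .rdata or JSON) is then the whole handoff to the visualization side.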

I know we tend to be Pythonistas in ML, but what about using the tidytext package for cleaning?

https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html

Yep, happy to have the topic modelers use their tool of choice. In that case, maybe the acquire and extract phases end at the creation of the well-formatted and organized .txt files and their associated metadata?
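For that handoff, one hypothetical layout (the file and column names are illustrative, nothing here is settled) would pair one cleaned .txt per paper with a single metadata CSV keying each file to its bibliographic fields:

```python
import csv
import os
import tempfile

# Illustrative handoff format: one cleaned .txt per paper plus a
# metadata.csv linking each text file to its title and year.
papers = [
    {"id": "paper_001", "title": "Deep learning for triage", "year": "2018",
     "text": "cleaned body text of paper one"},
    {"id": "paper_002", "title": "Sepsis prediction models", "year": "2019",
     "text": "cleaned body text of paper two"},
]

out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "metadata.csv"), "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "year", "txt_file"])
    writer.writeheader()
    for p in papers:
        txt_name = p["id"] + ".txt"
        # Write the cleaned text alongside the metadata row that points to it.
        with open(os.path.join(out_dir, txt_name), "w") as txt:
            txt.write(p["text"])
        writer.writerow({"id": p["id"], "title": p["title"],
                         "year": p["year"], "txt_file": txt_name})

print(sorted(os.listdir(out_dir)))
```

Either side (tidytext in R, or nltk/spaCy in Python) can consume that directory without caring what produced it.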

Added a script to build datasets compatible with LDAvis. See https://github.com/cjbayesian/ml4h_paper_2019/blob/1dfef5e392a94a021effeff726baad8f0a78f9d8/create_r_dataset.R
I also uploaded these .rdata files to our drive folder.