ericpan64/Covid19-Hospitalization-Prediction

Initial NLP Analysis

Closed this issue · 2 comments

Goal: using the CORD-19 dataset, write a script that aggregates word frequencies across the different texts (feel free to add/adjust the analysis as you see fit). Incorporate a Python NLP library of your choice (e.g. spaCy, CoreNLP)

Workflow in order

  • Download the metadata.csv file from the Kaggle competition website. The 9th column of this file contains the abstracts. The file has ~341k entries, so that many abstracts on COVID-19. For simplicity we can just use the abstracts for now; full text can be an extension/future direction mentioned in the conclusion

  • Parse out the abstracts

  • pip install csvtool (see the docs)

  • csvtool -c 9 metadata.csv > abstracts.txt
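If csvtool gives trouble, the column extraction can also be done with pandas. A minimal sketch, assuming the metadata.csv header names the column "abstract" (worth checking against the actual file); a toy in-memory CSV stands in for metadata.csv so the snippet runs on its own:

```python
# Sketch of an alternative to csvtool using pandas.
# Assumes metadata.csv has a column named "abstract" (check the header row).
import io
import pandas as pd

# Toy in-memory stand-in for metadata.csv so this is self-contained;
# in practice: df = pd.read_csv("metadata.csv", usecols=["abstract"])
sample = io.StringIO(
    "cord_uid,title,abstract\n"
    "x1,Paper A,COVID-19 outcomes study\n"
    "x2,Paper B,\n"
)
df = pd.read_csv(sample, usecols=["abstract"])
abstracts = df["abstract"].dropna().tolist()  # rows with empty abstracts are dropped
print(abstracts)
```

This sidesteps any column-numbering confusion by selecting on the column name instead of position.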

  • Write a Python script using spaCy and scispaCy models to extract biomedical entities. Pick a pretrained model to use.

  • example Python code

  • models to consider here

    • most likely we'd use "en_ner_bc5cdr_md", a model that recognizes disease and chemical terms; it also has the highest F1 score, and it can limit the number of entities we retrieve, making things more manageable. These entities can match up with concept IDs in our dataset related to the drug exposure (chemical) and observation/procedure (disease) tables
    • if we have time, we can also use "en_ner_bionlp13cg_md", which will recognize many more entity types
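As a starting point for the extraction script, here's a minimal sketch. It assumes the en_ner_bc5cdr_md model wheel has been pip-installed per the scispaCy README; the spaCy import is deferred into the function so the file loads even where the library isn't present:

```python
# Sketch: extract DISEASE/CHEMICAL entities from abstracts with scispaCy.
# Assumes the en_ner_bc5cdr_md model has been pip-installed (the scispaCy
# README lists the model wheel URLs).

def extract_entities(texts, model="en_ner_bc5cdr_md", batch_size=256):
    """Yield (entity_text, label) pairs, e.g. ("remdesivir", "CHEMICAL")."""
    import spacy  # deferred so this file imports without spaCy installed

    nlp = spacy.load(model)
    for doc in nlp.pipe(texts, batch_size=batch_size):
        for ent in doc.ents:
            yield ent.text.lower(), ent.label_

# Usage (requires the model to be installed):
# for text, label in extract_entities(open("abstracts.txt")):
#     print(text, label)
```

nlp.pipe batches the abstracts, which matters at ~341k documents; lowercasing here makes the later aggregation step's counting less fragmented.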

Now the hard part

  • Aggregate the entities together across all the abstracts

    • this would be an excellent opportunity to use Scala or another big data tool, because we'd most likely need one to finish within a reasonable amount of time. This does not go into the Docker image; it's a preprocessing step for feature selection.
    • we'd use counts and select the top-occurring diseases and chemicals
    • here we may encounter challenges with stemming, or other issues with terms that are similar but not exactly the same strings.
    • if it's really difficult to aggregate, we can limit ourselves to one-word entities and use another NLP package to take care of stemming.
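For prototyping the aggregation locally before reaching for Scala, a plain Counter is enough; the big data version would mirror the same group-and-count logic. A sketch, assuming (entity, label) pairs like those the NER step would produce (the pairs below are made up for illustration):

```python
# Sketch: aggregate extracted entities and keep the top-N per label.
# Input: (entity_text, label) pairs, lowercased; a real run would stream
# millions of pairs from the NER step rather than a hardcoded list.
from collections import Counter

def top_entities(pairs, n=3):
    counts = {"DISEASE": Counter(), "CHEMICAL": Counter()}
    for text, label in pairs:
        if label in counts:  # ignore any other entity types
            counts[label][text] += 1
    return {label: c.most_common(n) for label, c in counts.items()}

# Illustrative pairs, not real NER output:
pairs = [
    ("covid-19", "DISEASE"), ("pneumonia", "DISEASE"), ("covid-19", "DISEASE"),
    ("remdesivir", "CHEMICAL"), ("remdesivir", "CHEMICAL"), ("dexamethasone", "CHEMICAL"),
]
print(top_entities(pairs, n=2))
# {'DISEASE': [('covid-19', 2), ('pneumonia', 1)],
#  'CHEMICAL': [('remdesivir', 2), ('dexamethasone', 1)]}
```

The stemming/near-duplicate problem shows up here as split counts (e.g. "pneumonia" vs "pneumonias" counted separately); normalizing entities to lemmas before counting would merge those buckets.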
  • match those up with concept IDs

  • This is the part I thought about for the longest time.... I think the easiest way to do this is to also run the concept name column in the dictionary file through the scispaCy model(s), so that each concept ID can be linked to one or more entities.

  • similar to the challenge of aggregating entities in abstracts, we may need to take shortcuts like keeping only one-word entities, or do some smart fuzzy string matching to map the entities from abstracts to concept IDs
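The fuzzy string matching shortcut can be prototyped with the standard library's difflib before bringing in anything heavier. A sketch, assuming a {concept_name: concept_id} dict built from the dictionary file's concept name column (the names and IDs below are illustrative placeholders, not pulled from the real file):

```python
# Sketch: map abstract entities to concept IDs via fuzzy string matching.
# Assumes a {concept_name: concept_id} dict built from the dictionary file,
# with names lowercased; entries below are placeholders for illustration.
from difflib import get_close_matches

concepts = {
    "pneumonia": 255848,
    "dexamethasone": 1518254,
    "acute respiratory distress syndrome": 4195694,
}

def match_concept(entity, concepts, cutoff=0.85):
    """Return the concept ID of the closest name above the cutoff, else None."""
    hits = get_close_matches(entity.lower(), concepts.keys(), n=1, cutoff=cutoff)
    return concepts[hits[0]] if hits else None

print(match_concept("pneumonias", concepts))  # 255848 -- plural still matches
print(match_concept("aspirin", concepts))     # None -- nothing close enough
```

The cutoff is the knob to tune: too low and unrelated terms collide, too high and minor spelling/inflection variants fall through, which is the same trade-off the one-word-entity shortcut is trying to dodge.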

  • identify concept ids that are important based on frequencies

  • and then finally use those features in our machine learning model

Updated the checklist, AFAIK from your update we can close this issue! (given the framework and initial results are set up)

Let's open a separate, more specific issue for the next-steps during the meeting tomorrow