SkimLit

An NLP model, trained on the PubMed 200k RCT dataset, that makes medical abstracts (from PubMed and similar sources) easier to read.

This notebook builds a model that labels each sentence of a medical abstract with the role it plays, so a reader can skim to the part they need.

The paper being implemented here is: PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

Model used in the paper: Neural Networks for Joint Sentence Classification in Medical Paper Abstracts

About the Dataset

PubMed 200k RCT is a dataset based on PubMed for sequential sentence classification. It consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion.

Dernoncourt, F., & Lee, J. Y. (2017).
PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts.
arXiv preprint arXiv:1710.06071.
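
To make the "one label per sentence" structure concrete, here is a minimal sketch of turning raw labeled abstracts into one record per sentence. It assumes a text layout in which each abstract starts with a `###` ID line, each sentence sits on its own line as `LABEL<TAB>sentence`, and a blank line ends the abstract. The function name `parse_abstracts`, the record keys, the uppercase label spellings, and the toy sentences are all illustrative assumptions, not taken from the notebook.

```python
from typing import Dict, List, Tuple


def parse_abstracts(raw_text: str) -> List[Dict[str, object]]:
    """Turn raw labeled abstracts into one record per sentence.

    Assumed layout (hypothetical): '###<id>' opens an abstract,
    'LABEL<TAB>sentence' lines follow, and a blank line closes it.
    """
    samples: List[Dict[str, object]] = []
    current: List[Tuple[str, str]] = []  # (label, sentence) pairs of the open abstract

    # The extra "" guarantees the last abstract is flushed even without a trailing blank line.
    for line in raw_text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("###"):
            current = []  # a new abstract begins
        elif not line:
            # End of abstract: emit every sentence with its position in the abstract.
            for position, (label, text) in enumerate(current):
                samples.append(
                    {
                        "target": label,
                        "text": text,
                        "line_number": position,        # 0-based position in the abstract
                        "total_lines": len(current),    # number of sentences in the abstract
                    }
                )
            current = []
        elif "\t" in line:
            label, text = line.split("\t", 1)
            current.append((label, text))
    return samples


# Toy input written for this example (not real dataset content).
toy_raw = (
    "###0001\n"
    "OBJECTIVE\tTo test a toy example .\n"
    "METHODS\tTwo groups were compared .\n"
    "\n"
    "###0002\n"
    "RESULTS\tScores improved .\n"
)
samples = parse_abstracts(toy_raw)
for s in samples:
    print(s["line_number"], s["total_lines"], s["target"], s["text"])
```

Keeping `line_number` and `total_lines` on every record is a design choice: the position of a sentence within its abstract is a useful signal for models that look at more than the sentence text.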

Modelling Experiments

  • Model 0: TF-IDF Multinomial Naive Bayes classifier (baseline)
  • Model 1: Conv1D with token embeddings
  • Model 2: Feature Extraction with pretrained token embeddings
  • Model 3: Conv1D with character embeddings
  • Model 4: Combining pretrained token embeddings + character embeddings (hybrid embedding layer)
  • Model 5: Tribrid Transfer Learning with pretrained token embeddings + character embeddings + positional embeddings
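
The experiments above differ mainly in which inputs they feed the network: tokens, characters, and sentence position. The sketch below shows, in plain Python, one way those three inputs could be prepared for a single sentence. It is an illustration under assumptions, not the notebook's code: the helper names, the record keys, the one-hot encoding of position, and the depths 15 and 20 are all hypothetical, and clipping long abstracts into the last slot is a choice made here for simplicity.

```python
from typing import Dict, List


def split_chars(sentence: str) -> str:
    """Space-separate every character so a character-level vectorizer
    treats each character as its own token."""
    return " ".join(list(sentence))


def one_hot(index: int, depth: int) -> List[int]:
    """One-hot encode a position; indices at or beyond `depth`
    are clipped into the last slot (a choice made for this sketch)."""
    vec = [0] * depth
    vec[min(index, depth - 1)] = 1
    return vec


def tribrid_inputs(
    sample: Dict[str, object],
    line_depth: int = 15,    # hypothetical cap on sentence position
    total_depth: int = 20,   # hypothetical cap on abstract length
) -> Dict[str, object]:
    """Build the three input groups for one sentence:
    token text, character text, and positional features."""
    text = str(sample["text"])
    return {
        "tokens": text,                      # fed to a token embedding layer
        "chars": split_chars(text),          # fed to a character embedding layer
        "line_number": one_hot(int(sample["line_number"]), line_depth),
        "total_lines": one_hot(int(sample["total_lines"]), total_depth),
    }


# Toy record written for this example (not real dataset content).
sample = {"text": "Scores improved .", "line_number": 2, "total_lines": 9}
features = tribrid_inputs(sample)
print(features["chars"])
print(features["line_number"])
```

Under this framing, Models 1 and 2 would use only `tokens`, Model 3 only `chars`, Model 4 both, and Model 5 all four fields.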

Things left to do:

  • Make example predictions
  • Add tensorboard callbacks
  • Train the model on the 200k dataset
  • etc.