Abstract segmentation with sparse data

This is the source code for the paper "Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data" (arXiv preprint), presented at JCDL 2020.

Data

This repository includes the dataset of segmented CS abstracts.

Data	Source	Directory
`PubMed-non-RCT`¹	non RCT articles from PubMed	PubMedData/
`cs.NI`	cs.networks subdomain from arxiv.org	arxiv_final/
`cs.TLT`	IEEE Transactions on Learning Technologies	IEEE_final/TLT/
`cs.TPAMI`	IEEE Transactions on Pattern Analysis and Machine Intelligence	IEEE_final/TPAMI/
`cs.combined`	`cs.NI` + `cs.TLT` + `cs.TPAMI`	Merged/

¹ The PubMed-non-RCT dataset was too large to be included in this repository. The code to build the dataset is provided along with a small sample of the data.

Embeddings

We used the Common Crawl (42B tokens, 300 dimensions) GloVe embeddings in word2vec format.
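
For readers unfamiliar with the word2vec text format: it is a plain-text file whose first line gives the vocabulary size and vector dimension, followed by one word and its vector per line (raw GloVe files lack that header line). The sketch below parses such a file with the standard library only. It is an illustration of the format, not the code in Code/embeddings_loader.py; the function name `load_word2vec_text` and the three-dimensional sample are made up for the example.

```python
import io


def load_word2vec_text(handle):
    """Parse word2vec text format from an open text handle.

    Hypothetical helper for illustration; not part of this repository.
    """
    # Header line: "<vocab_size> <dim>". The declared size is not enforced here.
    header = handle.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])

    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed rows
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return dim, vectors


# Tiny in-memory sample standing in for the real 300-dimensional file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\nmodel -0.5 0.0 0.25\n")
dim, vectors = load_word2vec_text(sample)
print(dim, sorted(vectors))
```

For the real file, the same function could be called on `open(path, encoding="utf-8")`, although loading all vectors into Python lists would use a lot of memory.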

Dependencies

python 3.5.6
tensorflow 1.10.0
keras 2.2.4
keras-self-attention 0.47.0
sklearn 0.20.3

Usage

Navigate to Code/
Set the PRETRAINED_EMBEDDINGS location² in line 5 of Code/embeddings_loader.py
Run abstract_analysis.py

    python abstract_analysis.py -h
    usage: abstract_analysis.py [-h] [-b] [-f] [-s]
                                [{arxiv,IEEE_TLT,IEEE_TPAMI,merged}]
                                [retraining_size]

    positional arguments:
      {arxiv,IEEE_TLT,IEEE_TPAMI,merged}
                            The evaluation dataset, default= arxiv
      retraining_size        Data size for fine tuning, default= 340

    optional arguments:
      -h, --help            show this help message and exit
      -b, --generate_baseline
                            For generating baseline without pre training
      -f, --fine_tune_with_pred
                            For evaluating the effect of transfer learning
      -s, --predict_and_save
                            To generate labels for unlabled abstracts,
                            conflicts with -f/--fine_tune_with_pred
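
The help text above can be read as the following argument layout. This is a minimal sketch of an `argparse` parser that accepts the same arguments, written only from the help output; the actual parser in abstract_analysis.py may differ. The destination names `dataset` and `retraining_size`, the helper `parse_args`, and the explicit conflict check between `-f` and `-s` are assumptions.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction from the help text, not the repository's code.
    parser = argparse.ArgumentParser(prog="abstract_analysis.py")
    parser.add_argument(
        "dataset", nargs="?", default="arxiv",
        choices=["arxiv", "IEEE_TLT", "IEEE_TPAMI", "merged"],
        help="The evaluation dataset, default= arxiv")
    parser.add_argument(
        "retraining_size", nargs="?", type=int, default=340,
        help="Data size for fine tuning, default= 340")
    parser.add_argument("-b", "--generate_baseline", action="store_true")
    parser.add_argument("-f", "--fine_tune_with_pred", action="store_true")
    parser.add_argument("-s", "--predict_and_save", action="store_true")
    return parser


def parse_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumed check: the help text says -s conflicts with -f.
    if args.fine_tune_with_pred and args.predict_and_save:
        parser.error("-s/--predict_and_save conflicts with -f/--fine_tune_with_pred")
    return args


# Equivalent to: python abstract_analysis.py IEEE_TLT 200 -f
args = parse_args(["IEEE_TLT", "200", "-f"])
print(args.dataset, args.retraining_size, args.fine_tune_with_pred)
```

Under this reading, running the script with no arguments evaluates on `arxiv` with a fine-tuning size of 340.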

² This might cause issues with line endings. To resolve them, open and re-save all files on the local system.
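
If re-saving every file by hand is tedious, the same effect can be approximated with a short script. The sketch below rewrites files so that they use the local platform's line-ending convention, on the assumption that this is what re-saving in a local editor achieves. The helper names `normalize_line_endings` and `normalize_tree` and the `*.py` pattern are hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def normalize_line_endings(path, newline=os.linesep):
    """Rewrite one file with the given line ending; return True if it changed."""
    raw = Path(path).read_bytes()
    # First collapse CRLF to LF, then expand LF to the target convention.
    unified = raw.replace(b"\r\n", b"\n")
    fixed = unified.replace(b"\n", newline.encode("ascii"))
    if fixed == raw:
        return False
    Path(path).write_bytes(fixed)
    return True


def normalize_tree(root, pattern="*.py"):
    """Normalize every matching file under root; return the paths that changed."""
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        if normalize_line_endings(path):
            changed.append(str(path))
    return changed
```

Calling `normalize_tree("Code")` from the repository root would process the Python sources; binary files should not be passed through this function.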

Contributors

Soumya Banerjee
Dr Debarshi Kr Sanyal
Dr Samiran Chattopadhyay