This is the source code for the paper Segmenting Scientific Abstracts into Discourse Categories: A Deep Learning-Based Approach for Sparse Labeled Data (arXiv preprint), presented at JCDL 2020.
This repository includes the dataset of segmented CS abstracts.
Data | Source | Directory |
---|---|---|
PubMed-non-RCT [1] | Non-RCT articles from PubMed | PubMedData/ |
cs.NI | cs.networks subdomain from arxiv.org | arxiv_final/ |
cs.TLT | IEEE Transactions on Learning Technologies | IEEE_final/TLT/ |
cs.TPAMI | IEEE Transactions on Pattern Analysis and Machine Intelligence | IEEE_final/TPAMI/ |
cs.combined | cs.NI + cs.TLT + cs.TPAMI | Merged/ |
[1] The PubMed-non-RCT dataset was too large to include in this repository. The code to build the dataset is provided, along with a small sample of the data.
We used the Common Crawl GloVe embeddings (42B tokens, 300 dimensions) in word2vec format.
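GloVe vectors are distributed as plain text without the `<vocab_size> <dimensions>` header line that the word2vec text format expects. The following is a minimal sketch of one way to add that header, not the conversion used by the authors. The function name `glove_to_word2vec` and both path arguments are hypothetical, and the sketch assumes that tokens contain no spaces.

```python
def glove_to_word2vec(glove_path, word2vec_path):
    """Copy a GloVe text file to word2vec text format by adding the header line.

    Hypothetical helper, not part of this repository.
    Assumes one vector per line: a token followed by space-separated floats.
    """
    count, dim = 0, None
    # First pass: count the vectors and read the dimension from the first row.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            if dim is None:
                dim = len(line.rstrip().split(" ")) - 1
            count += 1
    if count == 0:
        raise ValueError("no vectors found in " + glove_path)
    # Second pass: write the "<count> <dim>" header, then the rows unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write("{} {}\n".format(count, dim))
        for line in src:
            if line.strip():
                dst.write(line)
    return count, dim
```

The file is read twice so that the full embedding table never has to be held in memory, which matters for a vocabulary of this size.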
- python 3.5.6
- tensorflow 1.10.0
- keras 2.2.4
- keras-self-attention 0.47.0
- sklearn 0.20.3
- Navigate to Code/
- Set the `PRETRAINED_EMBEDDINGS` location [2] in line 5 of Code/embeddings_loader.py
python abstract_analysis.py -h
usage: abstract_analysis.py [-h] [-b] [-f] [-s]
[{arxiv,IEEE_TLT,IEEE_TPAMI,merged}]
[retraining_size]
positional arguments:
{arxiv,IEEE_TLT,IEEE_TPAMI,merged}
The evaluation dataset, default= arxiv
retraining_size Data size for fine tuning, default= 340
optional arguments:
-h, --help show this help message and exit
-b, --generate_baseline
For generating baseline without pre training
-f, --fine_tune_with_pred
For evaluating the effect of transfer learning
-s, --predict_and_save
To generate labels for unlabled abstracts,
conflicts with -f/--fine_tune_with_pred
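To make the argument combinations concrete, here is a small `argparse` sketch reconstructed only from the help text above. It is not the repository's parser: the attribute name `dataset`, the integer type of `retraining_size`, and the explicit conflict check between `-f` and `-s` are assumptions.

```python
import argparse

# Dataset choices and defaults are copied from the help text above.
DATASETS = ["arxiv", "IEEE_TLT", "IEEE_TPAMI", "merged"]


def build_parser():
    parser = argparse.ArgumentParser(prog="abstract_analysis.py")
    # "dataset" is a hypothetical attribute name for the first positional.
    parser.add_argument("dataset", nargs="?", choices=DATASETS, default="arxiv",
                        help="The evaluation dataset, default= arxiv")
    # Parsing the size as an int is an assumption.
    parser.add_argument("retraining_size", nargs="?", type=int, default=340,
                        help="Data size for fine tuning, default= 340")
    parser.add_argument("-b", "--generate_baseline", action="store_true",
                        help="For generating baseline without pre training")
    parser.add_argument("-f", "--fine_tune_with_pred", action="store_true",
                        help="For evaluating the effect of transfer learning")
    parser.add_argument("-s", "--predict_and_save", action="store_true",
                        help="To generate labels for unlabeled abstracts")
    return parser


def parse_cli(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    # The help text says -s conflicts with -f; reject the combination here.
    if args.fine_tune_with_pred and args.predict_and_save:
        parser.error("-s/--predict_and_save conflicts with "
                     "-f/--fine_tune_with_pred")
    return args
```

Under these assumptions, `parse_cli(["merged", "500", "-f"])` selects the merged dataset with a retraining size of 500 and enables fine-tuning, while `parse_cli([])` falls back to `arxiv` and 340.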
[2] This might cause issues with line endings. To resolve them, open and save all files on the local system.
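Opening and saving each file by hand can also be scripted. The sketch below rewrites the line endings of matching files to a chosen convention, by default that of the local platform. The note above does not say which convention is needed, so that default, the function name `normalize_line_endings`, and the `.py` suffix filter are all assumptions.

```python
import os
from pathlib import Path


def normalize_line_endings(root, suffixes=(".py",), newline=os.linesep):
    """Rewrite files under `root` so every line ends with `newline`.

    Hypothetical helper, not part of this repository.
    Returns the paths of the files that were changed.
    """
    target = newline.encode("ascii")
    changed = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        data = path.read_bytes()
        # Collapse CRLF to LF first, then expand to the target ending.
        fixed = data.replace(b"\r\n", b"\n")
        if target != b"\n":
            fixed = fixed.replace(b"\n", target)
        if fixed != data:
            path.write_bytes(fixed)
            changed.append(str(path))
    return changed
```

Working on bytes rather than decoded text keeps the file contents otherwise untouched, whatever their encoding.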
- Soumya Banerjee
- Dr Debarshi Kr Sanyal
- Dr Samiran Chattopadhyay