This GPT-2 model is capable of generating abstracts given paper titles.
This repo contains a part of the dataset from arxiv_archive. Specifically, it contains papers under aritficial intelligence (AI), machine learning (LG), computation and language (CL), and computer vision and pattern recognition (CV) categories.
I use the titles and abstract of these papers to fine-tune my GPT-2 model.
Splits | Count | Percentage (%) | BPE Token Count |
---|---|---|---|
Train | 90,000 | 90.11 | 20,834,012 |
Validation | 4,940 | 4.95 | 1,195,056 |
Test | 4,940 | 4.95 | 1,218,754 |
Total | 99,880 | 100 | 23,247,822 |
The script reads the .tsv
file, adds special symbols to separate the titile and the abstract, sorts them by dates, and writes a text file for each category and an extra file of shuffled instances. It splits all examples into training, validation, and test sets.
> python preprocess_arxiv.py
> pip install -r requirements.txt
> python train.py
> python generate.py
@dataset{r_stuart_geiger_2019_2533436,
author= {R. Stuart Geiger},
title={{ArXiV Archive: A tidy and complete archive of metadata for papers on arxiv.org, 1993-2019}},
month=jan,
year= 2019,
publisher={Zenodo},
version= {v1.0.1},
doi={10.5281/zenodo.2533436},
url={https://doi.org/10.5281/zenodo.2533436}
}
-
Chris Liu
-
Yiyun Zheng