/gpt2-arxiv

Fine-tuning GPT-2 to generate research paper abstracts

Primary LanguagePythonMIT LicenseMIT

CSE142 Project6: Fine-Tuning GPT-2 to Generate Research Paper Abstracts

This GPT-2 model is capable of generating abstracts given paper titles.

ArXiv Dataset

This repo contains a part of the dataset from arxiv_archive. Specifically, it contains papers under aritficial intelligence (AI), machine learning (LG), computation and language (CL), and computer vision and pattern recognition (CV) categories.

I use the titles and abstract of these papers to fine-tune my GPT-2 model.

Splits Count Percentage (%) BPE Token Count
Train 90,000 90.11 20,834,012
Validation 4,940 4.95 1,195,056
Test 4,940 4.95 1,218,754
Total 99,880 100 23,247,822

Usage

The script reads the .tsv file, adds special symbols to separate the titile and the abstract, sorts them by dates, and writes a text file for each category and an extra file of shuffled instances. It splits all examples into training, validation, and test sets.

> python preprocess_arxiv.py

Getting Started

Prerequisites

> pip install -r requirements.txt

Fine-Tune

> python train.py

Generate

> python generate.py

Reference

@dataset{r_stuart_geiger_2019_2533436,
    author= {R. Stuart Geiger},
    title={{ArXiV Archive: A tidy and complete archive of metadata for papers on arxiv.org, 1993-2019}},
    month=jan,
    year= 2019,
    publisher={Zenodo},
    version= {v1.0.1},
    doi={10.5281/zenodo.2533436},
    url={https://doi.org/10.5281/zenodo.2533436}
}

Author(s)

  • Chris Liu

  • Yiyun Zheng