Long document dataset

This dataset accompanies the paper "Long Document Classification from Local Word Glimpses via Recurrent Attention Learning".

The dataset consists of papers downloaded from the arXiv website, collected with the arXiv sanity preserver program (https://github.com/karpathy/arxiv-sanity-preserver/). All papers were downloaded in PDF format and converted to plain text with the PDF-to-text script provided by arXiv sanity preserver. The dataset covers 11 classes; the table below lists the number of documents and the average word count for each class.

Class name	Number of documents	Average words per document
cs.AI (Artificial Intelligence)	2995	6212
cs.CE (Computational Engineering)	2505	5777
cs.CV (Computer Vision)	2525	5630
cs.DS (Data Structures)	4136	7439
cs.IT (Information Theory)	3233	5938
cs.NE (Neural and Evolutionary)	3012	5856
cs.PL (Programming Languages)	2901	7012
cs.SY (Systems and Control)	3106	5948
math.AC (Commutative Algebra)	2885	5984
math.GR (Group Theory)	3065	6642
math.ST (Statistics Theory)	6025	6983
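
The class sizes in the table are uneven, which matters when choosing a train/test split or a baseline. The sketch below is not part of the original release: it only re-enters the numbers from the table and derives a few summary figures. The names `CLASS_STATS` and `summarize` are made up for illustration, and the corpus-wide average is an approximation, since it is computed from the rounded per-class averages.

```python
# Per-class statistics copied from the table above:
# class name -> (number of documents, average words per document).
CLASS_STATS = {
    "cs.AI": (2995, 6212),
    "cs.CE": (2505, 5777),
    "cs.CV": (2525, 5630),
    "cs.DS": (4136, 7439),
    "cs.IT": (3233, 5938),
    "cs.NE": (3012, 5856),
    "cs.PL": (2901, 7012),
    "cs.SY": (3106, 5948),
    "math.AC": (2885, 5984),
    "math.GR": (3065, 6642),
    "math.ST": (6025, 6983),
}


def summarize(stats):
    """Return (total documents, weighted average length, largest class, its share)."""
    total_docs = sum(docs for docs, _ in stats.values())
    # Weight each class average by its document count so that large
    # classes contribute proportionally to the corpus-wide estimate.
    weighted_avg = sum(docs * words for docs, words in stats.values()) / total_docs
    largest = max(stats, key=lambda name: stats[name][0])
    largest_share = stats[largest][0] / total_docs
    return total_docs, weighted_avg, largest, largest_share


if __name__ == "__main__":
    total, avg_words, largest, share = summarize(CLASS_STATS)
    print(f"documents: {total}")
    print(f"approx. average words per document: {avg_words:.0f}")
    print(f"largest class: {largest} ({share:.1%} of all documents)")
```

The share of the largest class is also the accuracy of a classifier that always predicts that class, so it is a useful floor when reading reported results.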
