Long document dataset

This dataset accompanies the paper "Long Document Classification from Local Word Glimpses via Recurrent Attention Learning".

The dataset consists of papers downloaded from the arXiv website, collected with the arXiv sanity preserver program (https://github.com/karpathy/arxiv-sanity-preserver/). All papers were downloaded in PDF format and then converted to plain text (txt) with the PDF-to-text program provided by arXiv sanity preserver. The dataset covers 11 classes; the table below lists the number of documents and the average word count for each class.

Class name                          Number of documents   Average words
cs.AI (Artificial Intelligence)     2995                  6212
cs.CE (Computational Engineering)   2505                  5777
cs.CV (Computer Vision)             2525                  5630
cs.DS (Data Structures)             4136                  7439
cs.IT (Information Theory)          3233                  5938
cs.NE (Neural and Evolutionary)     3012                  5856
cs.PL (Programming Languages)       2901                  7012
cs.SY (Systems and Control)         3106                  5948
math.AC (Commutative Algebra)       2885                  5984
math.GR (Group Theory)              3065                  6642
math.ST (Statistics Theory)         6025                  6983
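
The per-class figures above can be summarized with a few lines of Python. This is a minimal sketch, not part of the dataset itself: the names `CLASS_STATS` and `summarize` are hypothetical, the numbers are copied from the table as listed, and the overall average assumes that the "Average words" column is a per-document mean within each class.

```python
# Per-class figures copied from the table above:
# class name -> (number of documents, average words).
CLASS_STATS = {
    "cs.AI": (2995, 6212),
    "cs.CE": (2505, 5777),
    "cs.CV": (2525, 5630),
    "cs.DS": (4136, 7439),
    "cs.IT": (3233, 5938),
    "cs.NE": (3012, 5856),
    "cs.PL": (2901, 7012),
    "cs.SY": (3106, 5948),
    "math.AC": (2885, 5984),
    "math.GR": (3065, 6642),
    "math.ST": (6025, 6983),
}


def summarize(stats):
    """Return (total documents, document-weighted average words, imbalance ratio).

    Assumption: each class average is a per-document mean, so the overall
    mean weights every class average by its document count.
    """
    counts = [n for n, _ in stats.values()]
    total_docs = sum(counts)
    total_words = sum(n * avg for n, avg in stats.values())
    weighted_avg = total_words / total_docs
    # Ratio of the largest class to the smallest; 1.0 would mean balanced classes.
    imbalance = max(counts) / min(counts)
    return total_docs, weighted_avg, imbalance


if __name__ == "__main__":
    total_docs, weighted_avg, imbalance = summarize(CLASS_STATS)
    print(f"classes: {len(CLASS_STATS)}")
    print(f"total documents: {total_docs}")
    print(f"average words per document: {weighted_avg:.0f}")
    print(f"largest/smallest class ratio: {imbalance:.2f}")
```

The imbalance ratio gives a quick indication of whether the classes are evenly sized, which may matter when splitting the data or choosing an evaluation metric.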