This dataset accompanies the paper "Long Document Classification from Local Word Glimpses via Recurrent Attention Learning".
The dataset consists of papers downloaded from the arXiv website, collected with the arXiv sanity preserver program (https://github.com/karpathy/arxiv-sanity-preserver/). All downloaded papers are in PDF format; we converted them to plain text with the PDF-to-text program provided by arXiv sanity preserver. The dataset includes 11 classes; the table below lists the number of documents and the average word count for each class.

Class name | Number of documents | Average words per document |
---|---|---|
cs.AI (Artificial Intelligence) | 2995 | 6212 |
cs.CE (Computational Engineering) | 2505 | 5777 |
cs.CV (Computer Vision) | 2525 | 5630 |
cs.DS (Data Structures) | 4136 | 7439 |
cs.IT (Information Theory) | 3233 | 5938 |
cs.NE (Neural and Evolutionary) | 3012 | 5856 |
cs.PL (Programming Languages) | 2901 | 7012 |
cs.SY (Systems and Control) | 3106 | 5948 |
math.AC (Commutative Algebra) | 2885 | 5984 |
math.GR (Group Theory) | 3065 | 6642 |
math.ST (Statistics Theory) | 6025 | 6983 |
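
The per-class statistics in the table above could be recomputed from the converted text files with something like the minimal sketch below. It rests on assumptions not stated in this README: the directory layout (one subdirectory per class, one `.txt` file per paper) and the function name `class_stats` are hypothetical, and words are counted by splitting on whitespace, which may differ from how the published averages were obtained.

```python
from pathlib import Path


def class_stats(root):
    """Return {class_name: (num_documents, average_words)} for a dataset root.

    Assumed (hypothetical) layout: root/<class_name>/<paper>.txt
    """
    stats = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # skip stray files at the top level
        # Whitespace tokenization is an assumption; the averages in the
        # table may have been computed with a different tokenizer.
        counts = [
            len(path.read_text(encoding="utf-8", errors="ignore").split())
            for path in sorted(class_dir.glob("*.txt"))
        ]
        if counts:  # ignore class directories without any .txt files
            stats[class_dir.name] = (len(counts), sum(counts) / len(counts))
    return stats
```

Calling `class_stats` on the dataset directory returns one `(number of documents, average words)` pair per class, which can be compared against the table. Small differences in the averages would be expected if the tokenization differs.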