
Extract LaTeX source code from arXiv.org bulk archives

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

ALE: arXiv LaTeX Extract

ALE is a tool for bulk extracting LaTeX sources from arXiv.org by processing arXiv Bulk Data. Unlike other tools that rely exclusively on Amazon S3 for downloading, ALE primarily uses the mirror on archive.org, which is free but may be out of date. If boto3 is also installed and the environment variables AWS_ACCESS_KEY and AWS_SECRET_KEY hold valid AWS credentials, archives missing from the mirror are retrieved from Amazon S3 instead.
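The condition for the optional S3 fallback can be pictured with the minimal sketch below. It is not taken from main.py: the function name s3_fallback_available is hypothetical, and the check only confirms that boto3 is importable and that both variables are set, not that the credentials are actually valid.

```python
import importlib.util
import os


def s3_fallback_available(env=os.environ):
    """Return True if boto3 is importable and both credential variables are set.

    Hypothetical helper for illustration; it does not contact AWS, so it
    cannot tell whether the credentials are valid.
    """
    has_boto3 = importlib.util.find_spec("boto3") is not None
    has_credentials = all(
        env.get(name) for name in ("AWS_ACCESS_KEY", "AWS_SECRET_KEY")
    )
    return has_boto3 and has_credentials


if __name__ == "__main__":
    if s3_fallback_available():
        print("Missing archives can be fetched from Amazon S3.")
    else:
        print("Only the archive.org mirror will be used.")
```

Running a check like this before a long extraction makes it clear up front which download sources are in play.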

Installation

Clone the repository and install all requirements.

git clone https://github.com/potamides/arxiv-latex-extract.git
cd arxiv-latex-extract
pip install -r requirements.txt

In addition, this project needs latexpand to flatten LaTeX files, so make sure it is installed and on your PATH.
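One way to confirm that latexpand is reachable before starting is a short check with the standard library, sketched below. The helper name find_tool is hypothetical and not part of this project.

```python
import shutil


def find_tool(name="latexpand"):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool()
    if path is None:
        print("latexpand was not found on PATH; install it before running main.py.")
    else:
        print(f"latexpand found at {path}")
```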

Usage

To launch the script, execute main.py:

python main.py

The script displays a progress bar and saves extracted files in extracted/. Archive files are downloaded to archives/ as needed and deleted right afterwards. To keep the number of retrieved files manageable, the script by default only processes papers released after January 1st, 2010 that contain the phrase tikzpicture. To change this behavior, adapt the modulino in main.py to your liking.
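The default filter described above amounts to a date check combined with a substring check. The sketch below illustrates that idea only; the names keep_paper, CUTOFF, and PHRASE are hypothetical and the actual logic in main.py may be organized differently. It also assumes that "after" means strictly later than the cutoff date.

```python
from datetime import date

# Assumed defaults mirroring the behavior described above.
CUTOFF = date(2010, 1, 1)
PHRASE = "tikzpicture"


def keep_paper(latex_source, released, cutoff=CUTOFF, phrase=PHRASE):
    """Return True if the paper was released after the cutoff and contains the phrase."""
    return released > cutoff and phrase in latex_source


if __name__ == "__main__":
    sample = r"\begin{tikzpicture}\draw (0,0) -- (1,1);\end{tikzpicture}"
    print(keep_paper(sample, date(2015, 3, 1)))    # recent paper with the phrase
    print(keep_paper(sample, date(2009, 12, 31)))  # released before the cutoff
```

Passing a different cutoff or phrase to such a function is the kind of change the modulino in main.py would need in order to target other papers.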

Limitations

While this project worked wonderfully for my task, it is still a messy script that was hacked together in a short amount of time. Use at your own risk!

Acknowledgments

The code for cleaning up LaTeX files is largely based on the arXiv processing code of RedPajama-Data.