Data Pipeline

Downloading from arXiv

Processing with LaTeXML

The following command untars an arXiv source tar file and finds the math articles; it needs an internet connection to query the arXiv API for metadata:

python3 process.py \
   /media/hd1/arXiv_src/src/arXiv_src_2101_023.tar \
   $HOME/rm_me_process \
   --term math

More Processing

Getting Labeled Definitions

Classifying Definitions

Classifying with multiprocessing (this also works on a single GPU):

singularity run --nv \
    --bind $HOME/Documents/arxivDownload:/opt/arxivDownload,/media/hd1:/opt/data_dir \
    $HOME/singul/runner.sif python3 embed/mp_classify.py \
    --model /opt/data_dir/trained_models/lstm_classifier/lstm_Aug-19_17-22 \
    --out /rm_me_path/with_mp_classify \
    --mine /opt/data_dir/promath/math94/940{3,4,5}_001.tar.gz

NER

Example with singularity:

singularity run --nv \
    --bind $HOME/Documents/arxivDownload:/opt/arxivDownload,/media/hd1:/opt/data_dir \
    $HOME/singul/runner.sif python3 embed/inference_ner.py \
    --mine /opt/data_dir/glossary/inference_class_all/math96/*.xml.gz \
    --model /opt/data_dir/trained_models/ner_model/lstm_ner/ner_Sep-29_03-45/exp_001 \
    --out $HOME/rm_me_ner

Joining Phrases

  • MP_scripts/mpi_only_loop.py
  • slurm_scripts/mpi_joiner.sh
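
The scripts above parallelize the joining step with MPI. A minimal serial sketch of one plausible joining strategy (assumption: the step merges known multi-word phrases into single underscore-joined tokens so embedding tools treat each phrase as one word; the function name and phrase list are illustrative, not the scripts' actual API):

def join_phrases(text, phrases):
    # Replace longer phrases first so that e.g. "normed vector space"
    # is joined before its substring "vector space".
    for phrase in sorted(phrases, key=len, reverse=True):
        text = text.replace(phrase, phrase.replace(' ', '_'))
    return text

join_phrases('a normed vector space is complete',
             ['vector space', 'normed vector space'])
# -> 'a normed_vector_space is complete'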

Jupyter Notebooks

  • Populating SQLAlchemy databases, with examples
    • Filling the arxiv metadata database using databases/create_db_define_models.py
    • Query join examples in the SQLAlchemy query language (see the sketch after this list)
  • Parsing Arxiv Manifest and querying metadata.ipynb
    • Using the magic module to find file info
    • Structure of the data in the manifest file
    • Using the dload.py script and its objects
    • Basic usage of the arxiv API package
    • Very disorganized, mostly scratch work
  • Time stats check output and logs.ipynb
    • Code to read and interpret LaTeXML log files
    • Plot the LaTeXML processing time
  • getting problem articles for latexml.ipynb
    • Identify articles that are not included in the arxmliv database
    • Try to process these problematic articles either by removing environments or with LaTeXTual
  • Word embeddings generation and evaluation.py
    • Read the binary files produced by word2vec
    • Get the raw text ready for the embedders
    • Search arxiv.db for the tags of an article
    • t-SNE visualization of the tags of terms
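
The query-join examples mentioned above follow the same pattern as the raw SQL in the Queries section below. A minimal SQLAlchemy sketch, assuming ORM models named Article and Manifest (hypothetical stand-ins for whatever databases/create_db_define_models.py actually defines) linked by Article.tarfile_id:

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Article and Manifest are assumed model names; the real ones live in
# databases/create_db_define_models.py.
from databases.create_db_define_models import Article, Manifest

engine = create_engine('sqlite:////mnt/databases/arxivDB.db')
with Session(engine) as session:
    # All articles that came from tar file 1747, joined on the foreign key.
    rows = (session.query(Article.id, Article.tags)
            .join(Manifest, Article.tarfile_id == Manifest.id)
            .filter(Manifest.id == 1747)
            .all())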

Scripts

  • update_db.py
    • USAGE: python update_db.py DATABASE MANIFEST.xml tar_src_path [--log]
    • Where DATABASE is an SQLite database and MANIFEST.xml is an XML manifest file in the original format
    • tar_src_path is the directory where the tar files can be found
    • Ex. python3 update_db.py /mnt/databases/arxivDB.db ../arXiv_src_manifest_Oct_2019.xml /mnt/arXiv_src/
  • process.py
    • The Xtraction class reads and extracts arXiv tar files.
    • Queries the arXiv metadata with the arXiv API and the arxiv.py package.
    • Use Xtraction(tarfilename, db='sqlite:///pathdb') to read metadata from a database instead of the API (see the sketch after this list).
    • Writes arXiv metadata to a database.
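
A minimal usage sketch; only the constructor call shown above is taken from the source, and the paths are examples:

from process import Xtraction

# Metadata fetched from the arXiv API (the default):
x = Xtraction('/media/hd1/arXiv_src/src/arXiv_src_2101_023.tar')

# Metadata read from a local database instead of the API:
x = Xtraction('/media/hd1/arXiv_src/src/arXiv_src_2101_023.tar',
              db='sqlite:////mnt/databases/arxivDB.db')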

Queries

  • Index the article ID column to speed up queries:
CREATE INDEX id_ind on articles(id);

To search for an article, run the following query:

select tags from articles where id between "http://arxiv.org/abs/{0}" and "http://arxiv.org/abs/{0}{{";
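
The {0} placeholders and the {{ escape are Python str.format syntax; a minimal sqlite3 sketch of running the query:

import sqlite3

# The trailing '{' sorts after any alphanumeric suffix, so the BETWEEN
# range also catches ids like .../abs/1503.08375v2.
query = ('select tags from articles where id between '
         '"http://arxiv.org/abs/{0}" and "http://arxiv.org/abs/{0}{{";')

conn = sqlite3.connect('/mnt/databases/arxivDB.db')
print(conn.execute(query.format('1503.08375')).fetchall())
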
  • Count the articles in a year's worth of tar files
SELECT  count(articles.id) FROM manifest LEFT JOIN articles on manifest.id = articles.tarfile_id WHERE manifest.filename LIKE 'src/arXiv_src_06%' and articles.tags like '[{''term'': ''math%';
  • Find the authors (in general) with the most publications
SELECT author, count(*) AS c FROM articles GROUP BY author ORDER BY c DESC LIMIT 10;
  • Hack to find the main article tag
SELECT count(tags) FROM articles where tags LIKE '[{''term'': ''math.DG''%';
  • Find repeated entries, where DataId is the repeated column
SELECT DataId, COUNT(*) c FROM DataTab GROUP BY DataId HAVING c > 1;
  • Left join to quickly find all articles in a tar file
SELECT  articles.id, tags FROM manifest LEFT JOIN articles on manifest.id = articles.tarfile_id WHERE manifest.id = 1747;
  • To check the files with unknown encoding:
find . -name 'latexml_commentary.txt' -exec grep Ignoring {} \;
  • To process the first .tex file into an .xml file of the same name, appending the tail of the error stream to latexml_commentary.txt:
TEXF=$(ls *.tex | head -1); latexml $TEXF 2>&1 > ${TEXF%.*}.xml | tail -15 >> latexml_commentary.txt
  • To find directories not yet processed by latexml (those without a latexml_errors_mess.txt file):
find ./* -maxdepth 0 -type d '!' -exec test -e "{}/latexml_errors_mess.txt" ';' -print
  • To find manually cancelled latexml processes, search the latexml_errors_mess.txt files for:
Fatal:perl:die Perl died
  • When LaTeXML runs out of memory (for example on 1504.06138), the log ends with:
(Processing definitions /usOut of memory!

Notes

  • The API can only handle around 500 article ids per request (see the batching sketch after this list).
  • In 2014 the article name format changed from YYMM.{4 digits} to YYMM.{5 digits}.
  • In March 2007, the naming format of the articles changed from 0701/math0701672 to 1503/1503.08375 (see the id-parsing sketch after this list).
  • The distribution of the sizes (in bytes) of the tar files in the manifest:
Counter({Interval(-1857373.906, 382162956.2, closed='right'): 273,
         Interval(382162956.2, 764272737.4, closed='right'): 2222,
         Interval(764272737.4, 1146382518.6, closed='right'): 3,
         Interval(1528492299.8, 1910602081.0, closed='right'): 1})
Large files:
src/arXiv_src_1405_008.tar|805505033
src/arXiv_src_1512_003.tar|1910602081
src/arXiv_src_1812_033.tar|835663353
src/arXiv_src_1908_006.tar|803583004
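
Because of the ~500-id limit noted above, metadata requests have to be batched; a minimal sketch (the chunk size and fetch_metadata are illustrative, not the repo's actual API):

def chunks(ids, size=400):
    # Yield successive batches of at most `size` article ids,
    # staying safely under the ~500-id API limit.
    for i in range(0, len(ids), size):
        yield ids[i:i + size]

# for batch in chunks(all_ids):
#     results += fetch_metadata(batch)   # hypothetical API call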
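
The two article-id formats noted above can be told apart with a regex; a sketch (the patterns are assumptions based on the examples given):

import re

OLD_STYLE = re.compile(r'^[a-z.-]+\d{7}$')    # e.g. math0701672
NEW_STYLE = re.compile(r'^\d{4}\.\d{4,5}$')   # e.g. 1503.08375 (5 digits from 2014 on)

for name in ['math0701672', '0704.0001', '1503.08375']:
    print(name, 'old' if OLD_STYLE.match(name)
          else 'new' if NEW_STYLE.match(name) else '?')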

Definitions Tags

  • ltx_theorem_df -- /math.0406533

Problems

  • LaTeXML did not finish 2014/1411.6225/bcdr_en.tex

Testing

  • All the tests in the ./tests directory are discovered with the following command, run from the repo root directory:
PYTHONPATH="./tests" python -m unittest discover -s tests

Or, from the tests directory, run:

PYTHONPATH=".." python -m unittest discover -s tests

The file xml_file.xml is modified by the search.py module:

  • processed is False by default.
  • search exists only after locate has been run on the filesystem; it is True when the file was found and False when the file was searched for but not found.
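
A sketch of how these fields might be inspected, assuming they are stored as attributes on per-article elements (the element and attribute layout is a guess; only the field semantics come from the notes above):

from lxml import etree

root = etree.parse('xml_file.xml').getroot()
for art in root.iter():
    # `processed` defaults to False; `search` is absent until locate has run.
    if art.get('processed') == 'False':
        print('not processed yet:', art.get('name'))
    if art.get('search') == 'False':
        print('searched but not found:', art.get('name'))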