The following untars an arXiv source tar file and finds the math articles (uses an internet connection):
python3 process.py \
/media/hd1/arXiv_src/src/arXiv_src_2101_023.tar \
$HOME/rm_me_process \
--term math
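The extraction step that process.py performs can be pictured with the standard-library tarfile module alone. This is a hedged sketch, not the script's actual code: the member names are made up, and process.py does much more than list members.

```python
import io
import tarfile

# Sketch of process.py's first step: open a monthly arXiv source tar and
# list the per-article .gz members inside it. The demo tar is built in
# memory so the snippet is self-contained.
def gz_members(fileobj):
    with tarfile.open(fileobj=fileobj) as tar:
        return [m.name for m in tar.getmembers() if m.name.endswith(".gz")]

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in ("2101/2101.00001.gz", "2101/2101.00002.gz"):
        info = tarfile.TarInfo(name)          # illustrative member names
        data = b"dummy"
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
buf.seek(0)
print(gz_members(buf))  # ['2101/2101.00001.gz', '2101/2101.00002.gz']
```

With a real monthly tar you would pass an open file object (or use `tarfile.open(path)`) instead of the in-memory buffer.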
Classifying with multiprocessing (also works on a single GPU)
singularity run --nv \
--bind $HOME/Documents/arxivDownload:/opt/arxivDownload,/media/hd1:/opt/data_dir \
$HOME/singul/runner.sif python3 embed/mp_classify.py \
--model /opt/data_dir/trained_models/lstm_classifier/lstm_Aug-19_17-22 \
--out /rm_me_path/with_mp_classify \
--mine /opt/data_dir/promath/math94/940{3,4,5}_001.tar.gz
Example of NER inference with Singularity:
singularity run --nv \
--bind $HOME/Documents/arxivDownload:/opt/arxivDownload,/media/hd1:/opt/data_dir \
$HOME/singul/runner.sif python3 embed/inference_ner.py \
--mine /opt/data_dir/glossary/inference_class_all/math96/*.xml.gz \
--model /opt/data_dir/trained_models/ner_model/lstm_ner/ner_Sep-29_03-45/exp_001 \
--out $HOME/rm_me_ner
MP_scripts/mpi_only_loop.py
slurm_scripts/mpi_joiner.sh
- Populating SQLAlchemy databases, with examples
- Filling the arxiv metadata database using
databases/create_db_define_models.py
- Query join examples in the SQLAlchemy query language
- Parsing Arxiv Manifest and querying metadat.ipynb
- Using magic module to find file info
- Structure of the data in the manifest file
- using the dload.py script and its objects
- basic usage of the arxiv API package
- very disorganized, mostly scratch work
- Time stats check output and logs.ipynb
- code to read and interpret latexml log files
- plot time of latexml processing
- getting problem articles for latexml.ipynb
- Identify articles that are not included in the arxmliv database
- Try to process these problematic articles, either by removing environments or by using LaTeXTual
- Word embeddings generation and evaluation.py
- read the binary files produced by word2vec
- Get the raw text ready for embedders
- Search arxiv.db for the tags of an article
- tSNE visualization of the tags of terms
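The word2vec binary format mentioned above can be read with the standard library alone: a header line "vocab_size dim", then one "<word> <dim float32s>" record per vocabulary entry. This is an illustrative sketch, not the notebook's code.

```python
import io
import struct

# Minimal reader for the word2vec C binary format (a stand-in for the
# notebook code referenced above; real .bin files may separate records
# with newlines, which .strip() below tolerates).
def load_w2v_binary(fh):
    header = fh.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vecs = {}
    for _ in range(vocab_size):
        word = b""
        while (ch := fh.read(1)) != b" ":   # word is terminated by a space
            word += ch
        vecs[word.decode("utf-8").strip()] = struct.unpack(
            f"{dim}f", fh.read(4 * dim)     # dim little-endian float32s
        )
    return vecs

# tiny in-memory example
buf = io.BytesIO(
    b"2 3\ncat " + struct.pack("3f", 1, 2, 3) + b"dog " + struct.pack("3f", 4, 5, 6)
)
print(load_w2v_binary(buf)["cat"])  # (1.0, 2.0, 3.0)
```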
- update_db.py
- USAGE: python update_db.py DATABASE MANIFEST.xml tar_src_path [--log ]
- Where DATABASE is a SQLite database and MANIFEST.xml is an XML file in the original format
- tar_src_path is the directory where the tar files can be found
- Ex. python3 update_db.py /mnt/databases/arxivDB.db ../arXiv_src_manifest_Oct_2019.xml /mnt/arXiv_src/
- process.py
- The Xtraction class reads and extracts arXiv tar files.
- Querying the arxiv metadata with the arxiv API and the arxiv.py package
- Xtraction(tarfilename, db='sqlite:///pathdb') to read metadata from a database instead of api
- Writing arxiv metadata to a database.
- Index the article ID column to speedup queries
CREATE INDEX id_ind on articles(id);
To search for an article, run the following query:
select tags from articles where id between "http://arxiv.org/abs/{0}" and "http://arxiv.org/abs/{0}{{";
- Count the articles in a year of tar files
SELECT count(articles.id) FROM manifest LEFT JOIN articles on manifest.id = articles.tarfile_id WHERE manifest.filename LIKE 'src/arXiv_src_06%' and articles.tags like '[{''term'': ''math%';
- Find the authors (in general) with the most publications
SELECT author, count(*) AS c FROM articles GROUP BY author ORDER BY c DESC LIMIT 10;
- Hack to find main article tag
SELECT count(tags) FROM articles where tags LIKE '[{''term'': ''math.DG''%';
- find repeated entries where DataId is the repeated term
SELECT DataId, COUNT(*) c FROM DataTab GROUP BY DataId HAVING c > 1;
- Left join to quickly find all articles in a tar file
SELECT articles.id, tags FROM manifest LEFT JOIN articles on manifest.id = articles.tarfile_id WHERE manifest.id = 1747;
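The prefix-range trick in the article-search query above (BETWEEN the prefix and the prefix plus "{", which sorts after every digit and letter in ASCII, so all version suffixes fall inside the range) can be sketched with the stdlib sqlite3 module. The table here is a toy stand-in for arxiv.db, following the column names in the notes above.

```python
import sqlite3

# In-memory stand-in for arxiv.db; table and column names follow the notes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id TEXT, tags TEXT)")
conn.executemany(
    "INSERT INTO articles VALUES (?, ?)",
    [
        ("http://arxiv.org/abs/1503.08375v1", "[{'term': 'math.AG'}]"),
        ("http://arxiv.org/abs/1503.08376v1", "[{'term': 'cs.LG'}]"),
    ],
)
conn.execute("CREATE INDEX id_ind ON articles(id)")

def tags_for(article_id):
    """Range query: '{' sorts after all digits and letters in ASCII,
    so [prefix, prefix + '{'] covers every version suffix of the id."""
    pre = "http://arxiv.org/abs/" + article_id
    cur = conn.execute(
        "SELECT tags FROM articles WHERE id BETWEEN ? AND ?", (pre, pre + "{")
    )
    return [row[0] for row in cur]

print(tags_for("1503.08375"))  # only the 1503.08375v1 row matches
```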
- To check the files with unknown encoding:
find . -name 'latexml_commentary.txt' -exec grep Ignoring {} \;
- To process the first .tex file to an .xml file of the same name, sending the last part of the error stream to latexml_commentary.txt
TEXF=$(ls *.tex | head -1); latexml "$TEXF" 2>&1 > "${TEXF%.*}.xml" | tail -15 >> latexml_commentary.txt
- To find directories unprocessed by latexml (don't have a latexml_errors_mess.txt file)
find ./* -maxdepth 0 -type d '!' -exec test -e "{}/latexml_errors_mess.txt" ';' -print
- To filter manually cancelled latexml processes, search the latexml errors file for:
Fatal:perl:die Perl died
- When LaTeXML runs out of memory for example in 1504.06138
(Processing definitions /usOut of memory!
- The API can only handle around 500 article IDs per request.
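Because of that per-request limit, id lists have to be sent to the API in chunks. A minimal stdlib sketch; the 400-id batch size is a safety margin I chose, not a documented API constant.

```python
from itertools import islice

# Batch article ids so each API request stays well under the ~500-id limit
# noted above. The size of 400 is an assumption, chosen to leave headroom.
def batched(ids, size=400):
    """Yield successive lists of at most `size` ids."""
    it = iter(ids)
    while chunk := list(islice(it, size)):
        yield chunk

ids = [f"1503.{n:05d}" for n in range(1, 1001)]
print([len(b) for b in batched(ids)])  # [400, 400, 200]
```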
- In 2015 the article name format changed from YYMM.{4 digits} to YYMM.{5 digits}.
- In April 2007, the naming format of the articles changed from the form 0701/math0701672 to the form 1503/1503.08375.
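Both naming eras described in the notes above can be recognized with a small regex helper. This is an illustrative sketch, not the repo's parser; the group names and the helper itself are my own.

```python
import re

# Sketch of the two filename id formats noted above:
# pre-2007 "math0701672" (archive + YYMM + 3 digits) and
# post-2007 "1503.08375" (YYMM.{4 or 5 digits}, optional version suffix).
OLD_STYLE = re.compile(r"^(?P<archive>[a-z][a-z-]*)(?P<yymm>\d{4})(?P<num>\d{3})$")
NEW_STYLE = re.compile(r"^(?P<yymm>\d{4})\.(?P<num>\d{4,5})(?:v\d+)?$")

def split_id(name):
    """Return (archive, yymm, number) for either naming era, or None."""
    m = OLD_STYLE.match(name)
    if m:
        return m.group("archive"), m.group("yymm"), m.group("num")
    m = NEW_STYLE.match(name)
    if m:
        return None, m.group("yymm"), m.group("num")
    return None

print(split_id("math0701672"))  # ('math', '0701', '672')
print(split_id("1503.08375"))   # (None, '1503', '08375')
```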
- The distribution of the sizes of the tar files in the manifest:
Counter({Interval(-1857373.906, 382162956.2, closed='right'): 273,
Interval(382162956.2, 764272737.4, closed='right'): 2222,
Interval(764272737.4, 1146382518.6, closed='right'): 3,
Interval(1528492299.8, 1910602081.0, closed='right'): 1})
Large tar files (filename|size in bytes):
src/arXiv_src_1405_008.tar|805505033
src/arXiv_src_1512_003.tar|1910602081
src/arXiv_src_1812_033.tar|835663353
src/arXiv_src_1908_006.tar|803583004
- ltx_theorem_df -- /math.0406533
- LaTeXML did not finish 2014/1411.6225/bcdr_en.tex
- All the tests in the ./tests directory are discovered with the following command, run from the repo directory:
PYTHONPATH="./tests" python -m unittest discover -s tests
Or, from the tests directory, run:
PYTHONPATH=".." python -m unittest discover -s tests
The xml_file.xml is modified by the search.py module:
- processed: False by default.
- search: exists only after locate has been run on the filesystem; True when the file was found, False when it was searched for and not found.
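A minimal sketch of what stamping those two flags onto an XML file could look like, using the stdlib ElementTree. The element and attribute names here are assumptions for illustration, not necessarily search.py's actual schema.

```python
import xml.etree.ElementTree as ET

# Toy example of the two bookkeeping attributes described above.
# Element name, attribute names, and the id are all illustrative.
doc = ET.fromstring('<article id="1503.08375"/>')
doc.set("processed", "False")  # default until the file has been processed
doc.set("search", "True")      # set only after `locate` has been run
print(ET.tostring(doc, encoding="unicode"))
```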