iai-group/bsc2021-arxivtables

Enhancements to arXiv getter

Closed this issue · 0 comments

  • paper_downloader():
    • Send a GET request to the URL of a paper
    • Download .tar.gz file
    • Place the downloaded file into the appropriate folder (separated by year/month, stored as paper_id.tgz)
    • Keep a daily log in a text file (like logs/downloader/YYYYMMDD.log)
  • add_to_database():
    • Save the list of paper IDs downloaded on a given day to a text file (YYYYMMDD.csv/txt)
  • call table_extractor():
    • Process the list of papers for a given day (based on the paper list csv/txt file)
    • As part of the extraction process, extract the .tar.gz file into a tmp folder (we want to keep the files compressed and only uncompress them for extraction)
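
The three steps above could be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not the project's actual code. Several details are assumptions: the `root` layout (`lists/` for the daily ID files), the helper names `archive_path` and `log_line`, the `process` callback standing in for the real table extraction, and the idea that year/month are taken from a new-style arXiv ID (YYMM.NNNNN). The paper URL is supplied by the caller.

```python
import shutil
import tarfile
import tempfile
import urllib.request
from datetime import date
from pathlib import Path


def archive_path(paper_id: str, root: Path) -> Path:
    """Assumption: year/month come from a new-style arXiv ID (YYMM.NNNNN)."""
    yymm = paper_id.split(".")[0]
    return root / f"20{yymm[:2]}" / yymm[2:4] / f"{paper_id}.tgz"


def log_line(root: Path, day: date, message: str) -> None:
    """Append one line to the daily log, logs/downloader/YYYYMMDD.log."""
    log_file = root / "logs" / "downloader" / f"{day:%Y%m%d}.log"
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with open(log_file, "a", encoding="utf-8") as fh:
        fh.write(message + "\n")


def paper_downloader(paper_id: str, url: str, root: Path, day: date) -> Path:
    """GET the paper's archive and store it, still compressed, by year/month."""
    target = archive_path(paper_id, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    log_line(root, day, f"downloaded {paper_id} -> {target}")
    return target


def add_to_database(paper_id: str, root: Path, day: date) -> None:
    """Record the paper ID in the list for that day (hypothetical lists/ folder)."""
    list_file = root / "lists" / f"{day:%Y%m%d}.txt"
    list_file.parent.mkdir(parents=True, exist_ok=True)
    with open(list_file, "a", encoding="utf-8") as fh:
        fh.write(paper_id + "\n")


def table_extractor(root: Path, day: date, process):
    """Unpack each paper of the day into a tmp folder and hand it to process()."""
    list_file = root / "lists" / f"{day:%Y%m%d}.txt"
    # Use the safer "data" extraction filter where this Python version has it.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    results = []
    for paper_id in list_file.read_text(encoding="utf-8").split():
        # The tmp folder is deleted on exit, so only the .tgz stays on disk.
        with tempfile.TemporaryDirectory() as tmp:
            with tarfile.open(archive_path(paper_id, root), "r:gz") as tar:
                tar.extractall(tmp, **extract_kwargs)
            results.append(process(paper_id, Path(tmp)))
    return results
```

A daily run would then call `paper_downloader()` and `add_to_database()` for each new paper, followed by one `table_extractor()` call for that date. Keeping the extraction inside a `TemporaryDirectory` means the uncompressed sources never outlive the processing of a single paper.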