biorxiv-extractor

Extracts data from medRxiv and bioRxiv preprints.

Usage

python biorxiv.py [-h] [--noheader] [--section {methods,results,discussion}] doi {pdf,json,txt} outfile
  • doi is the DOI of the preprint to extract.

  • format (the {pdf,json,txt} argument) must be one of pdf, json, or txt, and specifies the file format in which the preprint is saved.

    • pdf downloads the preprint in its raw PDF format
    • json uses the full-text HTML if available; otherwise an error is raised. The main sections of the paper are labeled in a JSON list, and all subheaders are removed.
    • txt uses the full-text HTML if available; otherwise an error is raised. By default, txt extracts the paper's entire text, excluding references. To extract a single section, pass --section=methods, --section=results, or --section=discussion. Headings and subheadings are included by default and can be dropped with --noheader. Dropping them is recommended if your tool tokenizes on sentences, because rule-based tokenizers often merge a header with the sentence that follows it, producing incorrect sentences.
  • outfile is where the result is saved.
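
The argument layout described above can be sketched with argparse. This is only an illustration of how the options fit together, not the actual contents of biorxiv.py: the function name build_parser, the help strings, and the DOI in the example are all assumptions made for this sketch.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented usage line."""
    parser = argparse.ArgumentParser(prog="biorxiv.py")
    # Optional flags; both are described as applying to txt output.
    parser.add_argument("--noheader", action="store_true",
                        help="omit headings and subheadings from txt output")
    parser.add_argument("--section",
                        choices=["methods", "results", "discussion"],
                        help="extract only this section (txt output)")
    # Positional arguments, in the documented order.
    parser.add_argument("doi", help="DOI of the preprint to extract")
    parser.add_argument("format", choices=["pdf", "json", "txt"],
                        help="file format to save the preprint in")
    parser.add_argument("outfile", help="path where the result is saved")
    return parser


# Example: extract only the methods section, without headers, as plain text.
# The DOI below is a placeholder, not a real preprint.
example_args = build_parser().parse_args(
    ["--section=methods", "--noheader", "10.1101/000000", "txt", "methods.txt"]
)
print(example_args.section, example_args.noheader,
      example_args.format, example_args.outfile)
```

On the command line, the same example would be written as python biorxiv.py --section=methods --noheader 10.1101/000000 txt methods.txt, again with a placeholder DOI.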

Setup

Python 3.6.0 or greater is required for the json option to work correctly, but any version of Python 3 will work for pdf and txt extraction.
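
If you want to confirm ahead of time that your interpreter meets this requirement, a small check along the following lines may help. It is a sketch only: supports_format is a hypothetical helper written for this README, not a function provided by biorxiv.py, and it simply encodes the version rule stated above.

```python
import sys


def supports_format(fmt, version=sys.version_info):
    """Return True if fmt is expected to work on the given Python version.

    Hypothetical helper: json needs Python 3.6 or newer, while pdf and txt
    are documented to work on any Python 3 release.
    """
    if version[0] != 3:
        return False
    if fmt == "json":
        return tuple(version[:2]) >= (3, 6)
    return fmt in ("pdf", "txt")


if not supports_format("json"):
    print("The json option needs Python 3.6 or newer; use pdf or txt instead.")
```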

Dependencies