python biorxiv.py [-h] [--noheader] [--section {methods,results,discussion}] doi {pdf,json,txt} outfile
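The usage line above has the shape that Python's `argparse` prints for two optional flags followed by three positional arguments. The following is a minimal sketch of a parser that would accept the same command line. It is an assumption about how such an interface could be declared, not the actual source of `biorxiv.py`; the function name `build_parser` and the sample DOI are hypothetical.

```python
import argparse


def build_parser():
    # Hypothetical reconstruction of the interface in the usage line;
    # the real script may declare its arguments differently.
    parser = argparse.ArgumentParser(prog="biorxiv.py")
    parser.add_argument("--noheader", action="store_true",
                        help="omit headings and subheadings from txt output")
    parser.add_argument("--section",
                        choices=["methods", "results", "discussion"],
                        help="extract only this section (txt output)")
    parser.add_argument("doi", help="DOI of the preprint to extract")
    parser.add_argument("format", choices=["pdf", "json", "txt"],
                        help="file format to download the preprint in")
    parser.add_argument("outfile", help="where the result is saved")
    return parser


# "10.1101/000000" is a placeholder DOI used only for illustration.
args = build_parser().parse_args(
    ["--section=methods", "10.1101/000000", "txt", "out.txt"]
)
print(args.section, args.format, args.outfile, args.noheader)
```

Declaring `format` and `--section` with `choices` means an unsupported value is rejected by the parser before any download is attempted.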
- `doi` is the DOI of the preprint to extract.
- `format` must be either `pdf`, `json`, or `txt`, and specifies the file format to download the preprint in.
  - `pdf` downloads the preprint in its raw PDF format.
  - `json` uses full-text HTML if available, otherwise an error is thrown. The main sections of the paper are labeled in a `json` list, and all subheaders are removed.
  - `txt` uses full-text HTML if available, otherwise an error is thrown. Using `txt` will by default extract the paper's entire text, excluding references. A specific section can be specified using `--section`, i.e. `--section=methods`, `--section=results`, or `--section=discussion`. All headings and subheadings will be included in the `txt` by default, but can be disabled using `--noheader` (recommended if your tool tokenizes on sentences, because rule-based tokenizers will often combine the header and the subsequent sentence, which can produce incorrect sentences).
- `outfile` is where the result is saved.
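The recommendation to use `--noheader` with sentence tokenizers can be illustrated with a small sketch. The splitter below is a deliberately naive rule-based tokenizer written for this example (it is not part of `biorxiv.py`), and the sample text is invented. It splits only on whitespace that follows sentence-ending punctuation, so a heading with no final period is fused with the sentence after it.

```python
import re


def naive_sentences(text):
    """Split on whitespace that follows '.', '!' or '?'.

    A stand-in for a simple rule-based sentence tokenizer; real
    tokenizers are more elaborate but often share this weakness.
    """
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Invented sample text: the same passage with and without its heading.
with_header = "Methods\nWe collected samples. They were sequenced."
without_header = "We collected samples. They were sequenced."

print(naive_sentences(with_header))
# ['Methods\nWe collected samples.', 'They were sequenced.']
print(naive_sentences(without_header))
# ['We collected samples.', 'They were sequenced.']
```

With the heading present, the first "sentence" begins with the word `Methods`, which is the kind of incorrect sentence the note above warns about.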
- `beautifulsoup4 >= 4.9.1`, install with `pip install beautifulsoup4`
- `requests >= 2.22.0`, install with `pip install requests`
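To confirm that an environment meets these minimums before running the script, a check along the following lines may help. It is a rough sketch using the standard-library `importlib.metadata` (Python 3.8 or later). The helper names are hypothetical, and the version comparison only looks at leading numeric components, so pre-release suffixes are ignored.

```python
import re
from importlib.metadata import PackageNotFoundError, version

# Minimum versions as listed in the requirements above.
MINIMUMS = {"beautifulsoup4": "4.9.1", "requests": "2.22.0"}


def as_tuple(v):
    """Turn '4.9.1' into (4, 9, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in v.split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = as_tuple(installed), as_tuple(minimum)
    n = max(len(a), len(b))
    a += (0,) * (n - len(a))
    b += (0,) * (n - len(b))
    return a >= b


def check_dependencies(minimums=MINIMUMS):
    """Map each package to True/False, or None if it is not installed."""
    report = {}
    for name, minimum in minimums.items():
        try:
            report[name] = meets_minimum(version(name), minimum)
        except PackageNotFoundError:
            report[name] = None
    return report


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        status = "missing" if ok is None else ("ok" if ok else "too old")
        print(f"{name}: {status}")
```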