A web spider that searches NCBI PubMed Central® (PMC) for supplementary materials (SuppMat) containing specific keywords and downloads them.
input: keywords to match against paper titles and abstracts, plus keywords to search for inside the supplementary materials
output: all related supplementary materials
git clone https://github.com/GrayXu/NCBI-SuppMat-Spider.git
pip install -r requirements.txt
python3 main.py
- Use pip to install dependencies.
- Configure main.py, then run it directly.
note: check the installed xlrd version, otherwise xlsx files cannot be handled (xlrd 2.0 and later read only xls files)
features
- progress bar
- optional keywords for searching in files
- output coordinates of keywords in xls and xlsx files
- create soft links to related suppmats
- optionally keep unrelated files as cache
- optional case sensitivity
- more account keys and proxy IPs to speed up searching (urgent: once the searcher is scaled to the millions level, the waiting time will be a disaster)
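The keyword search with optional case sensitivity and coordinate output can be sketched as below. This is a minimal illustration, not the project's code: `find_keyword_cells` is a hypothetical helper, the sheet is assumed to be already loaded as a list of rows, and the coordinates are zero-based (row, column) pairs, which may differ from what the spider actually writes.

```python
from typing import Iterable, List, Sequence, Tuple


def find_keyword_cells(
    rows: Sequence[Sequence[object]],
    keywords: Iterable[str],
    case_sensitive: bool = False,
) -> List[Tuple[int, int, str]]:
    """Return (row, column, keyword) for every cell that contains a keyword."""
    # Normalize the keywords once so the inner loop only does substring checks.
    needles = [(kw, kw if case_sensitive else kw.lower()) for kw in keywords]
    hits = []
    for r, row in enumerate(rows):
        for c, cell in enumerate(row):
            text = str(cell)
            if not case_sensitive:
                text = text.lower()
            for original, needle in needles:
                if needle in text:
                    hits.append((r, c, original))
    return hits


# Hypothetical sheet content, standing in for rows read from an xls/xlsx file.
sheet = [
    ["Gene", "Score"],
    ["BRCA1", 0.93],
    ["tp53 variant", 0.41],
]
print(find_keyword_cells(sheet, ["TP53"]))                       # [(2, 0, 'TP53')]
print(find_keyword_cells(sheet, ["TP53"], case_sensitive=True))  # []
```

Lowercasing both sides is the simplest way to make case sensitivity optional; a regex with `re.IGNORECASE` would work as well.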
support more formats
- csv, txt, tsv, html, xml
- xls, xlsx
- zip
- doc, docx
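Because a file's suffix does not always match its content, one possible approach is to choose the parser from the first bytes of the file instead. The sketch below is an assumption about how that could look, not the spider's current behavior: `sniff_format` and its return labels are hypothetical, and the csv/tsv guess is a crude heuristic based on the first line only.

```python
import os
import tempfile

OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy xls/doc container
ZIP_MAGIC = b"PK\x03\x04"  # xlsx, docx and plain zip archives


def sniff_format(path: str) -> str:
    """Guess a coarse file format from content instead of the suffix."""
    with open(path, "rb") as fh:
        head = fh.read(2048)
    if not head:
        return "unknown"
    if head.startswith(OLE2_MAGIC):
        return "ole2"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if b"\x00" in head:
        # NUL bytes almost never occur in delimited text files.
        return "unknown"
    first_line = head.decode("utf-8", errors="replace").splitlines()[0]
    if "\t" in first_line:
        return "tsv"
    if "," in first_line:
        return "csv"
    return "text"


# A csv file saved with an .xls suffix is still recognized as csv.
with tempfile.TemporaryDirectory() as tmp:
    mislabeled = os.path.join(tmp, "table_s1.xls")
    with open(mislabeled, "w", encoding="utf-8") as fh:
        fh.write("gene,score\nBRCA1,0.93\n")
    print(sniff_format(mislabeled))  # csv
```

A result of "zip" still needs a second look inside the archive to tell xlsx, docx and ordinary zip files apart.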
some trivial bugs
- can't handle csv or tsv files with a wrong suffix (e.g. a *.xls file that is actually in csv format, which is a bug in the NCBI database)
- the download and check progress bars depend on the number of files instead of the size of the files, so the estimated time from `tqdm` is not stable (hard to fix)
- downloading and parsing take time, so the thread pool size can actually be larger than the request limits
- if you need to use a proxy, edit the `proxies` variable at the top of searcher.py
- one IP is allowed to send 3 requests per second to NCBI PMC. If you register an account and use its api_key, the limit increases to 10.
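The request limits above (3 per second per IP, 10 with an api_key) could be enforced across worker threads with a small shared limiter. This is only a sketch under assumptions: `RateLimiter` and `requests_per_second` are hypothetical names that do not come from searcher.py, and the limiter spaces requests evenly rather than allowing bursts.

```python
import threading
import time


class RateLimiter:
    """Space calls at least 1/rate seconds apart across all threads."""

    def __init__(self, rate: float):
        self._interval = 1.0 / rate
        self._lock = threading.Lock()
        self._next_slot = time.monotonic()

    def wait(self) -> None:
        # Reserve the next free slot under the lock, then sleep outside it
        # so other threads can reserve their own slots in the meantime.
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_slot)
            self._next_slot = slot + self._interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def requests_per_second(api_key=None) -> int:
    """Limits as stated in this README: 3 without a key, 10 with one."""
    return 10 if api_key else 3


limiter = RateLimiter(requests_per_second())
start = time.monotonic()
for _ in range(4):
    limiter.wait()  # call this right before each HTTP request
elapsed = time.monotonic() - start
print(f"4 requests took {elapsed:.2f}s")
```

Since every thread waits on the same limiter before sending, the thread pool can be larger than the limit without exceeding it; the extra threads spend their time downloading and parsing.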
...