This is a simple script for downloading figures from open access papers in the PubMed Central (PMC) database. It uses the PMC API to search for articles, then uses the PMC Open Access Web Service API to fetch the XML of each paper, from which it extracts the figure information and image URLs.
Install the dependencies with pip inside a virtual environment:
python -m venv env
source env/bin/activate
pip install -r requirements.txt
See example.ipynb for a Jupyter notebook example.
query = '"Nature Genetics"[Journal] AND "open access"[filter]'
# Returns a list of paper IDs
result_ids = search_pmc(query, email="your-email-here@gmail.com", max_results=10)
# Alternatively, start from a list of PMC IDs you already have
result_ids = ["10937393", "10864173"]
# A dataframe with figure information
figure_data = extract_pmc_figures(result_ids)
# Save the dataframe to a parquet file if you want to use it later
figure_data.write_parquet("figure_data.parquet")
The figure_data dataframe looks like this:
┌──────────┬────────┬───────────┬──────────────────────┬─────────────────────┬─────────────────────┐
│ pmcid    ┆ fig_id ┆ fig_label ┆ fig_title            ┆ fig_desc            ┆ image_url           │
│ ---      ┆ ---    ┆ ---       ┆ ---                  ┆ ---                 ┆ ---                 │
│ str      ┆ str    ┆ str       ┆ str                  ┆ str                 ┆ str                 │
╞══════════╪════════╪═══════════╪══════════════════════╪═════════════════════╪═════════════════════╡
│ 10937393 ┆ Fig1   ┆ Fig. 1    ┆ FANS-based isolation ┆ a, Schematic        ┆ https://www.ncbi.nl │
│          ┆        ┆           ┆ of nuclei o…         ┆ representation of   ┆ m.nih.gov/pmc…      │
│          ┆        ┆           ┆                      ┆ t…                  ┆                     │
│ 10937393 ┆ Fig2   ┆ Fig. 2    ┆ Purity and           ┆ a, Heatmaps depict  ┆ https://www.ncbi.nl │
│          ┆        ┆           ┆ reproducibility of   ┆ log2-transfor…      ┆ m.nih.gov/pmc…      │
│          ┆        ┆           ┆ th…                  ┆                     ┆                     │
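To illustrate where columns such as fig_id, fig_label, fig_title and fig_desc can come from, here is a minimal sketch that parses `<fig>` elements out of a JATS-style XML string using only the standard library. It is not the script's actual implementation: the sample XML, the `parse_figures` and `text_of` helpers, and the `graphic_href` key are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xlink:href attribute used on <graphic> elements.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hand-written sample for illustration only, not real PMC output.
SAMPLE_XML = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="Fig1">
      <label>Fig. 1</label>
      <caption>
        <title>Example title</title>
        <p>a, Example description.</p>
      </caption>
      <graphic xlink:href="example_Fig1_HTML"/>
    </fig>
  </body>
</article>"""


def text_of(element):
    """Return all text inside an element, or "" if the element is missing."""
    if element is None:
        return ""
    return "".join(element.itertext()).strip()


def parse_figures(xml_text):
    """Collect one dict per <fig> element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rows = []
    for fig in root.iter("fig"):
        graphic = fig.find("graphic")
        rows.append({
            "fig_id": fig.get("id", ""),
            "fig_label": text_of(fig.find("label")),
            "fig_title": text_of(fig.find("caption/title")),
            "fig_desc": " ".join(text_of(p) for p in fig.findall("caption/p")),
            "graphic_href": graphic.get(XLINK_HREF, "") if graphic is not None else "",
        })
    return rows


figures = parse_figures(SAMPLE_XML)
print(figures)
```

How the href of a graphic is turned into a full image_url is left out here, since that mapping is specific to the script.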
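import polars as pl

# Download every figure image into output_dir
output_dir = "img"
download_status = download_imgs(figure_data, output_dir)
# Show the downloads that did not return HTTP 200
print("Failed downloads:")
print(download_status.filter(pl.col("status") != 200))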
The download_status dataframe will look like this:
┌──────────┬────────┬────────┐
│ pmcid    ┆ fig_id ┆ status │
│ ---      ┆ ---    ┆ ---    │
│ str      ┆ str    ┆ i64    │
╞══════════╪════════╪════════╡
│ 10937393 ┆ Fig1   ┆ 200    │
│ 10937393 ┆ Fig2   ┆ 200    │
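As a rough picture of how a status table like this could be produced, the sketch below saves each image and records the HTTP status per figure. It is an assumption-laden illustration, not the code behind download_imgs: `download_all`, `fake_fetch`, the example.org URLs, and the `<pmcid>_<fig_id>.jpg` file naming are all hypothetical, and the fetcher is faked so the example runs offline.

```python
import tempfile
from pathlib import Path


def download_all(rows, output_dir, fetch):
    """Save each figure and record the HTTP status returned for it.

    `fetch` is a hypothetical callable: url -> (status_code, content_bytes).
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    statuses = []
    for row in rows:
        status, content = fetch(row["image_url"])
        if status == 200:
            # The file name and .jpg extension are illustrative choices.
            (out / f"{row['pmcid']}_{row['fig_id']}.jpg").write_bytes(content)
        statuses.append(
            {"pmcid": row["pmcid"], "fig_id": row["fig_id"], "status": status}
        )
    return statuses


def fake_fetch(url):
    """Stand-in for a real HTTP client so the sketch needs no network."""
    return (404, b"") if url.endswith("missing") else (200, b"image-bytes")


rows = [
    {"pmcid": "10937393", "fig_id": "Fig1", "image_url": "https://example.org/Fig1"},
    {"pmcid": "10937393", "fig_id": "Fig2", "image_url": "https://example.org/missing"},
]

with tempfile.TemporaryDirectory() as tmp:
    statuses = download_all(rows, tmp, fake_fetch)
    saved = sorted(p.name for p in Path(tmp).iterdir())

# Same idea as filtering the dataframe on status != 200.
failed = [s for s in statuses if s["status"] != 200]
print(failed)
```

Recording a status for every row, including failures, is what makes it possible to filter for failed downloads afterwards and retry only those.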