
Extract figures and figure captions from PMC open access papers


PMC Figure Downloader

This is a simple script that downloads figures from open access papers in the PubMed Central (PMC) database. It searches for articles with the PMC API, then fetches the XML of each paper through the PMC Open Access Web Service API to extract the figure information and image URLs.
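To make the XML step concrete, here is a minimal sketch of how figure fields could be read from article XML. It assumes the XML follows the JATS layout, where each `<fig>` element holds a `<label>`, a `<caption>` with `<title>` and `<p>`, and a `<graphic>` with an `xlink:href` attribute. The helper names `parse_figures` and `_text` and the sample XML are hypothetical; this is not necessarily how `extract_pmc_figures` is implemented.

```python
import xml.etree.ElementTree as ET

# XLink namespace that JATS uses for the href attribute on <graphic>.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Made-up sample in the shape of a JATS <fig> element (not from a real paper).
SAMPLE_XML = """
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <body>
    <fig id="Fig1">
      <label>Fig. 1</label>
      <caption>
        <title>Example title.</title>
        <p>a, Example description.</p>
      </caption>
      <graphic xlink:href="example_Fig1_HTML"/>
    </fig>
  </body>
</article>
"""


def _text(element):
    """Return the whitespace-normalized text of an element, or "" if it is missing."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def parse_figures(xml_text):
    """Hypothetical helper: collect one dict per <fig> element in the XML."""
    root = ET.fromstring(xml_text.strip())
    figures = []
    for fig in root.iter("fig"):
        graphic = fig.find("graphic")
        figures.append({
            "fig_id": fig.get("id", ""),
            "fig_label": _text(fig.find("label")),
            "fig_title": _text(fig.find("caption/title")),
            "fig_desc": _text(fig.find("caption/p")),
            "graphic_href": graphic.get(XLINK_HREF, "") if graphic is not None else "",
        })
    return figures


figures = parse_figures(SAMPLE_XML)
print(figures[0])
```

The `graphic_href` value is only the reference stored in the XML; turning it into a full image URL is left out here because that mapping depends on how PMC serves the files.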

Installation

Create a virtual environment and install the dependencies with pip:

python -m venv env
source env/bin/activate
pip install -r requirements.txt

Usage

See example.ipynb for a Jupyter notebook example.

Query PMC for open access papers

query = '"Nature Genetics"[Journal] AND "open access"[filter]'
# Returns a list of paper IDs
result_ids = search_pmc(query, email="your-email-here@gmail.com", max_results=10)
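For readers curious what such a search might look like at the HTTP level, the sketch below builds a request URL for the NCBI E-utilities ESearch endpoint against the `pmc` database. Whether `search_pmc` uses this endpoint internally is an assumption, and `build_esearch_url` is a hypothetical helper that only constructs the URL without sending a request.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(query, email, max_results=10):
    """Hypothetical helper: build an ESearch URL for the PMC database."""
    params = {
        "db": "pmc",            # search PubMed Central
        "term": query,          # the query string, URL-encoded by urlencode
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # ask for a JSON response
        "email": email,         # contact address sent along with the request
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


url = build_esearch_url(
    '"Nature Genetics"[Journal] AND "open access"[filter]',
    email="your-email-here@gmail.com",
)
print(url)
# Decode the query string again to check that the search term survived encoding.
print(parse_qs(urlparse(url).query)["term"][0])
```

With `retmode=json`, ESearch returns the matching IDs under `esearchresult.idlist` in the response body.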

Extract figure information and URLs from a list of paper IDs

# A list of PMC IDs
result_ids = ["10937393", "10864173"]
# A dataframe with figure information
figure_data = extract_pmc_figures(result_ids)
# Save the dataframe to a parquet file if you want to use it later
figure_data.write_parquet("figure_data.parquet")

The figure_data dataframe looks like this:

┌──────────┬────────┬───────────┬──────────────────────┬─────────────────────┬─────────────────────┐
│ pmcid    ┆ fig_id ┆ fig_label ┆ fig_title            ┆ fig_desc            ┆ image_url           │
│ ---      ┆ ---    ┆ ---       ┆ ---                  ┆ ---                 ┆ ---                 │
│ str      ┆ str    ┆ str       ┆ str                  ┆ str                 ┆ str                 │
╞══════════╪════════╪═══════════╪══════════════════════╪═════════════════════╪═════════════════════╡
│ 10937393 ┆ Fig1   ┆ Fig. 1    ┆ FANS-based isolation ┆ a, Schematic        ┆ https://www.ncbi.nl │
│          ┆        ┆           ┆ of nuclei o…         ┆ representation of   ┆ m.nih.gov/pmc…      │
│          ┆        ┆           ┆                      ┆ t…                  ┆                     │
│ 10937393 ┆ Fig2   ┆ Fig. 2    ┆ Purity and           ┆ a, Heatmaps depict  ┆ https://www.ncbi.nl │
│          ┆        ┆           ┆ reproducibility of   ┆ log2-transfor…      ┆ m.nih.gov/pmc…      │
│          ┆        ┆           ┆ th…                  ┆                     ┆                     │

Download figures to a directory

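import polars as pl

output_dir = "img"
# Download the figures; download_status holds one status code per figure
download_status = download_imgs(figure_data, output_dir)
print("Failed downloads:")
print(download_status.filter(pl.col("status") != 200))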

The download_status dataframe will look like this:

┌──────────┬────────┬────────┐
│ pmcid    ┆ fig_id ┆ status │
│ ---      ┆ ---    ┆ ---    │
│ str      ┆ str    ┆ i64    │
╞══════════╪════════╪════════╡
│ 10937393 ┆ Fig1   ┆ 200    │
│ 10937393 ┆ Fig2   ┆ 200    │
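
The status column can also be used outside of polars, for example to build a retry list. The sketch below works on plain dictionaries, such as those returned by `download_status.to_dicts()`. The helper `failed_figures` is hypothetical, and the sample rows (including the 404 entry) are made up for illustration.

```python
def failed_figures(status_rows):
    """Return (pmcid, fig_id) pairs whose status is not 200."""
    return [
        (row["pmcid"], row["fig_id"])
        for row in status_rows
        if row["status"] != 200
    ]


# Made-up rows; the 404 entry is invented to illustrate a failed download.
rows = [
    {"pmcid": "10937393", "fig_id": "Fig1", "status": 200},
    {"pmcid": "10937393", "fig_id": "Fig2", "status": 200},
    {"pmcid": "10864173", "fig_id": "Fig1", "status": 404},
]
print(failed_figures(rows))
```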