Pathway Figure OCR

The goal of this project is to extract identifiable genes, proteins and metabolites from published pathway figures. In addition to all the code for assembling and running the Pathway Figure OCR pipeline, this repo contains scripts specific to the QC, analysis and figure generation involved in our publications of the work. Here we document a few of the key files and folders relevant to each paper:

25 Years of Pathway Figures (Genome Biology 2020)
- Interactive search tool for 65k pathway figures and their gene content: shiny app and code
- NIH Figshare of identified pathway figures and OCR results as RDS datasets: collection
- UpSet plot of top text and figure genes: script
- Pie chart data for top disease terms for text and figure genes: script
- Overlap matrix for Hippo Signaling pathway figure genes: script
- Machine learning progression plots: script
- Local database name: pfocr20200224

Identifying Genes in Published Pathway Figure Images (BioRxiv 2018)
- Performance assessment figures: folder
- Local database name: pfocr2018121717

This work is supported by NIGMS grant R01GM100039.

Install

Warning: this project is still in development and is not ready for production or even dev releases by external teams. So, don't expect things to work without some troubleshooting :) Contact us via Issues if you're interested in contributing to the development. All our code is open source.

1. Install Nix.
2. Clone this repo: git clone https://github.com/wikipathways/pathway-figure-ocr.git
3. Enter the environment: cd pathway-figure-ocr && nix-shell
4. Launch Jupyter: jupyter lab (automatically opens the notebook interface in your browser)

Pipeline

The Jupyter Notebooks used to run the PFOCR pipeline are all in ./notebooks. Run them in the following order:

1. pfocr_fetch.R.ipynb: get a list of likely pathway figures.
2. get_figures.ipynb: download those figures.
3. gcv_automl.ipynb: use a machine learning model we trained earlier to distinguish pathway from non-pathway figures.
4. gcv_ocr.ipynb: run OCR on the figures classified as pathway figures.
5. get_lexicon.ipynb: get the lexicon. This notebook is unfinished, because we re-used the 20200224 lexicon for 20210515.
6. pp_classic.ipynb: extract genes. (pp_ahocorasick.ipynb is an alternative that should work even better once validated.)
7. pubtator.ipynb: extract chemicals and diseases via PubTator.
8. merge_2020_2021.ipynb: merge 20200224 and 20210515. It was written only for that merge and would need to be updated for any other. This notebook is also where we get the metadata for the papers.
9. bte_export.ipynb: export chemicals, diseases and genes for use in BioThings Explorer.
10. bte_export_csv_files.ipynb: export figure data as CSV files for use in BioThings Explorer.
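
The run order above can also be written down as a small script. The following is only a sketch, not something shipped in this repo: it assumes the notebooks can be executed non-interactively with jupyter nbconvert, which we have not verified (some steps may need credentials, manual input, or an R kernel). The names build_commands and run_pipeline are hypothetical helpers made up for illustration. By default it only prints the commands it would run.

```python
import shlex
import subprocess

# Notebook order copied from the list above; all are assumed to live in ./notebooks.
NOTEBOOKS = [
    "pfocr_fetch.R.ipynb",
    "get_figures.ipynb",
    "gcv_automl.ipynb",
    "gcv_ocr.ipynb",
    "get_lexicon.ipynb",
    "pp_classic.ipynb",
    "pubtator.ipynb",
    "merge_2020_2021.ipynb",
    "bte_export.ipynb",
    "bte_export_csv_files.ipynb",
]


def build_commands(notebooks=NOTEBOOKS, notebook_dir="notebooks"):
    """Return one `jupyter nbconvert --execute` command per notebook, in order."""
    return [
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace",
         f"{notebook_dir}/{name}"]
        for name in notebooks
    ]


def run_pipeline(dry_run=True):
    """Print each command; execute it too when dry_run is False."""
    for cmd in build_commands():
        print(shlex.join(cmd))
        if not dry_run:
            # check=True stops at the first failing notebook, so later
            # steps never run on incomplete output.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Since get_lexicon.ipynb is unfinished and merge_2020_2021.ipynb is specific to one merge, a real run would likely pass a shortened list to build_commands rather than the full default.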

Note that we used a database for 20200224 but not for 20210515. Any future runs or merges will probably not need to use the old database.

Internal Notes

xpm2nix

In ./xpm2nix, you'll find packages from external package manager(s) made available as Nix packages. xpm is just an abbreviation we made up to refer to any eXternal Package Manager.

For Python, we're using poetry2nix.

cd xpm2nix/python-modules

To add a package:

poetry add --lock jupytext

To update packages:

poetry update --lock

For JavaScript / Node.js, we're using node2nix.

cd xpm2nix/node-packages