
Extracting gene sets from published pathway figures

Primary language: Jupyter Notebook. License: Apache License 2.0 (Apache-2.0).

Pathway Figure OCR

The goal of this project is to extract identifiable genes, proteins and metabolites from published pathway figures. In addition to all the code for assembling and running the Pathway Figure OCR (PFOCR) pipeline, this repo contains scripts specific to the QC, analysis and figure generation involved in our publications of the work. Here we document a few of the key files and folders relevant to each paper.

This work is supported by NIGMS grant R01GM100039.

Install

Warning: this project is still in development and is not ready for production use, or even for dev releases by external teams, so don't expect things to work without some troubleshooting :) If you're interested in contributing to the development, contact us via Issues. All our code is open source.

  1. Install Nix
  2. Clone this repo: git clone https://github.com/wikipathways/pathway-figure-ocr.git
  3. Enter environment: cd pathway-figure-ocr && nix-shell
  4. Launch Jupyter: jupyter lab (automatically opens notebook in browser)
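
Since setup may need some troubleshooting, it can help to confirm the prerequisites before step 2. The following is a minimal sketch, not part of the repo: the helper name missing_tools is hypothetical, and it only checks that git and nix-shell are on PATH, not that the Nix environment actually builds.

```python
import shutil


def missing_tools(tools=("git", "nix-shell")):
    """Return the names in `tools` that cannot be found on PATH.

    Hypothetical helper for illustration; the tool names come from the
    install steps above (git clone, nix-shell).
    """
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("git and nix-shell found; continue with the install steps.")
```

If nix-shell is reported missing, step 1 (installing Nix) has not completed or the shell has not picked up the new PATH yet.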

Pipeline

The Jupyter Notebooks used to run the PFOCR pipeline are all in ./notebooks. Run them in the following order:

  1. pfocr_fetch.R.ipynb: get a list of likely pathway figures
  2. get_figures.ipynb: download those figures
  3. gcv_automl.ipynb: use a machine learning model we trained earlier to distinguish pathway vs. non-pathway figures
  4. gcv_ocr.ipynb: run OCR on the figures classified as pathway
  5. get_lexicon.ipynb: unfinished, because we re-used the 20200224 lexicon for the 20210515 run
  6. pp_classic.ipynb: extract genes (pp_ahocorasick.ipynb is an alternative that should work even better once validated)
  7. pubtator.ipynb: extract chemicals and diseases via PubTator
  8. merge_2020_2021.ipynb: merge the 20200224 and 20210515 results; it would need updating for any other merge. This notebook is also where we get the metadata for the papers.
  9. bte_export.ipynb: export chemicals, diseases and genes for use in BioThings Explorer
  10. bte_export_csv_files.ipynb: export figure data as CSV files for use in BioThings Explorer

Note that we used a database for 20200224 but not for 20210515. Any future runs or merges will probably not need to use the old database.
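
The notebook order listed above can be written down as data, which makes it easier to re-run a range of steps. The sketch below is an assumption, not how we ran the pipeline: it only builds jupyter nbconvert commands for headless execution and does not run them. The names PIPELINE, nbconvert_command and plan are hypothetical. Some notebooks may need a non-Python kernel, credentials or manual edits (steps 5 and 8 in particular), so treat the output as a checklist rather than a turnkey runner.

```python
# Notebook order as listed in the Pipeline section (steps 1 to 10).
PIPELINE = [
    "pfocr_fetch.R.ipynb",
    "get_figures.ipynb",
    "gcv_automl.ipynb",
    "gcv_ocr.ipynb",
    "get_lexicon.ipynb",
    "pp_classic.ipynb",
    "pubtator.ipynb",
    "merge_2020_2021.ipynb",
    "bte_export.ipynb",
    "bte_export_csv_files.ipynb",
]


def nbconvert_command(notebook, notebooks_dir="notebooks"):
    """Build the argv list that would execute one notebook in place."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute", "--inplace",
        f"{notebooks_dir}/{notebook}",
    ]


def plan(start=1, stop=len(PIPELINE)):
    """Return commands for steps start..stop (1-indexed, inclusive)."""
    if not 1 <= start <= stop <= len(PIPELINE):
        raise ValueError("start and stop must satisfy 1 <= start <= stop <= 10")
    return [nbconvert_command(nb) for nb in PIPELINE[start - 1:stop]]


if __name__ == "__main__":
    # Print the commands for review instead of running them.
    for step, cmd in enumerate(plan(), start=1):
        print(f"{step:2d}. {' '.join(cmd)}")
```

To actually execute a step, each command could be passed to subprocess.run(cmd, check=True) from the repo root inside nix-shell, so that a failing notebook stops the run before later steps consume stale output.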

Internal Notes

xpm2nix

In ./xpm2nix, you'll find packages from external package manager(s) made available as Nix packages. xpm is just an abbreviation we made up to refer to any eXternal Package Manager.

For Python, we're using poetry2nix.

cd xpm2nix/python-modules

To add a package:

poetry add --lock jupytext

To update packages:

poetry update --lock

For JavaScript / Node.js, we're using node2nix.

cd xpm2nix/node-packages

To add a package:

npm install --package-lock-only --save @arbennett/base16-gruvbox-dark

To update packages:

./update
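
The add and update commands above differ only in directory and tool, so they can be summarized in one place. This is a minimal sketch for reference, not a script from the repo: the function names add_package_command and update_packages_command are hypothetical, while the directories and commands are the ones given in these notes.

```python
# Directories and commands copied from the xpm2nix notes above.
PYTHON_DIR = "xpm2nix/python-modules"
NODE_DIR = "xpm2nix/node-packages"


def add_package_command(ecosystem, package):
    """Return (working_directory, argv) for adding `package`."""
    if ecosystem == "python":
        return (PYTHON_DIR, ["poetry", "add", "--lock", package])
    if ecosystem == "node":
        return (NODE_DIR,
                ["npm", "install", "--package-lock-only", "--save", package])
    raise ValueError(f"unknown ecosystem: {ecosystem!r}")


def update_packages_command(ecosystem):
    """Return (working_directory, argv) for updating all packages."""
    if ecosystem == "python":
        return (PYTHON_DIR, ["poetry", "update", "--lock"])
    if ecosystem == "node":
        return (NODE_DIR, ["./update"])
    raise ValueError(f"unknown ecosystem: {ecosystem!r}")


if __name__ == "__main__":
    cwd, argv = add_package_command("python", "jupytext")
    print(f"(cd {cwd} && {' '.join(argv)})")
```

Both add commands only touch the lock files (--lock for Poetry, --package-lock-only for npm) rather than installing into the working tree, which fits a setup where Nix builds the packages from those lock files.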