metaPro

Workflow for meta-proteomics analysis




[Workflow diagram]




About

The meta-proteomics workflow is an end-to-end pipeline for processing and analyzing MS/MS data to study proteomes, i.e., for protein identification and characterization.

We identify the active organisms/species in a metagenome corresponding to a wet-lab sample obtained from JGI after gene sequencing. Researchers at PNNL then culture these samples and prepare them for study as protein samples. A protein sample may contain a single protein or a complex mixture of proteins. The sample is then run through a mass spectrometry instrument to obtain a proprietary-format .RAW file. This file contains the MS/MS spectra, i.e., the mass analysis (mass-to-charge (m/z) ratios) for each peptide sequence identified in the sample.
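The workflow handles these spectra internally, but for intuition, here is a minimal sketch of what MS/MS data looks like once a .RAW file has been converted to an open format such as mzML (e.g., with msconvert or ThermoRawFileParser). It uses the third-party pyteomics library, which is not part of this repository.

```python
# Minimal sketch, NOT part of this workflow: inspect MS/MS spectra from an
# mzML file (a .RAW file converted with msconvert/ThermoRawFileParser).
# Requires: pip install pyteomics
from pyteomics import mzml

with mzml.read("sample.mzML") as spectra:
    for spectrum in spectra:
        if spectrum.get("ms level") == 2:          # MS/MS scans only
            mz = spectrum["m/z array"]             # mass-to-charge ratios
            intensity = spectrum["intensity array"]
            print(spectrum["id"], len(mz), "peaks")
            break                                  # show just the first scan
```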

How to run the workflow:

  • Python codebase:

    1. Prepare your input datasets, as described here.

      • Make your input storage/ folder visible to the workflow by providing its path in docker-compose.yml (see the sketch after this list). Notes:
        • ./storage/ is already configured, assuming you keep the inputs in the project directory itself.
        • Typically, a study (such as Stegen) has more than one dataset (RAW files with MS/MS spectra) and multiple FASTAs to search against. This information is required, and a sample is provided here.
    2. Configure the workflow as needed. Typically, we run it in one of the following ways (see the configuration sketch after this list):

      1. Fully tryptic with no modifications (recommended for large datasets such as Prosser Soil)
      2. Fully tryptic with modifications
      3. Partially tryptic with modifications (such as MetOx)
      4. Partially tryptic with no modifications. Note: Users need to tweak the configuration file. To reproduce the results achieved for the FICUS dataset studies (Hess, Stegen, Blanchard), we provide parameter files and a pre-configured env file that can be used to run the workflow.
    3. Make sure Docker and docker-compose are installed on your system.

    4. To run the workflow, from the project directory:

      1. make build_unified to start the services. (To take the containers down and remove volumes: docker-compose down -v.)
      2. make run_workflow to run the workflow. It creates a storage/results folder with all the necessary output files.
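For step 1 above, the volume mapping in docker-compose.yml might look like the following. This is an illustrative sketch: the service name and container-side path here are assumptions, so take the real ones from the repository's docker-compose.yml.

```yaml
# Illustrative sketch only -- check the repository's docker-compose.yml for
# the real service name and container-side path.
services:
  workflow:                       # hypothetical service name
    volumes:
      - ./storage:/app/storage    # host inputs -> path visible to the workflow
```

With a mapping like this, files you drop into ./storage/ on the host appear to the pipeline inside the container.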
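For step 2 above, the four run modes boil down to two knobs: cleavage specificity (fully vs. partially tryptic) and modification searching (e.g., MetOx). The variable names below are hypothetical placeholders for illustration only; the real keys live in the parameter files and pre-configured env file shipped with the repository.

```
# Hypothetical illustration of the two knobs behind the four run modes;
# use the provided parameter/env files for the real keys.
CLEAVAGE_SPECIFICITY=fully-tryptic   # or: partially-tryptic
DYNAMIC_MODS=none                    # or e.g.: MetOx
```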

  • WDL support codebase:

    1. Prepare your input.json: make prepare-your-input. Note: Users need to generate the input.json file based on the mapping of datasets (RAW) to annotations (.faa & .gff) and the actual file locations. A helper script is provided for this (see the sketch after this list).
    2. Run the WDL. You need:
      • an execution engine (tested with cromwell-66) to run the WDL
      • a Java runtime (tested with openjdk 12.0.1)
        1. With Docker support: make run_wdl
        2. With Shifter support, to run on Cori: make run_wdl_on_cori
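Prefer the helper script shipped with the repository; the sketch below only illustrates the idea of walking a storage/ tree and emitting the dataset-to-annotation mapping as input.json. All key names here are assumptions, so align them with the inputs the repository's WDL actually declares.

```python
# Illustrative sketch of building an input.json that maps each RAW dataset to
# its annotation files. The key names are ASSUMPTIONS -- use the repository's
# helper script / WDL input declarations for the real schema.
import json
from pathlib import Path

storage = Path("storage")
entries = []
for raw in sorted(storage.glob("**/*.raw")):
    stem = raw.stem
    faa = next(storage.glob(f"**/{stem}*.faa"), None)   # protein annotations
    gff = next(storage.glob(f"**/{stem}*.gff"), None)   # gene annotations
    if faa and gff:
        entries.append({
            "raw_file": str(raw),
            "annotation_faa": str(faa),
            "annotation_gff": str(gff),
        })

Path("input.json").write_text(json.dumps({"datasets": entries}, indent=2))
```

Once input.json exists, a typical Cromwell invocation looks like java -jar cromwell-66.jar run <workflow>.wdl --inputs input.json; the make targets above wrap the repository's actual commands.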

More about the workflow...

Documentation