A workflow for enhanced protein isoform detection through integration of long-read RNA-seq and mass spectrometry-based proteomics.

Primary language: Python. License: MIT.

This repository contains the complete software and documentation needed to execute the Long-Read-Proteogenomics workflow.

Digital Object Identifiers

For the Genome Biology Manuscript: Enhanced Protein Isoform Characterization through Long Read Proteogenomics.

DOI | Description
--- | ---
[DOI badge] | Contains the version of the repository used for execution and generation of data
[DOI badge] | Contains the input data from Jurkat samples and the reference data used in execution of the Long-Read-Proteogenomics workflow contained in this repository
[DOI badge] | Contains the output data from executing the Long-Read-Proteogenomics workflow using the Zenodo version of this repository
[DOI badge] | Contains the version of the analysis code and the code for generating the figures, which takes as input the output data from executing the Long-Read-Proteogenomics workflow version specified above
[DOI badge] | Contains the test data used with the GitHub Actions to ensure changes to this repository still execute and perform correctly

Sequence Read Archive (SRA) Project Reference | Description
--- | ---
PRJNA783347 | Long-Read RNA Sequencing Project for Jurkat Samples
PRJNA193719 | Short-Read RNA Sequencing Project for Jurkat Samples

Sheynkman-Lab/Long-Read-Proteogenomics

Updated: 2022 January 30

This is the repository for the Long-Read Proteogenomics workflow. Written in Nextflow, it is a modular workflow beneficial to both the transcriptomics and proteomics fields. Data from both long-read Iso-Seq sequencing with PacBio and mass spectrometry-based proteomics were used in the classification and analysis of protein isoforms expressed in Jurkat cells, as described in the publication Enhanced protein isoform characterization through long-read proteogenomics, which will be made public in Fall 2022.

The output data resulting from the execution of this workflow for the manuscript Enhanced Protein Isoform Characterization through Long Read Proteogenomics may be found here [insert Zenodo Reference here]. The analysis used to produce the figures for the manuscript may be found in the companion repository Long-Read Proteogenomics Analysis.

A goal in the biomedical field is to delineate the protein isoforms that are expressed and have pathophysiological relevance. Towards this end, new approaches are needed to detect protein isoforms in clinical samples. Mass spectrometry (MS) is the main methodology for protein detection; however, poor coverage and incompleteness of protein databases limit its utility for isoform-resolved analysis. Fortunately, long-read RNA-seq approaches from PacBio and Oxford Nanopore platforms offer opportunities to leverage full-length transcript data for proteomics.

We introduce enhanced protein isoform detection through integrative “long-read proteogenomics”. The core idea is to leverage long-read RNA-seq to generate a sample-specific database of full-length protein isoforms. We show that incorporating long-read data directly into the MS protein inference algorithms enables detection of hundreds of protein isoforms intractable to traditional MS. We also discover novel peptides that confirm translation of transcripts with retained introns and novel exons. Our workflow is available as an open-source Nextflow pipeline, and every component of the work is publicly available and immediately extendable.

Proteogenomics is providing new insights into cancer and other diseases. The proteogenomics field will continue to grow, and, paired with increases in long-read sequencing adoption, we envision use of customized proteomics workflows tailored to individual patients.

We acknowledge that the initial kernels of this work were formed during the Fall of 2020 at the Cold Spring Harbor Laboratory Biological Data Science Codeathon.

We acknowledge Lifebit and the use of their platform, Lifebit's CloudOS, which was key in the development of the open-source Nextflow workflow used in this work.

How to use this repository and Quick Start

This workflow is complex: it brings together two measurement technologies in a long-read proteogenomics approach that integrates sample-matched long-read RNA-seq and MS-based proteomics data to enhance isoform characterization. To orient the user to the steps involved in transforming raw measurement data into fully resolved, identified, and annotated results, we have developed this quick start and wiki documentation, including vignettes.

How to use this repository

This repository is organized into modules, and parts of it could be useful to different researchers for annotating their own raw data. The workflow is written in Nextflow, allowing it to be run on virtually any platform with alterations to the configurations and other adaptations. The visitor is encouraged to fork, clone, adapt, and contribute. All are encouraged to use GitHub Issues to communicate with the contributors to this open-source software project. Software additions, modifications, and contributions are made through GitHub Pull Requests.

Details of each module's processes are documented in the Wiki of this repository, along with links to the third-party resources used in this workflow.

Vignettes have been developed to go into greater detail, to walk the visitor through the visualization capabilities of the final annotated results, and to walk the visitor through the workflow presented here in the quick start.

Quick Start

The steps in this quick start were performed on a MacBook Pro running Big Sur Version 11.4 with 16 GB of 2667 MHz DDR4 RAM and a 2.3 GHz 8-Core Intel Core i9 processor.

The visitor is walked through installing the prerequisites, cloning the repository, and executing the workflow with the demonstration data that is also used in the GitHub Actions.

Obtain the Docker Desktop Application

In this quick start, the Docker Desktop application for the Mac with an Intel chip was used. Follow the instructions there to install.

Configure the Docker Desktop Application

On the MacBook Pro running Big Sur Version 11.4 with 16 GB of RAM, it was necessary to configure the Docker Desktop resources to use 6 GB of RAM.
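To confirm the memory setting took effect, the Docker CLI can report how much memory the daemon sees. The following is a minimal sketch, assuming the `docker` command is on the PATH; the helper name `bytes_to_mib` is hypothetical and not part of this repository. The reported value may be slightly below the configured amount because of overhead in the Docker virtual machine.

```shell
# Hypothetical helper: convert a byte count to whole MiB (integer division).
bytes_to_mib() {
  echo $(( $1 / 1048576 ))
}

# Query the Docker daemon only if the CLI is installed and the daemon responds.
if command -v docker >/dev/null 2>&1; then
  if mem_bytes=$(docker info --format '{{.MemTotal}}' 2>/dev/null) && [ -n "$mem_bytes" ]; then
    echo "Memory available to Docker: $(bytes_to_mib "$mem_bytes") MiB"
  else
    echo "Docker is installed but the daemon did not respond." >&2
  fi
else
  echo "docker not found on PATH" >&2
fi
```

A setting of 6 GB corresponds to roughly 6144 MiB in this output.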

Obtain and install miniconda

On the MacBook Pro, the 64-bit version of miniconda was downloaded and installed by following the installation instructions.

Create and activate a new conda environment lrp.

To begin, open a terminal window. Once the miniconda installation has completed, restart the terminal shell. On the Mac, this is done within a zsh shell environment.

exec -l zsh

To check whether you already have the environment, list your conda environments with the following command:

conda info --envs

If you haven't already created a conda environment for this work, create and activate it now.

conda create -n lrp
conda activate lrp

Install Nextflow.

Install and set the Nextflow version.

conda install -c bioconda nextflow -y
export NXF_VER=20.01.0
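Before cloning, it can help to confirm that the tools this quick start relies on are reachable from the shell. The following is a minimal sketch, assuming a POSIX-like shell; the helper name `missing_tools` is hypothetical and not part of this repository. Note that `export NXF_VER=...` applies only to the current shell session, so it needs to be set again in a new terminal.

```shell
# Hypothetical helper: print the name of each listed tool that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Tools used in this quick start.
missing=$(missing_tools docker conda nextflow git)
if [ -n "$missing" ]; then
  printf 'Missing tools:\n%s\n' "$missing" >&2
else
  # NXF_VER is the environment variable set in the step above.
  echo "All tools found; NXF_VER=${NXF_VER:-unset}"
fi
```

If nothing is reported as missing, `nextflow -version` can then be run to see which Nextflow version is in use.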

Clone this repository

With the environment ready, we can clone the repository.

git clone https://
.com/sheynkman-lab/Long-Read-Proteogenomics
cd Long-Read-Proteogenomics

Run the pipeline with the test_without_sqanti.config

This quick start uses the test_without_sqanti.config configuration file found in the conf directory of this repository.

nextflow run main.nf --config conf/test_without_sqanti.config 
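The launch command uses a relative path to the configuration file, so it needs to be run from the repository root. The following is a minimal sketch of a wrapper that checks for the configuration file before launching; `run_lrp_test` is a hypothetical helper name, not part of this repository, and the `nextflow run` line is the same command shown above.

```shell
# Hypothetical wrapper: refuse to launch when the configuration file is missing.
run_lrp_test() {
  config="${1:-conf/test_without_sqanti.config}"
  if [ ! -f "$config" ]; then
    echo "Config file not found: $config (run this from the repository root)" >&2
    return 1
  fi
  # Same command as in the quick start, with the checked path substituted.
  nextflow run main.nf --config "$config"
}
```

Calling `run_lrp_test` with no arguments from the repository root uses the default test configuration.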

For details regarding the processes and results produced, please see the Wiki and the Vignette: Workflow with test data.

To visualize results, please see the visualization capabilities of the final annotated results.

Documentation and Workflow Vignettes

Details about each of the processes that make up the sheynkman-lab/Long-Read-Proteogenomics pipeline are found in the Wiki. There you will find:

  1. Third-party tools
  2. Input parameters
  3. Output files
  4. Pipeline processes descriptions
  5. Vignette: Visualization
  6. Vignette: Workflow with test data

Workflow overview

The workflow accepts raw PacBio data as input and assembles a database of predicted protein isoforms that have a high probability of existing in the sample. MetaMorpheus then uses this database to search raw mass spectrometry data against the PacBio reference, and it uses protein isoform read counts during protein inference. Two other protein databases, one from UniProt and one from GENCODE, are employed for comparison. A series of Jupyter notebooks can be used to perform all final comparisons and data analysis.

[Figure: LRP Pipeline_v2 workflow diagram]

Using Zenodo

To make the data more accessible and FAIR, the indexed files were transferred from the University of Virginia's Gloria Sheynkman Lab Amazon S3 buckets to Zenodo using zenodo-upload.

Using Nextflow, configuration items can access locations in Google Cloud Platform (GCP) buckets (gs://), Amazon Web Services (AWS) buckets (s3://) and Zenodo locations (https://) seamlessly.

The main reasons for choosing Zenodo over AWS S3 or GCP GS buckets are:

  1. Data versioning (of primary importance): In S3 or GS buckets, data can be overwritten at the same path at any point, possibly breaking the pipeline.
  2. Cost: These datasets are tiny, but the principle stands: the less storage, the better.
  3. Access: Most users of the pipeline can access Zenodo most easily and will be able to use the data, whereas AWS and GCP have entry barriers.

Details on how these data were transferred from AWS S3 buckets are described in the AWS to Zenodo documentation.

Contributors

This is a joint project between the Sheynkman Lab, the Smith Lab, Lifebit and Science and Technology Consulting, LLC.

Repository template

This pipeline was generated using a modification of the nf-core template. You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x. ReadCube: Full Access Link