Long Read Proteogenomics Pipeline Learning & Troubleshooting

Learning the Sheynkman Lab LRP pipeline & troubleshooting automation. It is VERY ACTIVELY being modified. If you are using this as a guide, please contact Emily Watts (watts.emily.f@virginia.edu) for assistance.

This repository is for me to house original files & scripts

I also want to add my own scripts & modify original scripts to reflect any updates that have happened since it was written.
I have organized the modules with numbers indicating the order in which to run them. Modules that can be run at the same stage have the same numbers.

Make file structure in your working directory to make this pipeline run easily

The generic scripts in this repository assume that your directory is organized in this manner and that your raw data is in your working directory in a folder called 00_input_data

mkdir ./00_environments/
mkdir ./00_input_data/
mkdir ./00_scripts/
mkdir ./01_isoseq/
mkdir ./01_isoseq/01_filter/
mkdir ./01_isoseq/02_lima/
mkdir ./01_isoseq/03_refine/
mkdir ./01_isoseq/04_cluster/
mkdir ./01_isoseq/05_align/
mkdir ./01_isoseq/06_collapse/
mkdir ./01_reference_tables/
mkdir ./02_make_gencode_database/
mkdir ./02_sqanti/
mkdir ./03_filter_sqanti/
mkdir ./04_CPAT/
mkdir ./04_six_frame_translation/
mkdir ./04_transcriptome_summary/
mkdir ./05_orf_calling/
mkdir ./06_refine_orf_database/
mkdir ./07_accession_mapping/
mkdir ./07_make_cds_gtf/
mkdir ./08_rename_cds_to_exon/
mkdir ./09_sqanti_protein/
mkdir ./10_5p_utr/
mkdir ./11_protein_classification/
mkdir ./12_protein_gene_rename/
mkdir ./13_protein_filter/
mkdir ./14_protein_hybrid_database/
mkdir ./15_MS_file_convert/
mkdir ./16_MetaMorpheus/
mkdir ./16_MetaMorpheus/gencode/
mkdir ./16_MetaMorpheus/hybrid/
mkdir ./16_MetaMorpheus/filtered/
mkdir ./16_MetaMorpheus/refined/
mkdir ./17_peptide_analysis/
mkdir ./17_track_visualization/
mkdir ./17_protein_group_comparison/
mkdir ./17_novel_peptides/

Load modules and environment

Each module lists the required modules and either has a .yml file to create the environment needed (eventually all will have these) or instructs you on how to create the environment.

Input files for running this pipeline

raw_reads.ccs.bam from your PacBio data
primers.fasta from your PacBio data
from Gencode:
- gencode_gtf - Comprehensive gene annotation (regions: CHR) gencode.v38.annotation.gtf
- gencode_transcript_fasta - Protein-coding transcript sequences (regions: CHR) gencode.v38_pc_transcripts.fa
- gencode_translation_fasta - Protein-coding transcript translation sequences (regions: CHR) gencode.v38_pc_translations.fa
- genome_fasta - Genome sequence, primary assembly (GRCh38) (regions: PRI) GRCh38.primary_assembly.genome.fa
Human_Hexamer.tsv reference file
Human_logitModel.RData reference file
Optional: kallisto.tsv from your data
Optional (for Modules 15-17): MS search files.raw
Optional (for Modules 16-17): UniProt reviewed.fasta from UniProt database

efwatts/LRP_Troubleshooting

Long Read Proteogenomics Pipeline Learning & Troubleshooting

This repository is for me to house original files & scripts

Make file structure in your working directory to make this pipeline run easily

Load modules and environment

Input files for running this pipeline