This repository documents my work learning the Sheynkman Lab Long Read Proteogenomics (LRP) pipeline and troubleshooting its automation. It is VERY ACTIVELY being modified. If you are using this as a guide, please contact Emily Watts (watts.emily.f@virginia.edu) for assistance.
I also plan to add my own scripts and modify the original scripts to reflect updates made since the pipeline was written.
I have organized the modules with numbers indicating the order in which to run them. Modules that can be run at the same stage share the same number.
The generic scripts in this repository assume that your working directory is organized as shown below and that your raw data is in a folder called 00_input_data inside it:
mkdir ./00_environments/
mkdir ./00_input_data/
mkdir ./00_scripts/
mkdir ./01_isoseq/
mkdir ./01_isoseq/01_filter/
mkdir ./01_isoseq/02_lima/
mkdir ./01_isoseq/03_refine/
mkdir ./01_isoseq/04_cluster/
mkdir ./01_isoseq/05_align/
mkdir ./01_isoseq/06_collapse/
mkdir ./01_reference_tables/
mkdir ./02_make_gencode_database/
mkdir ./02_sqanti/
mkdir ./03_filter_sqanti/
mkdir ./04_CPAT/
mkdir ./04_six_frame_translation/
mkdir ./04_transcriptome_summary/
mkdir ./05_orf_calling/
mkdir ./06_refine_orf_database/
mkdir ./07_accession_mapping/
mkdir ./07_make_cds_gtf/
mkdir ./08_rename_cds_to_exon/
mkdir ./09_sqanti_protein/
mkdir ./10_5p_utr/
mkdir ./11_protein_classification/
mkdir ./12_protein_gene_rename/
mkdir ./13_protein_filter/
mkdir ./14_protein_hybrid_database/
mkdir ./15_MS_file_convert/
mkdir ./16_MetaMorpheus/
mkdir ./16_MetaMorpheus/gencode/
mkdir ./16_MetaMorpheus/hybrid/
mkdir ./16_MetaMorpheus/filtered/
mkdir ./16_MetaMorpheus/refined/
mkdir ./17_peptide_analysis/
mkdir ./17_track_visualization/
mkdir ./17_protein_group_comparison/
mkdir ./17_novel_peptides/
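The directory layout above can also be created in one re-runnable step. The following is a minimal sketch, not a script from this repository: `make_lrp_dirs` is a hypothetical helper name, and it relies on `mkdir -p`, which creates parent directories as needed and does not fail if a directory already exists.

```shell
#!/bin/sh
# make_lrp_dirs is a hypothetical helper, not part of this repository.
# It creates the directory layout listed above under a base directory
# (default: the current directory).
make_lrp_dirs() {
    base="${1:-.}"
    # mkdir -p creates missing parents (01_isoseq, 16_MetaMorpheus) and
    # succeeds if a directory already exists, so re-running is safe.
    while IFS= read -r dir; do
        [ -n "$dir" ] && mkdir -p "$base/$dir"
    done <<'EOF'
00_environments
00_input_data
00_scripts
01_isoseq/01_filter
01_isoseq/02_lima
01_isoseq/03_refine
01_isoseq/04_cluster
01_isoseq/05_align
01_isoseq/06_collapse
01_reference_tables
02_make_gencode_database
02_sqanti
03_filter_sqanti
04_CPAT
04_six_frame_translation
04_transcriptome_summary
05_orf_calling
06_refine_orf_database
07_accession_mapping
07_make_cds_gtf
08_rename_cds_to_exon
09_sqanti_protein
10_5p_utr
11_protein_classification
12_protein_gene_rename
13_protein_filter
14_protein_hybrid_database
15_MS_file_convert
16_MetaMorpheus/gencode
16_MetaMorpheus/hybrid
16_MetaMorpheus/filtered
16_MetaMorpheus/refined
17_peptide_analysis
17_track_visualization
17_protein_group_comparison
17_novel_peptides
EOF
}
```

Calling `make_lrp_dirs .` from your working directory should produce the same folders as the individual mkdir commands.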
Each module lists the modules it requires and either provides a .yml file for creating the needed environment (eventually all will have these) or explains how to create the environment.
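As a rough illustration of the .yml route, the sketch below prints one environment-creation command per .yml file. It makes two assumptions that are not stated above: that the .yml files are conda environment files, and that they are stored in 00_environments. `print_env_commands` is a hypothetical helper name; it only prints commands so you can review them before running anything.

```shell
#!/bin/sh
# print_env_commands is a hypothetical helper, not part of this repository.
# Assumptions: the .yml files are conda environment files and live in
# 00_environments (or in the directory passed as the first argument).
print_env_commands() {
    env_dir="${1:-./00_environments}"
    for yml in "$env_dir"/*.yml; do
        # If no .yml file exists, the glob pattern is left unexpanded; skip it.
        [ -e "$yml" ] || continue
        # Print the command rather than running it.
        echo "conda env create -f $yml"
    done
}
```

Running `print_env_commands` from the working directory lists the commands; copy the one for the module you are about to run.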
The pipeline uses the following input files:
- raw_reads.ccs.bam from your PacBio data
- primers.fasta from your PacBio data
- from Gencode:
  - gencode_gtf - Comprehensive gene annotation (regions: CHR): gencode.v38.annotation.gtf
  - gencode_transcript_fasta - Protein-coding transcript sequences (regions: CHR): gencode.v38_pc_transcripts.fa
  - gencode_translation_fasta - Protein-coding transcript translation sequences (regions: CHR): gencode.v38_pc_translations.fa
  - genome_fasta - Genome sequence, primary assembly (GRCh38) (regions: PRI): GRCh38.primary_assembly.genome.fa
- Human_Hexamer.tsv reference file
- Human_logitModel.RData reference file
- Optional: kallisto.tsv from your data
- Optional (for Modules 15-17): MS search files.raw
- Optional (for Modules 16-17): UniProt reviewed.fasta from UniProt database
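Before starting, it can help to confirm that the non-optional files in the list above are in place. The sketch below is one way to do that, under the assumption that every file sits directly in 00_input_data under exactly the names listed (the reference files may live elsewhere in your setup, so adjust as needed). `check_lrp_inputs` is a hypothetical helper name, not a script in this repository.

```shell
#!/bin/sh
# check_lrp_inputs is a hypothetical helper, not part of this repository.
# Assumption: all input files sit directly in the input directory
# (default: ./00_input_data) under the names used in the list above.
check_lrp_inputs() {
    input_dir="${1:-./00_input_data}"
    missing=0
    for f in \
        raw_reads.ccs.bam \
        primers.fasta \
        gencode.v38.annotation.gtf \
        gencode.v38_pc_transcripts.fa \
        gencode.v38_pc_translations.fa \
        GRCh38.primary_assembly.genome.fa \
        Human_Hexamer.tsv \
        Human_logitModel.RData
    do
        # -s is true only if the file exists and is not empty.
        if [ ! -s "$input_dir/$f" ]; then
            echo "missing or empty: $input_dir/$f" >&2
            missing=$((missing + 1))
        fi
    done
    # Optional input: report it, but do not count it as missing.
    [ -e "$input_dir/kallisto.tsv" ] || echo "optional file not found: kallisto.tsv" >&2
    # Exit status is 0 only when every required file was found.
    [ "$missing" -eq 0 ]
}
```

For example, `check_lrp_inputs ./00_input_data && echo "inputs OK"` prints the confirmation only when all eight required files are present and non-empty.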