Riboseq_Pipeline

This is the lab's standard Ribo-seq processing pipeline: a Snakemake workflow and some R scripts, including ORFquant and RiboseQC.

  • Make sure you have Miniconda3 working on the server

  • On the max cluster, get yourself a node with, say, 8 cores and 10 GB of memory each:

```bash
qrsh -V -now no -pe smp 8 -l m_mem_free=10G
```
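Once the session starts, you can optionally sanity-check the allocation (assuming a standard SGE setup, where $NSLOTS holds the number of granted slots):

```bash
echo $NSLOTS   # slots granted by the scheduler - should print 8
free -h        # memory available on the node
```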

  • Clone the Riboseq pipeline into a new folder:

```bash
cd /fast/AG_Ohler/${USER}
git clone https://github.com/ohlerlab/Riboseq_Pipeline/ Pipeline_Demo
cd Pipeline_Demo
```

  • Install the base conda environment if it's not already installed, and then activate it:

```bash
conda env create -f ribopipebase.yml -n ribopipe
conda activate ribopipe
```
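To confirm the environment is active (an optional check; exact tool versions depend on ribopipebase.yml):

```bash
which snakemake      # should resolve inside the ribopipe environment
snakemake --version
```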

  • Install our lab's packages RiboseQC, ORFquant, and Ribostan. By default, the pipeline will look for a folder above the project folder called Applications, so create this folder (in e.g. /fast/AG_Ohler/dharnet/Applications) and populate it like so:

```bash
mkdir Applications
git clone https://github.com/ohlerlab/RiboseQC.git Applications/RiboseQC
git clone https://github.com/ohlerlab/ORFquant.git Applications/ORFquant
git clone https://github.com/zslastman/Ribostan.git Applications/Ribostan
```

  • Create a branch for this project:

```bash
git branch Pipeline_Demo
git checkout Pipeline_Demo
```

  • Edit src/read_files.csv to point it to your fastq files (zipped or not, either is fine). You can do this by hand, or just use a bash loop like:

```bash
echo "sample_id,file_id,mate,pair_id" > src/read_files.csv
for fastq in /fast/dharnet/Ebp1_Riboseq/input/*.fastq.gz; do
    echo "$(basename ${fastq%.fastq.gz}),${fastq}"
done >> src/read_files.csv
```
read_files.csv layout:

| column    | description |
|-----------|-------------|
| sample_id | The id which links fastq files to a biological sample. Paired-end mates, as well as technical replicates (resequencing), will have the same sample_id, and e.g. fastqc will combine them. |
| file_id   | The file path. |
| mate      | Identifies which end a file holds, for paired-end reads (1 or 2). |
| pair_id   | Groups paired-end files (both mates of a pair get the identical number). |
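For illustration, here is what read_files.csv might look like with one single-end Ribo-seq sample and one paired-end RNA-seq sample. Sample names and paths are made up, and leaving mate/pair_id empty for single-end files is an assumption based on the loop above, which only writes the first two columns:

```
sample_id,file_id,mate,pair_id
ribo_wt_1,/fast/AG_Ohler/myuser/input/ribo_wt_1.fastq.gz,,
rna_wt_1,/fast/AG_Ohler/myuser/input/rna_wt_1_R1.fastq.gz,1,1
rna_wt_1,/fast/AG_Ohler/myuser/input/rna_wt_1_R2.fastq.gz,2,1
```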
  • Edit sample_parameter.csv. This has a few columns:

sample_parameter.csv layout:

| column     | description |
|------------|-------------|
| sample_id  | Same as sample_id in read_files.csv. |
| libtype    | Tells salmon (and some other rules in the snakemake file) what kind of library it is; see the salmon documentation on library types. This will vary from library to library, but most frequently you'll have SF for Ribo-seq (single-end, forward strand) and SR or SF for the matching RNA-seq. |
| group      | Groups your biological replicates together. |
| isriboseq  | True if the sample is Ribo-seq, False if it is RNA-seq. |
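For example, a matched pair of wild-type libraries might be parameterised like this (made-up sample and group names):

```
sample_id,libtype,group,isriboseq
ribo_wt_1,SF,wt,True
rna_wt_1,SR,wt,False
```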
  • Edit config.yaml - see the comments in the file; this is where you put in e.g. the path to the annotation, genome sequence, etc.
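The exact key names are defined in config.yaml itself, so the snippet below is purely illustrative; the point is that you supply absolute paths to your annotation and genome files:

```yaml
# Illustrative only - use the actual keys documented in the shipped config.yaml
GTF: /fast/AG_Ohler/annotation/gencode.v37.annotation.gtf   # hypothetical annotation path
REF: /fast/AG_Ohler/annotation/GRCh38.genome.fa             # hypothetical genome fasta path
```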

  • Make and enter a pipeline directory:

```bash
mkdir pipeline; cd pipeline
```

  • Make a link to the snakefile:

```bash
ln -s ../src/pipeline.smk Snakefile
```

  • Running the pipeline
    • First do a 'dry run' with `snakemake -n` - this runs the snakemake code without executing the rules.
    • Most likely there'll be misspecified paths etc. the first time; you may need to look at src/pipeline.smk to figure out what's wrong, or even drop a set_trace() somewhere inside it and then e.g. look at sampledf or seqfilesdf to make sure they look right.
    • When that works, the pipeline is run by doing `bash ../src/snake_job.sh`.
  • Inevitable debugging
    • The bugs you will face (and you will face them) can be divided into several main categories.
    • Bugs at the level of Snakemake
      • Often the snakemake won't run the way you think it should. This is often due to irregularities in the input files (did you use a comma where a space should be, typos in sample ids, etc.). Test-run the snakemake by doing `snakemake -n`.
    • Bugs in the bash code for a rule - for these it's often useful to do `snakemake -p -j2 problem_file`. This will show you the code being run; you can copy-paste it to run line by line and find out what's wrong.

.... a few iterations later

  • When `snakemake -n` runs, set the whole thing running on the cluster with `bash ../src/snake_job.sh` (I generally run it interactively with screen, so I get colored feedback - you can also qsub this script).

    • When this fails, you can see the errors for specific jobs by looking at the log files in sge_logs/.
    • It's often useful to just rerun a specific file on your node, without submitting to the cluster, by doing `snakemake -p -j2 problem_file` - this will show you the command that's being run and the error message.
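A consolidated sketch of the debugging loop described above (the target `path/to/problem_file` is a placeholder for whichever output file failed):

```bash
snakemake -n                            # dry run - check what snakemake plans to do
bash ../src/snake_job.sh                # run the full pipeline on the cluster
ls -t sge_logs/ | head                  # after a failure, find the most recent job logs
snakemake -p -j2 path/to/problem_file   # rerun one target locally, printing its shell code
```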

Some useful code snippets:

```bash
# code to check cutadapt worked
grep Summary -A7 pipeline/cutadapt_reads/*/*fastq.gz.cutadaptstats.txt
```

```r
# code to see how many reads were lost to collapse_reads
library(tidyverse)  # str_extract, map, rownames_to_column, mutate
library(magrittr)   # %>% and set_colnames
Sys.glob('pipeline/collapse_reads/*/*.fastq.gz.collreadstats.txt') %>%
  setNames(., basename(dirname(.))) %>%
  map(readLines) %>%
  map(head, 4) %>%
  map(tail, 2) %>%
  map(str_extract, '\\d+') %>%
  simplify2array %>%
  t %>%
  set_colnames(c('input', 'uniq')) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  rownames_to_column('sample') %>%
  mutate(unique = round(as.numeric(uniq) / as.numeric(input), 3))
```

#To document