Riboseq_Pipeline

This is the lab's standard Ribo-seq processing pipeline: a Snakemake workflow and some R scripts, including ORFquant and RiboseQC.

  • Make sure you have Miniconda3 working on the server

  • On the max cluster, get yourself a node with, say, 8 cores and 10 GB of memory each:

```bash
qrsh -V -now no -pe smp 8 -l m_mem_free=10G
```
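Once the session starts, you can optionally sanity-check the allocation (assuming a standard SGE setup, where $NSLOTS holds the number of granted slots):

```bash
echo $NSLOTS   # slots granted by the scheduler - should print 8
free -h        # memory available on the node
```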

  • Clone the Riboseq pipeline into a new folder:

```bash
cd /fast/AG_Ohler/${USER}
git clone https://github.com/ohlerlab/Riboseq_Pipeline/ Pipeline_Demo
cd Pipeline_Demo
```

  • Install the base conda environment if it's not already installed, and then activate it:

```bash
conda env create -f ribopipebase.yml -n ribopipe
conda activate ribopipe
```
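To confirm the environment is active (an optional check; exact tool versions depend on ribopipebase.yml):

```bash
which snakemake      # should resolve inside the ribopipe environment
snakemake --version
```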

  • Install our lab's packages RiboseQC, ORFquant, and Ribostan. By default, the pipeline will look for a folder above the project folder called Applications, so create this folder (in e.g. /fast/AG_Ohler/dharnet/Applications) and populate it like so:

```bash
mkdir Applications
git clone https://github.com/ohlerlab/RiboseQC.git Applications/RiboseQC
git clone https://github.com/ohlerlab/ORFquant.git Applications/ORFquant
git clone https://github.com/zslastman/Ribostan.git Applications/Ribostan
```

  • Create a branch for this project:

```bash
git branch Pipeline_Demo
git checkout Pipeline_Demo
```

  • Edit src/read_files.csv to point it to your fastq files (zipped or not, either is fine). You can do this by hand, or just use a bash loop like:

```bash
echo "sample_id,file_id,mate,pair_id" > src/read_files.csv
for fastq in /fast/dharnet/Ebp1_Riboseq/input/*.fastq.gz; do
    echo "$(basename ${fastq%.fastq.gz}),${fastq}"
done >> src/read_files.csv
```
read_files.csv layout:

| column    | description |
|-----------|-------------|
| sample_id | The id which links fastq files to a biological sample. Paired-end mates, as well as technical replicates (resequencing), will have the same sample_id, and e.g. fastqc will combine them. |
| file_id   | The file path. |
| mate      | Identifies which end a file holds, for paired-end reads (1 or 2). |
| pair_id   | Groups paired-end files (both mates of a pair get the identical number). |
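For illustration, here is what read_files.csv might look like with one single-end Ribo-seq sample and one paired-end RNA-seq sample. Sample names and paths are made up, and leaving mate/pair_id empty for single-end files is an assumption based on the loop above, which only writes the first two columns:

```
sample_id,file_id,mate,pair_id
ribo_wt_1,/fast/AG_Ohler/myuser/input/ribo_wt_1.fastq.gz,,
rna_wt_1,/fast/AG_Ohler/myuser/input/rna_wt_1_R1.fastq.gz,1,1
rna_wt_1,/fast/AG_Ohler/myuser/input/rna_wt_1_R2.fastq.gz,2,1
```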
  • Edit sample_parameter.csv. This has a few columns:

sample_parameter.csv layout:

| column     | description |
|------------|-------------|
| sample_id  | Same as sample_id in read_files.csv. |
| libtype    | Tells salmon (and some other rules in the snakemake file) what kind of library it is; see the salmon documentation on library types. This will vary from library to library, but most frequently you'll have SF for Ribo-seq (single-end, forward strand) and SR or SF for the matching RNA-seq. |
| group      | Groups your biological replicates together. |
| isriboseq  | True if the sample is Ribo-seq, False if it is RNA-seq. |
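For example, a matched pair of wild-type libraries might be parameterised like this (made-up sample and group names):

```
sample_id,libtype,group,isriboseq
ribo_wt_1,SF,wt,True
rna_wt_1,SR,wt,False
```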
  • Edit config.yaml - see the comments in the file; this is where you put in e.g. the path to the annotation, genome sequence, etc.
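The exact key names are defined in config.yaml itself, so the snippet below is purely illustrative; the point is that you supply absolute paths to your annotation and genome files:

```yaml
# Illustrative only - use the actual keys documented in the shipped config.yaml
GTF: /fast/AG_Ohler/annotation/gencode.v37.annotation.gtf   # hypothetical annotation path
REF: /fast/AG_Ohler/annotation/GRCh38.genome.fa             # hypothetical genome fasta path
```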

  • Make and enter a pipeline directory:

```bash
mkdir pipeline; cd pipeline
```

  • Make a link to the snakefile:

```bash
ln -s ../src/pipeline.smk Snakefile
```

  • Running the pipeline
    • First do a 'dry run' with `snakemake -n` - this runs the snakemake code without executing the rules.
    • Most likely there'll be misspecified paths etc. the first time; you may need to look at src/pipeline.smk to figure out what's wrong, or even drop a set_trace() somewhere inside it and then e.g. look at sampledf or seqfilesdf to make sure they look right.
    • When that works, the pipeline is run by doing `bash ../src/snake_job.sh`.
  • Inevitable debugging
    • The bugs you will face (and you will face them) can be divided into several main categories.
    • Bugs at the level of Snakemake
      • Often the snakemake won't run the way you think it should. This is often due to irregularities in the input files (did you use a comma where a space should be, typos in sample ids, etc.). Test-run the snakemake by doing `snakemake -n`.
    • Bugs in the bash code for a rule - for these it's often useful to do `snakemake -p -j2 problem_file`. This will show you the code being run; you can copy-paste it to run line by line and find out what's wrong.

.... a few iterations later

  • When `snakemake -n` runs, set the whole thing running on the cluster with `bash ../src/snake_job.sh` (I generally run it interactively with screen, so I get colored feedback - you can also qsub this script).

    • When this fails, you can see the errors for specific jobs by looking at the log files in sge_logs/.
    • It's often useful to just rerun a specific file on your node, without submitting to the cluster, by doing `snakemake -p -j2 problem_file` - this will show you the command that's being run and the error message.
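A consolidated sketch of the debugging loop described above (the target `path/to/problem_file` is a placeholder for whichever output file failed):

```bash
snakemake -n                            # dry run - check what snakemake plans to do
bash ../src/snake_job.sh                # run the full pipeline on the cluster
ls -t sge_logs/ | head                  # after a failure, find the most recent job logs
snakemake -p -j2 path/to/problem_file   # rerun one target locally, printing its shell code
```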

Some useful code snippets:

```bash
# code to check cutadapt worked
grep Summary -A7 pipeline/cutadapt_reads/*/*fastq.gz.cutadaptstats.txt
```

```r
# code to see how many reads were lost to collapse_reads
library(tidyverse)  # str_extract, map, rownames_to_column, mutate
library(magrittr)   # %>% and set_colnames
Sys.glob('pipeline/collapse_reads/*/*.fastq.gz.collreadstats.txt') %>%
  setNames(., basename(dirname(.))) %>%
  map(readLines) %>%
  map(head, 4) %>%
  map(tail, 2) %>%
  map(str_extract, '\\d+') %>%
  simplify2array %>%
  t %>%
  set_colnames(c('input', 'uniq')) %>%
  as.data.frame(stringsAsFactors = FALSE) %>%
  rownames_to_column('sample') %>%
  mutate(unique = round(as.numeric(uniq) / as.numeric(input), 3))
```

#To document