STAR-RSEM Snakemake pipeline for a cSCC meta-analysis project. Processes FASTQ files to generate STAR BAMs and RSEM quantification files, then performs differential expression analysis and batch correction with DESeq2.
# Preprocessing pipeline for a meta-analysis of cutaneous SCC RNA-Seq
First run the Snakemake workflow to align reads to the reference genome and quantify gene counts (see below). After the pipeline completes, run `deseq_analys.R` to perform differential expression tests with DESeq2. It produces the following outputs:
- `deseq_obj.rds` - DESeq object with count and sample metadata info. Also has model coefficients
- `vst_normalized_counts.rds` - VST-normalized data. Used as input for limma batch correction
- `deseq/` - contains differential expression test results
- Install Snakemake:

  ```shell
  # Using conda
  conda install -c bioconda snakemake
  ```
- Download STAR and RSEM indexes or create your own
- Ensure STAR and RSEM can run from your command line, or install Singularity to run them in containers
- Specify the pipeline settings in `config.yaml`
- Specify the samples to process in `samples.csv`
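If you build your own indexes rather than downloading them, the commands look roughly like the sketch below. The genome FASTA, GTF, and output paths are placeholders, and the commands are only stored and echoed here so you can review them before running them for real.

```shell
# Sketch: build matching STAR and RSEM indexes from the same reference.
# GENOME, GTF, and the output paths are placeholders -- point them at your files.
GENOME=GRCh38.primary_assembly.fa
GTF=gencode.v38.annotation.gtf

# Commands are echoed rather than executed so they can be reviewed first;
# run them directly once the paths are real.
STAR_CMD="STAR --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles $GENOME --sjdbGTFfile $GTF --sjdbOverhang 100"
RSEM_CMD="rsem-prepare-reference --gtf $GTF $GENOME rsem_index/ref"
echo "$STAR_CMD"
echo "$RSEM_CMD"
```

Note that `rsem-prepare-reference` also has a `--star` option that builds the STAR index in the same call, if you prefer a single step.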
`samples.csv` requires 5 columns:

- `patient` - patient- or sample-specific identifier
- `condition` - experimental condition, such as normal or tumor
- `fq1` - left FASTQ
- `fq2` - right FASTQ
- `strandedness` - one of `forward`, `reverse`, or `none`
See this tutorial to determine which strandedness your FASTQ files use: `forward` matches case A in the tutorial, `reverse` is case B, and `none` is case C. You can also use RSeQC to infer the strandedness of each sample. I have also developed my own tool for this using kallisto.
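For reference, a minimal `samples.csv` might look like the sketch below; the patient IDs and FASTQ paths are made up for illustration. The `awk` line is a quick sanity check that the header has the five required columns before you launch the pipeline.

```shell
# Write a hypothetical samples.csv (IDs and paths are illustrative only).
cat > samples.csv <<'EOF'
patient,condition,fq1,fq2,strandedness
P01,tumor,fastq/P01_T_R1.fq.gz,fastq/P01_T_R2.fq.gz,reverse
P01,normal,fastq/P01_N_R1.fq.gz,fastq/P01_N_R2.fq.gz,reverse
P02,tumor,fastq/P02_T_R1.fq.gz,fastq/P02_T_R2.fq.gz,none
EOF

# Sanity-check the header column count.
awk -F, 'NR==1 { print (NF==5 ? "header OK" : "header BROKEN") }' samples.csv
```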
To run `star-rsem` without Singularity:

```shell
snakemake -j [cores]
```

To run with Singularity containers:

```shell
snakemake --use-singularity -j [cores]
```
`run_pipeline.sh` is a bash script that executes the pipeline as a SLURM job. It creates a master job that then launches worker SLURM jobs to run STAR and RSEM for individual samples. To use SLURM execution, do the following:
- Modify `run_pipeline.sh` to run on your cluster
- Edit the `out` and `account` fields for the default job in `cluster.json`. The `out` path must already exist; Snakemake will not create directories for you
- Launch the master job: `sbatch run_pipeline.sh $(pwd)`
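The default-job entry in `cluster.json` might look roughly like this sketch. Only the `out` and `account` fields are named in this README; the remaining keys and all values are assumptions based on typical Snakemake cluster configs, so match them to your cluster:

```json
{
    "__default__": {
        "account": "my_slurm_account",
        "out": "/path/that/already/exists/logs/slurm-%j.out",
        "time": "12:00:00",
        "mem": "8G"
    }
}
```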
The pipeline writes its results to the following directories:

- `star/` - contains the output of each STAR run
- `rsem/` - contains the output of each RSEM run
- `qc/` - contains FASTQ quality-control checks via `validateFastq`. These are summarized in `validateFastq_summary.csv`
Common issues:

- STAR needs a lot of RAM, especially for the human genome. Specify the resources in `cluster.json` accordingly
- Depending on the number of samples you are processing and the number of reads per sample, you may need to increase the time limits in `run_pipeline.sh` and `cluster.json`
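One way to give STAR more memory and time is a per-rule override in `cluster.json`, which takes precedence over the default job. The rule name `star_align` and the values below are assumptions; use the actual rule names from the Snakefile:

```json
{
    "star_align": {
        "mem": "40G",
        "time": "24:00:00"
    }
}
```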