Contact: Yongchao Ge (yongchao.ge@mssm.edu)
- MoTrPAC RNA-seq MOP (web view version 2.0)
- This README file and the MOP_details files were also consulted with Nicole Renee Gay's implementation https://bitbucket.org/nicolerg/motrpac_rna_mop
We are heavily relying on conda to install/update many bioinformatics software with the same fixed versions. Most updated softwares are available at conda https://conda.io/miniconda.html .
- Following the instructions at conda_install.sh to install the python2 and python3 under the conda root folder
$conda
(a user defined path to install conda). The actual python2 and python3 executable files will be$conda/python2/bin/python2
and$conda/python3/bin/python3
. - The last command at the file
conda_install.sh
installs the specified versions of software packages and their dependency packages.conda install \ python=3.6.6 \ snakemake=5.3.0\ star=2.7.0d\ cutadapt=1.18 \ picard=2.18.16 \ samtools=1.3.1 \ r-base=3.4.1 \ rsem=1.3.1 \ multiqc=1.6 \ bowtie2=2.3.4.3\ fastqc=0.11.8\ bismark=0.20.0\ subread=1.6.3\ ucsc-gtftogenepred=366\ gawk=4.2.1
We rely on the same set-up of conda installation so that the dependency softwares/packages (with the same versions) are portable. We need to export the environmental variables PATH
, MOTRPAC_root
,MOTRPAC_conda
,MOTRPAC_refdata
MOTRPAC_root
is the root folder of the github codeMOTRPAC_conda
is conda installation folder$conda
mentioned aboveMOTRPAC_refdata
is the genome reference folderexport $(bin/load_motrpac.sh)
will export all of the above environment variables for Sinai peopleexport $(bin/load_motrpac.sh -c $conda -r $refdata)
for non-sinai people, where$conda
is the conda installation folder as mentioned in section A.1 and$refdata
is the folder that will contain the genome data as mentioned in section A.3 below.- If the above
export
may not work, try to run the command directlybin/load_motrpac.sh
to identify errors, and then run the corresponding export commands.
- Follow the commands in source_data.sh to download the genome source data (fa and gtf from gencode and ensembl) and also build the bowtie2_index for the miscellaneous small data (globin and rRNA)
- Running the snakefile genome_index.snakefile to build the genome index for each genome folder that was downloaded by source_data.sh. The following is an example hg38_gencode_v30
cd $MOTRPAC_refdata/hg38_gencode_v30 snakemake -s $MOTRPAC_root/genome_index.snakefile #The above snakemake command can use more than one CPU core to speed it up as below #snakemake -j $NUMBEER_OF_CPUS -s $MOTRPAC_root/genome_index.snakefile #The computation be further sped up by submitting the jobs to clusters, see section B.2 on the details.
- The following gives the implementation technical details of the above commands and these commands shouldn't run separately.
- Human data is on
hg38_gencode_v30
from gencodeftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/gencode.v30.primary_assembly.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_30/GRCh38.primary_assembly.genome.fa.gz
- Rat data is on
rn6_ensembl_r96
from ensemblftp://ftp.ensembl.org/pub/release-96/fasta/rattus_norvegicus/dna/Rattus_norvegicus.Rnor_6.0.dna.toplevel.fa.gz ftp://ftp.ensembl.org/pub/release-96/gtf/rattus_norvegicus/Rattus_norvegicus.Rnor_6.0.96.gtf.gz
- The gtf and fa files from ensembl have been modified to have "chr" as part of the chromosome name as in gencode data, see fixchr4ensembl.sh for details
- The gtf file is sorted, details can be seen in genome.sh
- Different types genome reference index was built with the following command and the details on these commands will be described in MOP_details.md
- star_index
- bowtie2_index
- rsem_index
- qc53_ref
- bismark_index, only used for RRBS data, not used for RNA-seq data.
- Human data is on
Details on setting-up the sequencing parameters for NuGEN are described in Section 1 of the MOP.
- bcl2fastq.sh generates three fastq files, R1, I1 and R2.
All of the fastq files can be saved with
${SID}_R1.fastq.gz
,${SID}_I1.fastq.gz
and${SID}_R2.fastq.gz
under folderfastq_raw
, whereSID
is the sample id. - Using the mask in bcl2fastq command
--use-bases-mask Y*,I8Y*,I*,Y*
and options--mask-short-adapter-reads 0 --minimum-trimmed-read-length 0
- Joining fastq files from all lanes and renaming the files, note that the old
R2
becomes newI1
and oldR3
becomes newR2
.cat $fastq_folder/${SID}_S*_L00?_R1_001.fastq.gz >${SID}_R1.fastq.gz cat $fastq_folder/${SID}_S*_L00?_R2_001.fastq.gz >${SID}_I1.fastq.gz cat $fastq_folder/${SID}_S*_L00?_R3_001.fastq.gz >${SID}_R2.fastq.gz
- In a work folder, a subfolder
fastq_raw
contains the fastq files of all samples${SID}_R1.fastq.gz
,${SID}_I1.fastq.gz
and${SID}_R2.fastq.gz
. - Make sure the
MOTRPAC_root
,PATH
and other environmental variables have been setup correctly according to section A.2
- Run the command locally to debug possible problems below for the human genome
snakemake -s $MOTRPAC_root/rna-seq.snakefile
- If the data is for rat samples, run the command below for the rat genome
rn6_ensembl_r96
snakemake -s $MOTRPAC_root/rna-seq.snakefile --config genome=rn6_ensembl_r96
If the snakemake is running OK locally, then submit the snakemake jobs to the cluster. This is only necessary for large jobs. These scripts were written for Sinai LSF and Stanford SLURM job submission systems. Other cluster job submission systems may need to write their own scripts and configuration files ($MOTRPAC_root/config/
).
For the Sinai LSF job submission system:
Snakemake_lsf.sh -- -s $MOTRPAC_root/rna-seq.snakefile --config genome=rn6_ensembl_r96
For a SLURM job submission system (e.g. SCG Informatics Cluster):
- Where
${genome}
is defined ashg38_gencode_v30
for human RNA-seq data orrn6_ensembl_r96
for rat RNA-seq data, run the following command to run the pipeline withsbatch
:
$MOTRPAC_root/bin/Snakemake_slurm.sh ${genome} ${outdir}
- Change
SBATCH
options as needed in$MOTRPAC_root/config/slurm.json
.
- The default uses
python3
, while a few tools relies onpython2
. Both python3 and python2 co-exists peacefully by callingpython2
for python2 specific scripts. - Each individual component was implemented as an independent bash script file
- The bash script follows the strict bash mode with the setting of
set -euo pipefail
- The pipeline works for single ends file (with only
_R1
file) or paired-ends file (with_R1
and_R2
files) or with UMI (with_I1
file) - The default option is for the paired-end files with UMI of MOTRPAC RNA-seq with NuGEN UDI
- The file MOP_details.md describes UMI paired ends setting (the same as what the MOP described), but the actual bash script handles more complicated situations for different species and different types of data as mentioned right above.
- With the appropriate setting of the parameters, this pipeline can also work with other RNA-seq data
- Some codes are also part of other MOTRPAC projects (for example, RRBS-seq data) for building the genome references.
- The implementation details on the RNA-seq MOP can be seen in MOP_details.md.
- The commands in MOP_details.md don't need to be run separately as everything has already been taken care of by the snakemake commands in the above Section B.2