MacLab End-to-End RNA-Seq Pipeline (Last updated: 04/17/2023)

Introduction

The RNA sequencing (RNA-Seq) pipeline processes sequencing data to estimate transcript abundance and identify differentially expressed transcripts across samples. There are various tools that can be used to achieve similar results, though this program will make use of tools optimized for use in the MacDougald laboratory at the University of Michigan, Ann Arbor.

Extraction of Raw Sequences

The RNA-Seq pipeline supports the input of raw next-generation sequencing (NGS) data in the FASTQ format. The data can be directly obtained from RNA-seq experiments submitted to Illumina or the Beijing Genome Institute or it may be collected from public repositories. Many published NGS studies in the Gene Expression Omnibus (GEO) provide direct links to the raw sequence data stored at the Sequence Read Archive (SRA). The sequence data from SRA normally requires decompression and, sometimes, proper splitting to generate the right FASTQ files.

Most sequencing data in the MacDougald lab are derived from Illumina and the Beijing Genome Institute (BGI). Before you look at the raw sequences, you should ask yourself the following questions:

What is the animal model used (e.g. human, mouse)?
Are the reads paired-end or strand-specific? If they are strand-specific, is it forward or reversely stranded?

RNA-seq pipeline outline

The RNA-seq pipeline conducts two analyses in parallel. The first analysis uses STAR (v2.7.5a) for both the alignment and quantification procedures. The second procedure uses Salmon (v0.11.30) for pseudoalignment. The result from the two analyses are compared after differential expression analysis is conducted. A more detailed outline of the RNA-seq pipeline can be viewed in the image below. The versions of each program used for the analysis are: FASTQC (v0.11.8), STAR (v2.7.5a), python (v3.9.7), Salmon (v0.11.30), and Cutadapt (v1.18).

Limitations of the RNA-seq pipeline

The MacLab RNA-seq pipeline was made with the most common use cases of RNA-seq in the MacDougald lab in mind. Therefore, the pipeline will not be able to conduct analyses in the following situations:

The animal model is not human or mouse.
The researcher is interested in identifying novel trascripts, rather than just identifying differentially expressed genes.
The researcher requires customizable plots.
Adapters and low quality reads were not removed during the sequencing procedure (The UM Bioinformatics Core and BGI automatically conducts these procedures).
The researcher is interested in controlling for batch effects or other variables (e.g. sex, age).

If any of the above procedures/analyses are required, please contact Akira Nishii (anishii@umich.edu), Rebecca Schill (rschill@umich.edu), or Jessica Maung (jmaung@umich.edu) with an explanation of necessary adjustments.

How to use the RNA-seq pipeline

The use of the RNA-seq pipeline requires four steps: (1) preparing input files and downloading the raw sequencing data, (2) accessing the Great Lakes Server via the browser, (3) accessing the Great Lakes Server via the terminal, and (4) running the RNA-seq Pipeline. All steps are outlined below.

STEP 1: Prepare input files and download the raw sequencing data

In excel, create a file that contains the following columns: (1) Sample, which contains the sample names, and (2) TreatmentX (replace X with a number between 1 and 10), which contains the treatment categorization of each group. There can be multiple treatment columns, but please make sure they are labeled with different numbers. Once made, please save this file as samples.csv. An example of this file can be found below:

In excel, create another file that contains the following columns: (1) Group, which contains one of the Treatments from step 1, (2) Control, which contains the control group for the comparison you are interested in conducting for the Treatment specified in the Group column, and (3) Treatment, which contains the treatment group for the comparison you are interested in conducting for the Treatment specified in the Group column. Once made, please save this file as comparisons.csv. PLEASE MAKE SURE THE samples.csv AND comparisons.csv FILES ARE CONSISTENT IN LABELING OF TREATMENT NAMES AND GROUPS. An example of this file can be found below:

Download the raw sequencing data (.fastq.gz format). Place all of your raw RNA-seq files (.fastq.gz) into the directory labeled clean. Make sure each sample is placed in its own directory with a unique sample name that matches the sample name on the samples.csv file created in step 1 (this is the default file organization when sequencing data is received from BGI). PLEASE MAKE SURE YOUR FILES ARE BACKED UP IN AN EXTERNAL HARD DRIVE OR OTHER DATA STORAGE LOCATIONS BEFORE PROCEEDING TO FURTHER STEPS.

Optional (for complex comparisons). For comparisons that require subsetting of data by more than two or more treatment groups, provide a third file saved as exclusions.csv. In this file, please provide the following columns: (1) A group column, which is identical to the group column in the comparisons.csv file and (2) seperate columns for treatment groups you would like to subset the data by. Once made, please provide information on which variables you would like to exclude from your data before making your comparisons. For example, for the second comparison in the comparisons.csv file (Treatment1 B vs C), if you are actually only interested in comparing B vs C among samples who are classified as 'ab' in Treatment 2, you would need to exclude the samples that are labeled 'up'. Therefore, the second row (for the second comparison) and the column labeled "Treatment2" of the exclusions.csv file will contain the word 'up'. This translates to "Compare B vs C in Treatment1 after excluding 'up' in the data set". An alternative translation would be, "Compare B vs C for 'ab' only." An example of this file can be found below:

STEP 1.5a (OPTIONAL): How to transfer files from AWS to the Great Lakes Server

Navigate to the directory you are interested in downloading the files in using the cd command
Load the AWS command line interface module by using the command: module load aws-cli/2
Response to prompts (BGI should provide you with this information) after using the command: aws configure
Download files of interest into the current directory by using the command: aws s3 sync s3://[REPLACE WITH BUCKET NAME] .

NOTE: If you use this option, please make sure the files and folders are labeled and organized correctly, as described in STEP 1.

STEP 1.5b (OPTIONAL): How to transfer files from a personal computer to the Great Lakes Server using Globus

Download globus personal connect from this link and complete setup: https://www.globus.org/globus-connect-personal
Nagivate to: https://www.globus.org/
Login using your umich email
Select middle panel in the top right corner of the screen. This should open up two windows on the right and left side of the screen. See image below.

On the left side of the screen, select your local collection, then select navigate to the folder you are interested in transferring from your personal computer.
On the right side of the screen, type umich#greatlakes in the "Collection" box. Type /nfs/turbo/umms-macdouga/ in the "Path" box.
Select the folder you are interested in moving from your personal computer to the Great Lakes Server on the left side of the screen by checking the square box next to the folder.
Click on the blue "start" button to transfer the folder from your personal computer to the Great Lakers Server.

NOTE: If you use this option, please make sure the files and folders are labeled and organized correctly, as described in STEP 1.

STEP 2: Access the Great Lakes Server via the browser

Install the Cisco VPN client (https://its.umich.edu/enterprise/wifi-networks/vpn) and login using your level 1 password. Use the "UMVPN - Only U-M Traffic" option.
Search “http://ssh uniqname@greatlakes.arc-ts.umich.edu" in the browser.
It will navigate you to a log in page. Use your level 1 account to access it.
Click files -> home directory.
Click "GO TO" on the navigation bar, and type the path "/nfs/turbo/umms-macdouga/" into it. Click ok.
Create a new directory for yourself. This directory will be referred to as [NEW DIRECTORY NAME] in later steps.
Upload the directory labeled clean and the files labeled samples.csv and comparisons.csv from STEP 1 into [NEW DIRECTORY NAME].

NOTE: If you have difficulty with obtaining permissions for any of the steps above, you may need to fill out a form on the Great Lakes Server website to create an account (https://arc.umich.edu/greatlakes/user-guide/).

STEP 3: Access the Great Lakes Server via the terminal

Type the following into your terminal: ssh [REPLACE WITH UNIQNAME]@greatlakes.arc-ts.umich.edu
Use your level 1 account to login.
Navigate to new directory by typing the command: cd /nfs/turbo/umms-macdouga/[NEW DIRECTORY NAME]
Load the git module onto the Great Lakes Server by using the command: module load git
Obtain the necessary programs from the RNA-seq pipeline using the command: git clone https://github.com/akiranishii/rnaseq-pipeline.git
Run the following commands in sequence to reorganize the files for the pipeline:
mv -v rnaseq-pipeline/* .
rm -r rnaseq-pipeline

STEP 4: Run the RNA-seq Pipeline

Navigate to the correct directory by using the command: cd /nfs/turbo/umms-macdouga/[NEW DIRECTORY NAME]
Run the analysis by using one of the following commands (make sure you choose the correct animal model):
sbatch run_human_rna_seq.sh or sbatch run_mouse_rna_seq.sh
You may now exit the terminal. The analysis will take approximately 7 hours for around 10 samples under current settings. To cancel a run at any time, type the command: scancel [INSERT SLURM JOB ID HERE]

NOTE: By default, email notifications about when the run STARTED, ENDED, and FAILED will be sent to log.rnaseq@gmail.com. This can be changed in the run_human_rna_seq.sh and run_mouse_rna_seq.sh documents, if desired.

First-level Quality Control after running RNA-seq Pipeline

After running the RNA-seq pipeline, it is important to check the quality of the raw data files, as well as the efficacy of alignment. This can be done by looking at two folders generated by the RNA-seq pipeline.

First, the folder labeled "fastqc" will contain data on the quality of the raw data files. Each subfolder in the "fastqc" folder contains information on the quality of EACH sample. For details on how to interpret the results of a FASTQC analysis, please refer to this link: https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html

The "fasqc" folder will also contain the results from multiqc, which summarizes the outputs from all fasqc files into one document. The file is called "multiqc_report.html". A sample output of multiqc is below:

Second, the folders labeled "alignments_star" and "alignments_salmon" will contain its own "multiqc_report.html" plot, which will allow you to look at the level of successful alignment during the analysis. Examples of successful and failed outputs are shown below:

Output files

All resulting files from the RNA-seq pipeline will be located in a folder called "Output." The "Output" folder will contain three subfolders called "Figures_STAR," "Figures_Salmon," and "Cross_Validation." Each subfolder will contain additional subfolders for each comparison specified for the analysis. "Figures_STAR" will contain all figures generated using the regular alignment procedure, whereas the "Figures_Salmon" will contain all figures generated using the pseudoalignment procedure. The "Cross Validation" file will contain all figures generated for the purpose of cross-checking the results from regular alignment and pseudoalignment.

"Figures_STAR" and "Figures_Salmon" subfolders

There are three types of figures in the "Figures_STAR" and "Figures_Salmon" subfolders. (1) Quality control files, (2) Differential expression files, and (3) Pathway analysis files. This section will outline the location and interpretation of output files that will result from running the RNA-seq pipeline.

Quality control files are located in a subfolder called "QualityControl". The folder contains the following figures for each comparison:

Dispersion plot

Heatmap (make sure groups cluster)

Library size (make sure library size is deep enough and similar across different samples)

PCA plot (make sure groups cluster)

MA plot (general overview of percentage of differentially expressed genes)

2. Differential expression files are located in a subfolder called "OutputTables." The folder contains the following data for each comparison:

Differential expression results in a .csv file format

Depth normalized counts table

rlog normalized counts table

raw counts table

MA plot HTML (use this version of the MA plot to see which points correspond to which genes)

3. The Pathway analysis files will be located in a subfolder called "PathwayAnalysis." The folder contains the following data for each comparison:

.rnk file used as input for Gene Set Enrichment Analysis (GSEA)
GSEA output folders (analysis conducted for 'Hallmark gene set,' 'Curated gene set,' and 'Regulatory target gene set'). More information on GSEA and available gene sets can be accessed here: http://www.gsea-msigdb.org/gsea/msigdb/index.jsp

The "Cross_Validation" subfolder

The "Cross_Validation" subfolder contains information that cross-checks the consistency between regular alignment and pseudoalignment. The folder contains the following data:

Recommended: Long-term File Storage on the Data Den

The Data Den is an archive optimized for data that is not regularly accessed for extended periods of time (weeks to years). Once you are done running the RNA-seq pipeline, it is recommended that you store all raw data files on the Data Den. Several steps must be taken in order to transfer files from the Great Lakes Server to the Data Den.

Ask your PI to email arc-support@umich.edu to update the Data Den access list by providing your name and uniqname.
Create an user account on the Flux Transfer Service using this link: https://arc.umich.edu/login-request
Nagivate to: https://www.globus.org/
Login using your umich email
Select middle panel in the top right corner of the screen. This should open up two windows on the right and left side of the screen. See image below.

On the right side of the screen, type umich#greatlakes in the "Collection" box. Type /nfs/turbo/umms-macdouga/ in the "Path" box.
On the left side of the screen, type umich#flux in the "Collection" box. Type /nfs/dataden/umms-macdouga/ in the "Path" box.
Select the folder you are interested in moving from the Great Lakes Server to the Data Den on the left side of the screen by checking the square box next to the folder.
Click on the blue "start" button to transfer the folder from the Great Lakers Server to the Data Den.

Post-Pipeline File Cleanup

The Great Lakes Server is NOT meant for long-term storage of data files. Therefore, please transfer all files to an external hard drive or other data storage location (such as the UM Data Den). Once all files are safely stored at an external location, the files can be removed by using the following commands in sequence:

cd /nfs/turbo/umms-macdouga/
rm -r [NEW DIRECTORY NAME]

Estimating cost of run on Great Lakes Server

There is a 1 to 7 ratio for cost on Great Lakes. If you use 2 cores and 7GB it would be the same as using 1 core and 14GB. However, while the rates are shown on the page copied it does not included all information that goes into the calculation for a job. Much of it depends on the partition used and the ratio of the resources. For your resource example you cold take the 200GB and divide by 7 and round up the answer. This would give you 29 and then multiply it by the rate per minute for the standard partition. This would give you $0.0123016.

There is an easier way to calculate this though. We have also created a script that allows users to get an estimate of the cost of a job before it gets submitted:

my_job_estimate

That command line script accepts the following arguments:

-c , --cores Total required cores for job. (Default: 1)
-g , --gpus Total required gpus for job. (Default: 0)
-n , --nodes Total required nodes for job. (Default: 1)
-m , --memory Total required memory for job. (Default: 768mb)
-p , --partition Partition being submitted to. (Default: standard)
-t , --time Total required time for job. (Default: 01:00:00)

So if you needed 8 cores and 200GB you could find out the per minute cost like this:

my_job_estimate -c 8 -m 200gb -p standard -t 00:01:00

NOTE: This price estimate assumes your job runs
for the full walltime. Cost is subject to change.

Past Publications that use the tools in the RNA-seq pipeline

Helpful links and resources

Fastq file format: https://support.illumina.com/bulletins/2016/04/fastq-files-explained.html
Great Lakes Server: https://arc.umich.edu/greatlakes/user-guide/
FASTQC: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FASTQC interpretation: https://hbctraining.github.io/Intro-to-rnaseq-hpc-salmon/lessons/qc_fastqc_assessment.html
MultiQC: https://multiqc.info/
Cutadapt: https://cutadapt.readthedocs.io/en/stable/#cutadapt
STAR: https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf
FeatureCounts: https://rnnh.github.io/bioinfo-notebook/docs/featureCounts.html
Salmon: https://salmon.readthedocs.io/en/latest/salmon.html
DESeq2: https://bioconductor.org/packages/devel/bioc/manuals/DESeq2/man/DESeq2.pdf
Gene set enrichment analysis (GSEA): https://www.gsea-msigdb.org/gsea/index.jsp

Contact information

Please contact me at anishii@umich.edu if you experience any issues with using the pipeline.

akiranishii/rnaseq-pipeline