HTAN Bulk RNA Expression pipeline

The pipeline aligns bulk RNA-seq reads and produces aligned BAMs, per-gene read counts, and FPKM and FPKM-UQ expression values.

Release

v1.0: Initial release (git commit 4351a90f284e5a4a7902135ab698fe0a8ac356ac)

Output

Each batch run produces a TSV file, analysis_summary.dat, that lists all results.

Possible result types are:

genomic_bam
transcriptomic_bam
chimeric_sam
fpkm_tsv
splice_junction_tab

Gene expression table (read count, FPKM, and FPKM-UQ)

Each sample gets its own fpkm_tsv file, and every file lists the genes in exactly the same order. The TSV has the following columns:

Column	Description
gene_id	Ensembl gene ID
symbol	Gene symbol (not unique)
hgnc_id	HGNC ID (not unique)
read_count	Read count from featureCounts
fpkm	FPKM
fpkm_uq	FPKM-UQ
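
Because every sample's table shares the same gene order, per-sample files can be stacked into a gene-by-sample matrix without a join. The following is a minimal sketch, not part of the pipeline: it assumes each TSV starts with a header row carrying the column names above, and the function names, sample names, gene IDs, and values are all hypothetical.

```python
import csv
import io


def read_expression_tsv(handle, value_column="fpkm_uq"):
    """Return (gene_ids, values) from one per-sample expression TSV."""
    # Assumes a header row with the column names listed in the table above.
    reader = csv.DictReader(handle, delimiter="\t")
    gene_ids, values = [], []
    for row in reader:
        gene_ids.append(row["gene_id"])
        values.append(float(row[value_column]))
    return gene_ids, values


def combine_samples(sample_handles, value_column="fpkm_uq"):
    """Build {sample: values} from {sample: open file}, checking gene order."""
    reference_ids = None
    columns = {}
    for name, handle in sample_handles.items():
        gene_ids, values = read_expression_tsv(handle, value_column)
        if reference_ids is None:
            reference_ids = gene_ids
        elif gene_ids != reference_ids:
            # Guard against files from a different annotation or run.
            raise ValueError(f"gene order differs in sample {name}")
        columns[name] = values
    return reference_ids, columns


# Made-up, in-memory stand-ins for two per-sample files.
HEADER = "gene_id\tsymbol\thgnc_id\tread_count\tfpkm\tfpkm_uq\n"
sample_a = io.StringIO(HEADER + "GENE_1\tAAA\tHGNC:1\t10\t1.5\t2.5\n"
                                "GENE_2\tBBB\tHGNC:2\t0\t0.0\t0.0\n")
sample_b = io.StringIO(HEADER + "GENE_1\tAAA\tHGNC:1\t20\t3.0\t5.0\n"
                                "GENE_2\tBBB\tHGNC:2\t4\t0.5\t1.0\n")

genes, matrix = combine_samples({"sample_a": sample_a, "sample_b": sample_b})
```

Here `genes` holds the shared gene order and `matrix["sample_a"][i]` is the value for `genes[i]`. With real output, the `io.StringIO` objects would be replaced by opened per-sample files.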

Run the pipeline

Setup

The easiest way to set up is to create a conda environment with all the dependencies:

conda create -n htan_bulk_rna python=3.8 \
    snakemake-minimal=5.19.1 \
    pandas=1.0.4 \
    star=2.7.4a \
    samtools=1.10 htslib=1.10 \
    subread=2.0.1

Create a new batch

Copy example_batch/ to the location where the output should be stored
Edit snakemake_config.json and make sure every file path in it exists
Define the file map and the list of samples to run through the pipeline (same format as the example)
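
One way to check the paths in snakemake_config.json before starting a run is a small script like the one below. This is only a sketch and not part of the repository: the config keys are not assumed, so it walks the whole JSON document and treats any string containing a path separator as a path. The function names are hypothetical, and relative paths are resolved against the current working directory.

```python
import json
import os


def iter_strings(node, prefix=""):
    """Yield (key_path, value) for every string in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{prefix}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{prefix}/{index}")
    elif isinstance(node, str):
        yield prefix, node


def missing_paths(config):
    """Return entries that look like paths but do not exist on disk."""
    # Heuristic (an assumption): a string with a path separator is a path.
    return [
        (key_path, value)
        for key_path, value in iter_strings(config)
        if os.sep in value and not os.path.exists(value)
    ]


def check_config_file(config_path):
    """Load a JSON config file and report its missing paths."""
    with open(config_path) as handle:
        config = json.load(handle)
    problems = missing_paths(config)
    for key_path, value in problems:
        print(f"missing: {key_path} -> {value}")
    return problems
```

Calling `check_config_file("snakemake_config.json")` from the batch directory returns an empty list when every path-like value exists. Output locations that the pipeline creates itself may be reported as missing, so the result needs a manual look.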

# Create the result summary of the alignment outputs and readcount TSVs
snakemake --configfile=snakemake_config.json -s ../pipeline_workflow/Snakefile \
    --cores 54 -p \
    --resources io_heavy=5 -- \
    make_analysis_summary

# Only the alignment
snakemake ... star_align_all_samples

# All readcount and FPKMs
snakemake ... all_fpkms

Processing description

Genome alignment

Reads are aligned with STAR v2.7.4a against GDC's genome reference GRCh38.d1.vd1, using the GENCODE v34 annotation.

Readcount

Read counts are generated by featureCounts (subread v2.0.1) in stranded mode with the parameters: -g gene_id -t exon -Q 10 -p -B. The counts are then converted to FPKM and FPKM-UQ using GDC's formulas.
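
The conversion can be sketched as follows. This is an illustration of the formulas as published in GDC's mRNA pipeline documentation, not the pipeline's own code: FPKM = count × 10^9 / (total protein-coding count × gene length), and FPKM-UQ replaces the total with the 75th-percentile gene count. The set of genes entering each denominator and the percentile interpolation used here are assumptions, and all names and numbers are hypothetical.

```python
def percentile_75(values):
    """75th percentile using linear interpolation between ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values given")
    position = 0.75 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def fpkm_and_fpkm_uq(read_counts, gene_lengths, protein_coding):
    """Return {gene: (fpkm, fpkm_uq)}.

    read_counts:    {gene: read count}
    gene_lengths:   {gene: length in bp}
    protein_coding: set of genes whose counts form the FPKM denominator
    """
    total_pc = sum(read_counts[gene] for gene in protein_coding)
    # Assumption: the upper quartile is taken over all genes in the sample.
    upper_quartile = percentile_75(list(read_counts.values()))
    result = {}
    for gene, count in read_counts.items():
        scaled = count * 1e9 / gene_lengths[gene]
        result[gene] = (scaled / total_pc, scaled / upper_quartile)
    return result


# Made-up counts and lengths for five genes; gene E is non-coding.
counts = {"A": 100, "B": 200, "C": 300, "D": 400, "E": 0}
lengths = {gene: 1000 for gene in counts}
values = fpkm_and_fpkm_uq(counts, lengths, protein_coding={"A", "B", "C", "D"})
```

In this toy input the protein-coding total is 1000 reads and the 75th-percentile count is 300, so the two normalizations scale each gene differently while both correct for gene length.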

Annotation

GENCODE v34 GTF. Refer to the prepare_annotation/ folder for how the annotation files were generated.

sscien/HTAN_bulkRNA_expression