pipeline for HTAN bulk RNA-Seq data quantification

HTAN Bulk RNA Expression pipeline

The pipeline aligns bulk RNA-seq reads and produces aligned BAMs, per-gene read counts, and FPKM and FPKM-UQ values.

Release

  • v1.0: Initial release (git commit 4351a90f284e5a4a7902135ab698fe0a8ac356ac)

Output

Each batch run produces a TSV file, analysis_summary.dat, containing all results.

Possible result types are:

  • genomic_bam
  • transcriptomic_bam
  • chimeric_sam
  • fpkm_tsv
  • splic_junction_tab

Gene expression table (read count, FPKM, and FPKM-UQ)

Each sample gets its own fpkm_tsv file, and every such file lists the genes in exactly the same order. The TSV has the following columns:

| Column     | Description                 |
|------------|-----------------------------|
| gene_id    | Ensembl gene ID             |
| symbol     | Gene symbol (not unique)    |
| hgnc_id    | HGNC ID (not unique)        |
| read_count | Read count by featureCounts |
| fpkm       | FPKM                        |
| fpkm_uq    | FPKM-UQ                     |
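
Because every sample shares the same gene order, per-sample tables can be stacked into one expression matrix without a join. The following is a minimal sketch using only the Python standard library. It assumes each fpkm_tsv starts with a header row carrying the column names listed above; the function names `read_fpkm_table` and `build_matrix` are hypothetical and not part of the pipeline.

```python
import csv

# Column names taken from the table above; a header row is an assumption.
EXPECTED_COLUMNS = ["gene_id", "symbol", "hgnc_id", "read_count", "fpkm", "fpkm_uq"]


def read_fpkm_table(handle):
    """Parse one fpkm_tsv file object into a list of row dicts."""
    reader = csv.DictReader(handle, delimiter="\t")
    header = reader.fieldnames or []
    missing = [name for name in EXPECTED_COLUMNS if name not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return list(reader)


def build_matrix(tables, value_column="fpkm"):
    """Combine per-sample tables into (samples, gene_ids, rows).

    `tables` maps a sample name to the rows returned by read_fpkm_table.
    rows[i][j] is the value of gene_ids[i] in samples[j].
    """
    samples = list(tables)
    if not samples:
        raise ValueError("no samples given")
    gene_ids = [row["gene_id"] for row in tables[samples[0]]]
    for sample in samples[1:]:
        # Rely on the identical gene order instead of joining on gene_id.
        if [row["gene_id"] for row in tables[sample]] != gene_ids:
            raise ValueError(f"gene order differs in sample {sample}")
    rows = [
        [float(tables[sample][i][value_column]) for sample in samples]
        for i in range(len(gene_ids))
    ]
    return samples, gene_ids, rows
```

A table would typically be read by opening the file in text mode and passing the handle to `read_fpkm_table`. The order check is deliberately strict: it fails loudly if a file from a different annotation version slips into the batch, which a join on gene_id could silently hide.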

Run the pipeline

Setup

The easiest way to get started is to create a conda environment with all the dependencies:

conda create -n htan_bulk_rna python=3.8 \
    snakemake-minimal=5.19.1 \
    pandas=1.0.4 \
    star=2.7.4a \
    samtools=1.10 htslib=1.10 \
    subread=2.0.1

Create a new batch

  1. Copy example_batch/ to the desired location to store the output
  2. Modify the snakemake_config.json to ensure all file paths exist
  3. Define the file map and the list of samples to run the pipeline (same format as the example)
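
The path check in step 2 can be partly automated. The sketch below walks the parsed snakemake_config.json and reports string values that look like paths but do not exist on disk. The keys of the config are not assumed; treating any string that contains a "/" as a path is a heuristic chosen for illustration, and the helper names `iter_strings` and `missing_paths` are hypothetical.

```python
import json
import os


def iter_strings(node, prefix=""):
    """Yield (key_path, value) for every string in nested JSON data."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from iter_strings(value, f"{prefix}/{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from iter_strings(value, f"{prefix}/{index}")
    elif isinstance(node, str):
        yield prefix, node


def missing_paths(config, base_dir="."):
    """Return (key_path, value) pairs for path-like strings that do not exist.

    Heuristic (an assumption): only strings containing "/" are treated as
    paths. Relative paths are resolved against base_dir.
    """
    missing = []
    for key_path, value in iter_strings(config):
        if "/" not in value:
            continue
        candidate = value if os.path.isabs(value) else os.path.join(base_dir, value)
        if not os.path.exists(candidate):
            missing.append((key_path, value))
    return missing


def load_config(path):
    """Load a JSON config file such as snakemake_config.json."""
    with open(path) as handle:
        return json.load(handle)
```

Calling `missing_paths(load_config("snakemake_config.json"))` from the batch folder returns an empty list when every path-like value resolves. Entries that point to outputs the pipeline has yet to create would be reported too, so the result is a hint to review, not a hard failure.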

Then run Snakemake with the target you need:

# Create the result summary of the alignment outputs and readcount TSVs
snakemake --configfile=snakemake_config.json -s ../pipeline_workflow/Snakefile \
    --cores 54 -p \
    --resources io_heavy=5 -- \
    make_analysis_summary

# Only the alignment
snakemake ... star_align_all_samples

# All readcount and FPKMs
snakemake ... all_fpkms

Processing description

Genome alignment

Reads are aligned with STAR v2.7.4a against GDC's genome reference GRCh38.d1.vd1, using the GENCODE v34 annotation.

Readcount

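Read counts are generated by featureCounts (subread v2.0.1) in stranded mode with the parameters -g gene_id -t exon -Q 10 -p -B. In other words, exon features are grouped by gene_id, reads need a mapping quality of at least 10, paired-end reads are counted as fragments, and only fragments with both ends aligned are counted. The read counts are then converted to FPKM and FPKM-UQ using GDC's formula.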
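
For reference, the GDC documentation gives the conversion as FPKM = RC_g × 10^9 / (RC_pc × L) and FPKM-UQ = RC_g × 10^9 / (RC_g75 × L), where RC_g is the read count of the gene, L its length in bases, RC_pc the total read count over protein-coding genes, and RC_g75 the 75th percentile read count. The sketch below illustrates that arithmetic and is not the pipeline's own code. Which genes enter the two denominators, how gene length is measured, and the percentile interpolation method are assumptions here, so the caller supplies lengths and a reference-set mask; the function names are hypothetical.

```python
def upper_quartile(values):
    """75th percentile with linear interpolation between ranks (assumed method)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values given")
    position = 0.75 * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (position - lower) * (ordered[upper] - ordered[lower])


def fpkm_values(read_counts, gene_lengths, in_reference_set):
    """Return (fpkm, fpkm_uq) lists for one sample.

    read_counts      -- per-gene read counts (RC_g)
    gene_lengths     -- per-gene lengths in bases (L)
    in_reference_set -- per-gene booleans marking the genes that define the
                        denominators, e.g. protein-coding genes (assumption)
    """
    reference = [rc for rc, keep in zip(read_counts, in_reference_set) if keep]
    total = sum(reference)            # RC_pc
    quartile = upper_quartile(reference)  # RC_g75
    if total == 0 or quartile == 0:
        raise ValueError("reference read counts are too sparse to normalize")
    fpkm = [
        rc * 1e9 / (total * length)
        for rc, length in zip(read_counts, gene_lengths)
    ]
    fpkm_uq = [
        rc * 1e9 / (quartile * length)
        for rc, length in zip(read_counts, gene_lengths)
    ]
    return fpkm, fpkm_uq
```

Genes outside the reference set still receive values; they only do not contribute to the denominators. FPKM-UQ replaces the total count with the upper quartile, which makes the scaling less sensitive to a few very highly expressed genes.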

Annotation

Gene annotation comes from the GENCODE v34 GTF. Refer to the prepare_annotation/ folder for how the annotation files were generated.