The pipeline aligns bulk RNA-seq reads and produces aligned BAMs, read counts, and FPKM/FPKM-UQ values.
- v1.0: Initial release (git commit 4351a90f284e5a4a7902135ab698fe0a8ac356ac)
Each batch run produces a TSV file, analysis_summary.dat, that lists all results.
Possible result types are:
- genomic_bam
- transcriptomic_bam
- chimeric_sam
- fpkm_tsv
- splic_junction_tab
Each sample gets its own fpkm_tsv TSV file, with genes listed in the exact same order across samples. The output TSV file has the following columns:
Column | Description |
---|---|
gene_id | Ensembl gene ID |
symbol | Gene symbol (not unique) |
hgnc_id | HGNC ID (not unique) |
read_count | read count by featureCounts |
fpkm | FPKM |
fpkm_uq | FPKM-UQ |
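Because every per-sample TSV is expected to share the same gene order, a quick sanity check before stacking samples into a matrix can be useful. The following is a minimal sketch, not part of the pipeline: the helper names `read_gene_ids` and `same_gene_order` are hypothetical, and it assumes each file has a header row containing the `gene_id` column from the table above.

```python
import csv
from typing import Iterable, List


def read_gene_ids(path: str) -> List[str]:
    """Return the gene_id column of one fpkm_tsv file, in file order.

    Assumes a tab-separated file whose header row contains 'gene_id'.
    """
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh, delimiter="\t")
        return [row["gene_id"] for row in reader]


def same_gene_order(paths: Iterable[str]) -> bool:
    """Check that all given TSV files list genes in the same order."""
    paths = list(paths)
    if not paths:
        return True
    # Use the first file as the reference ordering.
    reference = read_gene_ids(paths[0])
    return all(read_gene_ids(p) == reference for p in paths[1:])
```

For example, `same_gene_order(["sample_a.tsv", "sample_b.tsv"])` (hypothetical file names) returns `True` only when both files list identical gene IDs in identical order.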
The easiest way to set up the pipeline is to create a conda environment with all the dependencies:
conda create -n htan_bulk_rna python=3.8 \
snakemake-minimal=5.19.1 \
pandas=1.0.4 \
star=2.7.4a \
samtools=1.10 htslib=1.10 \
subread=2.0.1
- Copy example_batch/ to the desired location to store the output
- Modify the snakemake_config.json to ensure all file paths exist
- Define the file map and the list of samples to run the pipeline (same format as the example)
# Create the result summary of the alignment outputs and readcount TSVs
snakemake --configfile=snakemake_config.json -s ../pipeline_workflow/Snakefile \
--cores 54 -p \
--resources io_heavy=5 -- \
make_analysis_summary
# Only the alignment
snakemake ... star_align_all_samples
# All readcount and FPKMs
snakemake ... all_fpkms
Reads are aligned with STAR v2.7.4a using GDC's genome reference GRCh38.d1.vd1 and the GENCODE v34 annotation.
The readcount is generated by featureCounts (subread v2.0.1) under stranded mode with parameters: -g gene_id -t exon -Q 10 -p -B. The readcount is later converted to FPKM and FPKM-UQ using GDC's formula.
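The conversion can be illustrated with a short sketch. It assumes the GDC definitions as commonly documented: FPKM = RC_g * 10^9 / (RC_pc * L) and FPKM-UQ = RC_g * 10^9 / (RC_g75 * L), where RC_g is the gene's read count, L is the gene length in bases, RC_pc is the total read count over protein-coding genes, and RC_g75 is the 75th percentile read count. This is not the pipeline's own code. The function names are hypothetical, and both the percentile interpolation and the choice to take the percentile over protein-coding genes are assumptions that may differ from the actual implementation.

```python
from typing import Dict, Iterable, Set, Tuple


def upper_quartile(values: Iterable[float]) -> float:
    """75th percentile using linear interpolation between ranks (an assumption)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values to compute the upper quartile from")
    pos = 0.75 * (len(xs) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def convert_counts(
    counts: Dict[str, float],
    lengths: Dict[str, float],
    protein_coding: Set[str],
) -> Dict[str, Tuple[float, float]]:
    """Return {gene_id: (fpkm, fpkm_uq)} for one sample.

    counts: read count per gene; lengths: gene length in bases;
    protein_coding: gene IDs used for both normalization denominators.
    """
    pc_counts = [rc for gene, rc in counts.items() if gene in protein_coding]
    rc_pc = sum(pc_counts)             # total protein-coding read count
    rc_g75 = upper_quartile(pc_counts)  # 75th percentile read count
    if rc_pc == 0 or rc_g75 == 0:
        raise ValueError("normalization denominator is zero")
    result = {}
    for gene, rc in counts.items():
        length = lengths[gene]
        fpkm = rc * 1e9 / (rc_pc * length)
        fpkm_uq = rc * 1e9 / (rc_g75 * length)
        result[gene] = (fpkm, fpkm_uq)
    return result
```

Both values scale the raw count by gene length. They differ only in the per-sample denominator, which is why FPKM-UQ is less sensitive to a few very highly expressed genes.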
GENCODE v34 GTF. Refer to the prepare_annotation/ folder for how the annotation files were generated.