/hahmmr

Haplotype-aware Hidden Markov Models for detecting CNVs from bulk RNA-seq

Primary LanguageROtherNOASSERTION

HaHMMR

<kharchenkolab> DOI CRAN status CRAN downloads

Haplotype-aware Hidden Markov Model for RNA (HaHMMR) is a method for detecting copy number variations (CNVs) from bulk RNA-seq data. Extending the haplotype-aware HMM in Numbat for single-cell RNA-seq, HaHMMR offers enhanced capabilities for detecting low-clonality CNVs from bulk data.

This repo contains HaHMMR as a standalone package and as a submodule of Numbat (v1.4.0+).

For details of the method, please check out our paper(s):

Teng Gao, Maria Eleni Kastriti, Viktor Ljungström, Andreas Heinzel, Arthur S. Tischler, Rainer Oberbauer, Po-Ru Loh, Igor Adameyko, Peter J. Park & Peter V. Kharchenko. A pan-tissue survey of mosaic chromosomal alterations in 948 individuals. Nature Genetics (2023).

Teng Gao, Ruslan Soldatov, Hirak Sarkar, Adam Kurkiewicz, Evan Biederstedt, Po-Ru Loh, Peter Kharchenko. Haplotype-aware analysis of somatic copy number variations from single-cell transcriptomes. Nature Biotechnology (2022).

Installation

Install via CRAN:

install.packages("hahmmr")

Or install the latest GitHub version using devtools:

devtools::install_github("https://github.com/kharchenkolab/hahmmr")

Usage

Preparing data

First, obtain expression counts and phased allele counts from the RNA-seq sample. The expression counts can be prepared using a transcript quantification tool such as Salmon. The phased allele counts can be prepared using the pileup_and_phase.R pipeline from Numbat. A Docker container is available for running this pipeline.

For example, within the Numbat Docker you can run pileup_and_phase in bulk RNA-seq mode like this:

Rscript /numbat/inst/bin/pileup_and_phase.R \
    --bulk \
    --label {sample} \
    --samples {sample} \
    --bams /mnt/mydata/{sample}.bam \
    --outdir /mnt/mydata/{sample} \
    --gmap /Eagle_v2.4.1/tables/genetic_map_hg38_withX.txt.gz \
    --snpvcf /data/genome1K.phase3.SNP_AF5e2.chr1toX.hg38.vcf \
    --paneldir /data/1000G_hg38 \
    --ncores ncores

The integer expression counts (count_mat) should be a one-column matrix where rownames are genes and colname is the sample name. The phased allele counts (df_allele) should be a dataframe containing columns snp_id, CHROM, POS, cM (genetic distance in centimorgan), REF, ALT, AD (ALT allele count), DP (total allele count), GT (phased genotype), gene.

Running HaHMMR

Here is an example using the RNA-seq samples from a Meningioma study.

library(dplyr)
library(hahmmr)
allele_counts = data.table::fread('http://pklab.med.harvard.edu/teng/data/hmm_example/MN-5_TUMOR_allele_counts.tsv.gz')
gene_counts = readRDS(url('http://pklab.med.harvard.edu/teng/data/hmm_example/MN_gene_counts.rds'))

Sample MN-1037 has a diploid genome so we can use it to create a reference expression profile.

ref_internal = gene_counts[,'MN-1037_TUMOR',drop=F] %>% {./sum(.)}
head(ref_internal)
##          MN-1037_TUMOR
## 7SK       0.000000e+00
## A1BG      1.107976e-06
## A1BG-AS1  5.003764e-07
## A1CF      3.574117e-08
## A2ML1     3.931529e-07
## A4GALT    9.314150e-05

We can now analyze it using HaHMMR.

sample = 'MN-5_TUMOR'

bulk = get_bulk(
        count_mat = gene_counts[,sample,drop=F],
        df_allele = allele_counts,
        lambdas_ref = ref_internal,
        gtf = gtf_hg38
    ) %>% 
    analyze_joint()

bulk %>% plot_psbulk(min_depth = 15)