/SCOTCH

Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing

Primary LanguagePythonMIT LicenseMIT

SCOTCH

SCOTCH_logo

Single-Cell Omics for Transcriptome CHaracterization (SCOTCH): isoform-level characterization of gene expression through long-read single-cell RNA sequencing. See our pre-print at here. All codes and analysis results in the manuscript can be found at here

Background

Recent development involving long-read single-cell transcriptome sequencing (lr-scRNA-Seq) represents a significant leap forward in single-cell genomics. With the recent introduction of R10 flowcells by Oxford Nanopore, computational methods should now shift focus on harnessing the unique benefits of long reads to analyze transcriptome complexity. In this context, we introduce a comprehensive suite of computational methods named Single-Cell Omics for Transcriptome CHaracterization (SCOTCH). Our method is compatible with the single-cell library preparation platform from both 10X Genomics and Parse Biosciences, facilitating the analysis of special cell populations, such as neurons, hepatocytes and developing cardiomyocytes.

SCOTCH provides a preprocessing pipeline and a statistical pipeline. The preprocessing pipeline takes BAM files with tagged barcodes generated by vendor-supplied pipelines (wf-single-cell and parse) as input to align reads to known and novel isoforms, and outputs count matrix on both gene and transcript levels. The statistical pipeline facilitates analysis of differential transcript usage (DTU), defined by relative transcript abundances of the same gene.

Installation

To avoid package version conflicts, we strongly recommand to use conda to set up the environment. If you don't have conda installed, you can run codes below in linux to install.

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh 
bash Miniconda3-latest-Linux-x86_64.sh

After installing conda, SCOTCH sources can be downloaded: git clone https://github.com/WGLab/SCOTCH.git

cd SCOTCH
conda env create --name SCOTCH --file scotch.yml
conda activate SCOTCH

Run SCOTCH preprocessing pipeline

Step1: prepare annotation file

This step will generate annotation files for reference genome and tagged bam files. Set --workers argument for parallel computing. Set --bam argument as path to the folder saving separated bam files or path to a single bam file.

bamfolder="path/to/tagged/bamfile"
outputfolder="path/to/output"
reference="path/to/genome/reference/annotation.gtf"

prepare.sh -t annotation -o $outputfolder --bam $bamfolder --reference $reference --workers 30

Step2: generate compatible matrix

This step will generate read-isoform compatible matrix. -o and --bam are the same with step1. Set -match argument for the threshold of read-exon mapping percentage. For example, setting 0.2 means reads covers >80% of the exon length as mapped, and reads covers <20% of the exon length as unmapped.

prepare.sh -t matrix -o $outputfolder --bam $bamfolder -match 0.2 

To speed up the step, job arrays can be submitted in SLURM. For example:

#! /bin/bash -l
#SBATCH --nodes=1
#SBATCH --ntasks=1 
#SBATCH --cpus-per-task=1
#SBATCH --mem=200G
#SBATCH --array=0-99
source ~/.bashrc
conda activate SCOTCH

bamfolder="path/to/tagged/bamfile"
outputfolder="path/to/output"

prepare.sh -t matrix -o $outputfolder --bam $bamfolder -match 0.2 --job_index ${SLURM_ARRAY_TASK_ID} --jobs 100

Step2: generate count matrix

This step will generate gene- and isoform-level copunt matrix. -o and --bam are the same with step1. Set '-novel_read_n' for the threshold of filtering novel isoform. Novel isoform with the number of mapped reads below this threshold will be treated as uncategorized.

prepare.sh -t count -o $outputfolder --novel_read_n 10 --workers 20

Run SCOTCH statistical pipeline

Installation

In R, run below codes to install SCOTCH statistical pipeline.

if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
library("devtools")
install_github("WGLab/SCOTCH")

DTU analysis

Below is sample codes for DTU analysis. Example data can be found in /data folder.

library(SCOTCH)

#----read gene-level count matrix-----#
sample8_CD4_gene=as.matrix(read.csv('gene_count_1.csv',row.names = 'X'))
sample8_CD8_gene=as.matrix(read.csv('gene_count_2.csv',row.names = 'X'))

#----read transcript-level count matrix-----#
sample8_CD4_transcript=as.matrix(read.csv("transcript_count_1.csv",row.names = 'X'))
gene_transcript_CD4_df = data.frame(genes=str_remove(colnames(sample8_CD4_transcript),"\\.(ENST|novel|uncategorized).+"),
                                    transcripts=colnames(sample8_CD4_transcript))

sample8_CD8_transcript=as.matrix(read.csv("transcript_count_2.csv",row.names = 'X'))
gene_transcript_CD8_df = data.frame(genes=str_remove(colnames(sample8_CD8_transcript),"\\.(ENST|novel|uncategorized).+"),
                                    transcripts=colnames(sample8_CD8_transcript))

#----gene-level analysis-----#
df_gene = scotch_gene(sample8_CD4_gene, sample8_CD8_gene), epsilon=0.01,ncores=10)%>%
  filter(pct1>=0.01|pct2>=0.01)

#----transcript-level analysis-----#
df_transcript = scotch_transcript(gene_transcript_CD4_df,gene_transcript_CD8_df, 
                                  sample8_CD4_transcript, sample8_CD8_transcript, ncores=10)
df_scotch = df_gene%>%left_join(df_transcript,by=c('genes'='gene'))

SCOTCH will output the following statistics in results:

genes: gene name
pct1: percent of cells expression the gene in cell population 1
pct2: percent of cells expression the gene in cell population 2
logFC: fold change in log scale
p_gene: p value on the gene level
pct_diff: difference between pct1 and pct2
p_gene_adj: adjusted p value of differential gene expression
isoforms: isoform name
alpha1: estimated parameter of the dirichlet distribution for cell population 1
alpha2: estimated parameter of the dirichlet distribution for cell population 2
TU1: relative transcript usage of cell population 1
TU2: relative transcript usage of cell population 2
TU_diff: difference of TU1 and TU2
TU_var1: variance of TU1
TU_var2: variance of TU2
dispersion1: dispersion parameter for cell population 1
dispersion2: dispersion parameter for cell population 2
isoform_switch: whether there is isoform switching event
isoform_switch_ES: effect size of isoform switching event
p_DTU_gene: p value for differential transcript usage analysis for the whole gene
p_transcript: p value for differential transcript usage analysis for the specific transcript
p_transcript_adj: adjusted p value for differential transcript usage analysis for the specific transcript
p_DTU_gene_adj: adjusted p value for differential transcript usage analysis for the whole gene