ODEGRfinder

An NMF-based approach to discover overlooked differentially expressed gene regions from single-cell RNA-seq data

Reference

Under submission

Requirements

ODEGRfinder is mainly written with R and uses "NMF" library. We also use "GenomicRanges", "rtracklayer", "Rsamtools", and "stringr" library in data pre-processing. In the manuscript, the following package versions were used.

NMF: 0.20.5
GenomicRanges: 1.30.3
rtracklayer: 1.38.3
Rsamtools: 1.30.0
stringr: 1.2.0

Download

git clone https://github.com/hmatsu1226/ODEGRfinder
cd ODEGRfinder

Or download from "Download ZIP" button and unzip it.

Dataset in the manuscript

The mES-PrE dataset and hNSC-NC dataset are available at https://doi.org/10.6084/m9.figshare.7410509.v1 and https://doi.org/10.6084/m9.figshare.7410512, respectively.

A small dataset to demo the code

Running ODEGRfinder with a small dataset (20 genes and 185 cells). The computational time is about 5 minutes with MacBook Pro (2.5GHz Intel Core i7 and 16GB 2133MHz LPDDR3). The detailed explanation of each step are described at NMF, t-test, and Calculate Delta(Tnmf-Ttpm).

NMF for read count matrix

Rscript NMF_for_countdata.R demo_data/isoverlap.txt demo_data/isunmappable.txt demo_count_data demo_out 1 20 185 2 123456 lee
Rscript NMF_for_countdata.R demo_data/isoverlap.txt demo_data/isunmappable.txt demo_count_data demo_out 1 20 185 5 123456 lee
Rscript NMF_for_countdata.R demo_data/isoverlap.txt demo_data/isunmappable.txt demo_count_data demo_out 1 20 185 10 123456 lee

t-test for NMF results

Rscript ttest_NMF.R demo_data/cell_label.txt demo_out/NMF_2_coef_1_20.txt demo_out/ttest_result_NMF_2.txt 20 2
Rscript ttest_NMF.R demo_data/cell_label.txt demo_out/NMF_5_coef_1_20.txt demo_out/ttest_result_NMF_5.txt 20 5
Rscript ttest_NMF.R demo_data/cell_label.txt demo_out/NMF_10_coef_1_20.txt demo_out/ttest_result_NMF_10.txt 20 10

t-test for TPM matrix

Rscript ttest_TPM.R demo_data/TPM_ES_PrE.txt demo_data/TPM_transcriptid.txt demo_data/cell_label.txt demo_data/mygtf_gene.txt demo_data/transcriptid_to_geneid.txt demo_out/ttest_result_TPM.txt

Calculate Delta(Tnmf-Ttpm)

Rscript calc_DeltaT_NMF_TPM.R demo_out demo_out demo_out/DeltaT_NMF_TPM.txt 20 10

Pre-processing

Make read count matrix

Make read count matrix for each gene region from bigwig files of scRNA-seq.

Usage

Rscript extract_count_data_from_bw.R <Input_file1> <Input_file2> <Output_dir> <idx1> <idx2>

Input_file1 : annotation file of target gene regions
Input_file2 : paths to bigwig files
Output_dir : output directory
idx1 : index 1
idx2 : index 2

Example

Rscript extract_count_data_from_bw.R ES_PrE/data/mygtf_gene.txt ES_PrE/fbw_ES_PrE.txt ES_PrE/count_data 1 1000

Format of Input_file1

Input_file1 is the gtf file of length G, where G is the number of target gene regions. Each row represents a region of a gene.

Example of Input_file1

chr1	HAVANA	gene	5070018	5162529	.	+	.	gene_id "ENSMUSG00000033793.12"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Atp6v1h"; level 2; havana_gene "OTTMUSG00000050145.9";
chr1	HAVANA	gene	6206197	6276648	.	+	.	gene_id "ENSMUSG00000025907.14"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Rb1cc1"; level 2; havana_gene "OTTMUSG00000033467.12";
chr1	HAVANA	gene	7088920	7173628	.	+	.	gene_id "ENSMUSG00000051285.17"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "Pcmtd1"; level 2; havana_gene "OTTMUSG00000043373.5";
...

Format of Input_file2

Input_file2 is the list of the paths for bigwig file of each cell. Because of the large size of bigwig files, the example datasets in the figshare do not contain this file and bigwig files.

Example of Input_file2

ES_PrE/data/bw/RamDA_00h_A04.bw
ES_PrE/data/bw/RamDA_00h_A05.bw
ES_PrE/data/bw/RamDA_00h_A06.bw
...

Output_dir

The read count matrix for each gene region are saved in the Output_dir directory. The Output_dir/data_n.txt represents the C x L count matrix for the n-th gene region in Input_file1, where C is the number of cells and L is the number of bins for n-th gene.

idx1 and idx2

The read count matrix from idx1-th row to idx2-th row in "Input_file1" are calculated. The count matrix can be calculated in parallel by dividing the start and end index.

Make "isoverlap.txt" file

Make "isoverlap.txt" file to check whether a bin in a gene region is overlapped with different genes.

Usage

ruby extract_overlap.rb <Input_file1> <Input_file2> <Output_file> <G>

Input_file1 : GENCODE annotation file
Input_file2 : annotation file of target gene regions
Ouput_file : "isoverlap.txt"
G : the number of target gene regions

Example

ruby extract_overlap.rb mm10/gencode.vM9.annotation.gtf ES_PrE/data/mygtf_gene.txt ES_PrE/data/isoverlap.txt 4965

Format of Input_file1

Input_file1 is the GENCODE gene annotation file and is downloaded from https://www.gencodegenes.org.

Format of Output_file ("isoverlap.txt")

Each row of Output_file contains tab-separated values (0 or 1) of length L, where L is the number of bins for a gene. The value 1 indicates that the bin is overlapped with different genes.

Make "isunmappable.txt" file

Make "isunmappable.txt" file to check whether a mappability of a bin in a gene region is low.

Usage

rm <Output_file>
Rscript extract_mappability.R <Input_file1> <Input_file2> <Output_file> <a>

Input_file1 : Mappability file
Input_file2 : Gene regions annotation file
Output_file : "isunmappable.txt"
a : threshold to determine isunmappable

Example

rm ES_PrE/data/isunmappable.txt
Rscript extract_mappability.R mm10/k24.umap.bw ES_PrE/data/mygtf_gene.txt ES_PrE/data/isunmappable.txt 0.5

Format of Input_file1

The bigwig file of mappability of 24-bp and is downloaded from https://bismap.hoffmanlab.org.

Format of Output_file ("isunmappable.txt")

Each row of Output_file contains tab-separated values (0 or 1) of length L, where L is the number of bins for a gene. The value 1 indicates that the minimum of the mappability of the bin is less than "a".

NMF

NMF for read count matrix in parallel

Run NMF for read count matrix of each gene regions.

Usage

Rscript NMF_for_countdata.R <Input_file1> <Input_file2> <Data_dir> <Output_dir> <idx1> <idx2> <C> <K> <seed> <method>

Input_file1 : isoverlap.txt
Input_file2 : isunmappable.txt
Data_dir : directory of read count data
Output_dir : output directory
idx1 : start index
idx2 : end index
C : the number of cells
K : the factorization rank of NMF
seed : numerical seed
method : NMF algorithm, such as "lee", "snmf/l", and "snmf/r" (see https://cran.r-project.org/web/packages/NMF/index.html)

Example

Rscript NMF_for_countdata.R ES_PrE/data/isoverlap.txt ES_PrE/data/isunmappable.txt ES_PrE/count_data ES_PrE/out1 1 2000 185 5 123456 lee

Data_dir and idx1,2

The script computes NMF for Data_dir/data_idx1.txt, Data_dir/data_(idx1+1).txt, ..., Data_dir/data_idx2.txt

Output_dir

The script output Output_dir/NMF_K_coef_idx1_idx2.txt, which is the C x ((idx2-idx1+1)*K) matrix. The coefficient matrix of i-th gene is saved column from (i-idx1)*K+1 to (i-idx1)*K+K of NMF_K_coef_idx1_idx2.txt.

Combine NMF results

Combine Output_dir/NMF_K_coef_a_b.txt, Output_dir/NMF_K_coef_c_d.txt, .....

Usage

Rscript merge_NMF_coef.R <Input_dir> <C> <G> <K> <idxs1> <idxs2>

Input_dir : the directory of Output_dir in NMF_for_countdata.R
C : the number of cells
G : the number of genes
K : the factorization rank of NMF
Idxs1 : list of index of idx1 in NMF_for_countdata.R (separated with ",")
Idxs2 : list of index of idx2 in NMF_for_countdata.R (separated with ",")

Example

Rscript merge_NMF_coef.R ES_PrE/out1 185 4965 5 1,2001 2000,4965

Output

The script output Input_dir/NMF_K_coef_all.txt, which is the C x (G*K) matrix.

t-test

t-test for NMF results

Usage

Rscript ttest_NMF.R <Input_file1> <Input_file2> <Output_file> <G> <K>

Input_file1 : cell labels
Input_file2 : coefficients of NMF
Output_file : result of t-test
G : the number of genes
K : the factorization rank of NMF

Example

Rscript ttest_NMF.R ES_PrE/data/cell_label.txt ES_PrE/out1/NMF_5_coef_all.txt ES_PrE/out1/ttest_result_NMF_5.txt 4965 5

Format of Input_file1

The Input_file1 is the C x 2 Cell label table. The first column corresponds to the sample names, and the second column corresponds to the cluster labels (1 or 2).

Example of Input_file1

PDIS9061.001	1
PDIS9061.002	1
PDIS9061.003	2

Format of Input_file2

Input_file2 is the results of NMF and correspond to Input_dir/NMF_K_coef_all.txt.

Format of Output_file

Output_file is the G x (K*2) matrix. Each row represents the result of t-test for coefficient of NMF. The 1st to Kth column represents the t-statistics for each coefficient elements, and the (K+1)th to 2*K th column represent the -log10(p-value) of corresponding t-statistics.

t-test for TPM matrix

Usage

Rscript ttest_TPM.R <Input_file1> <Input_file2> <Input_file3> <Input_file4> <Input_file5> <Output_file>

Input_file1 : TPM matrix including header and rowname
Input_file2 : list of transcript id
Input_file3 : list of cell labels
Input_file4 : gene regions annotation file
Input_file5 : correspondence table between transcript id and gene id
Output_file : result of t-test

Example

Rscript ttest_TPM.R ES_PrE/data/TPM_ES_PrE.txt ES_PrE/data/TPM_transcriptid.txt ES_PrE/data/cell_label.txt ES_PrE/data/mygtf_gene.txt ES_PrE/data/transcriptid_to_geneid.txt ES_PrE/out/ttest_result_TPM.txt

Format of Input_file1

The Input_file1 is the T x C TPM matrix including header and row name, where T is the number of transcripts and C is the number of cells. The first row corresponds to the sample name, and the first column corresponds to the transcript id.

Format of Input_file2

The Input_file2 is the list of transcript id of length T.

Format of Input_file5

The Input_file5 is the correspondence table between transcript id and gene id. The first column corresponds to the transcript id, and the second column corresponds to the gene id. Each row represents the correspondence between a transcript id and a gene id.

Format of Output_file

The Output_file represents the result of t-test for each gene region. The first and second column represents T+ and T-, respectively. The third and fourth column represents the -log10(p-value) of T+ and T-, respectively. The fifth and sixth column represents the transcript ids correspond to T+ and T-, respectively.

t-test for Mean

Usage

Rscript ttest_Mean.R <Input_file1> <Input_file2> <Input_file3> <Input_dir> <Output_file> <G>

Input_file1 : list of cell labels
Input_file2 : isoverlap.txt
Input_file3 : isunmappable.txt
Input_dir : directory of read count data
Output_file : result of t-test
G : the number of genes

Example

Rscript ttest_Mean.R ES_PrE/data/cell_label.txt ES_PrE/data/isoverlap.txt ES_PrE/data/isunmappable.txt ES_PrE/count_data ES_PrE/out/ttest_result_Mean.txt 4965

Format of Output_file

The Output_file represents the result of t-test of the mean values of mapped count for each gene region. The first column represents T and the second column represents -log10(p-value), respectively.

Calculate Delta(Tnmf-Ttpm)

Usage

Rscript calc_DeltaT_NMF_TPM.R <Input_dir1> <Input_dir2> <Output_file> <G> <a>

Input_dir1 : directory for t-test results of NMF
Input_dir2 : directory for t-test result of TPM
Output_file : output file
G : the number of genes
a : threshold to ignore DE when Ttpm is sufficiently large

Example

Rscript calc_DeltaT_NMF_TPM.R ES_PrE/out1 ES_PrE/out ES_PrE/out1/DeltaT_NMF_TPM.txt 4965 10

Format of Output_file

The Output_file represents Delta(Tnmf-Ttpm) values for each gene region. The first column represents Delta(Tnmf-Ttpm) values and the second column represents the rank of the score in descending order. Each row represents the result corresponding to the row of gene regions annotation file.

permutation-based test for Delta(Tnmf-Ttpm)

t-test for NMF results with shuffled cell labels.

Usage

Rscript ttest_NMF_label_shuffling.R <Input_file1> <Input_file2> <Output_file> <G> <K>

Input_file1 : cell labels
Input_file2 : coefficients of NMF
Output_file : result of t-test
G : the number of genes
K : the factorization rank of NMF

Example

Rscript ttest_NMF_label_shuffling.R ES_PrE/data/cell_label.txt ES_PrE/out1/NMF_2_coef_all.txt ES_PrE/out_shuffle1/ttest_result_NMF_2.txt 4965 2
Rscript ttest_NMF_label_shuffling.R ES_PrE/data/cell_label.txt ES_PrE/out1/NMF_5_coef_all.txt ES_PrE/out_shuffle1/ttest_result_NMF_5.txt 4965 5
Rscript ttest_NMF_label_shuffling.R ES_PrE/data/cell_label.txt ES_PrE/out1/NMF_10_coef_all.txt ES_PrE/out_shuffle1/ttest_result_NMF_10.txt 4965 10

Calculate Delta(Tnmf-Ttpm) for shuffled data

Example

Rscript calc_DeltaT_NMF_TPM.R ES_PrE/out_shuffle1 ES_PrE/out ES_PrE/out_shuffle1/DeltaT_NMF_TPM.txt 4965 10

Permutation-based test

Usage

Rscript permutation_based_test.R <Input_file1> <Input_file2> <Output_file>

Input_file1 : Delta(Tnmf-Ttpm) for original data
Input_file2 : Delta(Tnmf-Ttpm) for shuffled data
Output_file : result of permutation-based test

Example

Rscript permutation_based_test.R ES_PrE/out1/DeltaT_NMF_TPM.txt ES_PrE/out_shuffle1/DeltaT_NMF_TPM.txt ES_PrE/out_shuffle1/result_permutation_test.txt

Format of Output_file

The Output_file represents Delta(Tnmf-Ttpm) values and -log10(p-value) for each gene region. The first column represents Delta(Tnmf-Ttpm) values and the second column represents -log10(p-value). Each row represents the result corresponding to the row of gene regions annotation file.