A toolkit for comprehensive analysis on multiple platform transcriptome data
RNA-combine mainly contains three modules, NGS bulk RNA-seq data analysis, scRNA-seq data analysis, PacBio data analysis
Dependencies:
python>=3.7
R>=3.6
Build a new environment is suggested
conda create -n rna-combine python=3.7
source activate rna-combine
Download the codes to your directory and build the environment using below code
conda install --file=requirement.txt
To use function modules in each directories, first, you need to edit parameters for concerned modules in configuration file, such as the path of software, the path of raw sequencing data. This process is quite simple, because all parameters are explained in configuration file. Then, run corresponding bash commands.
Pre-process the bulf RNA-seq data.
Inputs are fastq files.
Outputs mainly include aligned bam files and gene count matrix, with intermediate files saved in 4 directories. 1.rm_rrna save sequences after removing rRNA sequences, with log files recording rRNA information for each sample. 2.trim save sequences after cutting adapters and low quality bases, with log files recording information of cut sequences. 3.align save aligned bam files, with log files recording alignment information. 4.count save gene count files which annotate sequences to annotated reference gtf files, with log and summary files recording annotation details.
To use it, first you need to edit the configuration file "conf_prerna.txt".
Second, it is optional to get the index files for rRNA and reference genome, depending on whether you have index files or not.
./build_index.sh
Third, simply run pre_process.sh
./pre_process.sh
Then all results will be outputted to same directory of fastq files.
Explore the differential expressed genes in different conditions, utilizing the bulk RNA-seq data.
Input is gene count file.
For outputs, "sample_correlation_plot.png" shows the sample distances.
"volcano_plot.png" illustrates the significantly differential genes in differential conditions.
"de_gene_foldchange_above_0.csv" and "de_gene_foldchange_above_0.csv" give more details about DE genes. Be sure to read comparison mode in the output.
To use it, first you need to edit the configuration file "conf_DE.txt".
Second, simply run "DE.sh"
./DE.sh
Function enrichment based on significant DE genes.
Input is a gene list.
For outputs, function_annotation.pdf shows the enriched Go terms of KEGG pathways.
To use it, first, you need to edit the configuration file "conf_function.txt".
Second you simply run "function.sh"
./function.sh
Then all results will be outputted to output directory.
Explore the gene co-expression pattern based on Spearman, Pearson, PCA-PMI(Juan Zhao et al. PNAS, 2016).
Input is a gene expression matrix, rows are genes, columns are samples.
For outputs, pdf files illustrate gene co-expression pattern, csv files give more details about coefficients.
To use it, first, you need to edit the configuration file "conf_relation.txt".
Second you simply run "relation.sh"
./relation.sh
Then all results will be outputted to output directory.
Explore differential gene co-expression patterns in different conditions, mainly utilizing SSN(Xiaoping Liu et al. Nucleic Acids Research, 2016).
For input, normal_gene_matrix is used to build background network, case_gene_matrix was added to background network to form perturbation network. csv format is needed.
Then the differential gene relations(differential edges) were calculated between normal samples and case samples.
For outputs, the coefficients of differential edges are given, including Pearson correlation deviation between normal samples and case samples, p-value for differential edges.
To use it, first, you need to edit the configuration file "conf_ssn.txt".
Second you simply run "SSN.sh"
./SSN.sh
Then all results will be outputted to output directory.
Detect variants based on RNA-seq data, including SNP and INDEL.
Inputs are bam and sam files produced by pre_process.sh
Two methods are provided, including GATK and strelka2.
For outputs, if you choose GATK for analysis, xxxxxx.pass.snp.indel.vcf.gz* file gives details of called SNPs and INDELs. If you choose strelka2 for analysis, somatic.snvs.vcf.gz gives details of called SNPs and somatic.indels.vcf.gz gives details of called INDELs. For GATK analysis, to use it, first, you need to edit the configuration file "conf_mutation.txt".
Second you simply run "gatk.sh"
./mutaion.sh
Then all results will be outputted to the directory of bam files.
For strelka2 analysis, to use it, first, you need to edit the configuration file "conf_mutation.txt". Second you simply run "strelka.sh"
./strelka.sh
Then all results will be outputted to the directory you given.
RNA splicing analysis includes three kinds of methods, including exon-based method: DEXseq, transcript-based method(not only):StringTie+ballgown, event-based method: rMATS
For outputs of DEXseq, Significant_differential_exons.csv give details about differential exons in different conditions, while exon_plot_for_one_gene.pdf visualize differential exon information for specific gene. For outputs of StringTie+ballgown, Significant_differential_transcripts.csv give details about differential transcripts in different conditions, while Transcipt_plot_for_one_gene.pdf visualize differential transcript information for specific gene. For outputs of rMATS, txt files give details about differential splicing events in different conditions(Search rMATS for description for each file), while event_plot save visualization information for all differential splicing events.
To use it, first, you need to edit the configuration file "conf_splicing.txt". Choose the method, then edit the corresponding module.
Second you simply run "Splicing_pipeline.sh"
./Splicing_pipeline.sh
Then all results will be outputted to output directory.
For scRNA data analysis, 7 functions are introduced.
Turn 10X fastq files to cell_gene matrix, only support data produced by 10X platform. This process mainly utilizes CellRanger software.
The gene-cell matrix was at "outs/filtered_feature_bc_matrix" directory. which could be the input of doublet_detection.sh module.
To use it, first you need to edit cell_ranger module of the configuration file conf_scRNA.txt.
Second you need simply run cell_clustering.sh
./cellranger.sh
Then the results will be outputted to output directory
Detect doublets based on gene-cell matrix
For outputs, umap_doublet_plot.png is UMAP plot of doublets and single cells. cell_gene_filtered_matrix.csv is gene-cell matrix under filtering of doublets, which could be input of cell_clustering.sh module
To use it, first you need to edit doublet_detection.sh module of the configuration file conf_scRNA.txt.
Second you need simply run doublet_detection.sh
./doublet_detection.sh
Then the results will be outputted to output directory
Designed for quality control, data dimension reduction, cell clustering, marker gene detection for each cluster et. This process mainly utilizes scanpy python package.
The input is 10X gene-cell matrix.
For output, "highest_expressed_genes.png" illustrates the highest expressed genes in this sample.
"PCA_variance_ratio.png" shows the variance of PCs, PC1 to 'elbow' PC of this plot are commonly used for subsequent analysis, which is corresponding to "num_of_PCs_1" parameter in configuration file.
"violin_for_quality.png" shows the quality of your data includes the number of genes, the number of UMIs, the percentage of mitochondrial genes.
"umap_plot_clusters.png" shows the cell clusters enriched.
"object.h5ad" is the scanpy object that contains all information of analyzed data, which could be input for othe analyses.
To use it, first you need to edit cell_clustering module of the configuration file conf_scRNA.txt.
Second you need simply run cell_clustering.sh
./cell_clustering.sh
Then the results will be outputted to output directory
Plot the gene expression in the form of violin plot or UMAP plot.
First you need to edit plot_gene_scrna module of the configuration file conf_scRNA.txt .
Second you need simply run plot_gene_scrna.sh
./plot_gene_scrna.sh
Then the results will be outputted to output directory
Predict the cell type based on your provided significant gene(SG) list. For each SG, the program searches corresponding cell type to it in database. Finally, the program summarize what cell type it is most like. Right now, cell marker datasets in CellMarker database and PanglaoDB database are downloaded and cleaned, you can choose to use them.
To use it, first you need to edit search_cell_type module of the configuration file conf_scRNA.txt .
Second you need simply run search_cell_type.sh
./search_cell_type.sh
Then the results will be outputted to output directory
After searching the cell type of each cell cluster, you need to label the cell type information in scRNA object.
To use it, first you need to edit label_celltype module of the configuration file conf_scRNA.txt.
Second you need simply run label_celltype.sh
./label_celltype.sh
Then the results will be outputted to output directory
You may want to detect the evolutionary relations among different cell types, you can use this function to do so.
To use it, first you need to edit trajectory module of the configuration file conf_scRNA.txt.
Second you need simply run trajectory.sh
./trajectory.sh
Then the results will be outputted to output directory
For PacBio RNA data analysis, 2 functions are introduced.
Run install.sh before run pipelines for installing needed software.
Turn raw read bam files to full-length, non-concatemer, unique transcript files. Fasta outputs are divided into HQ and LQ files, and only one consensus for one transcript cluster. To use it, first you need to edit Iso-seq.sh module of the configuration file conf_pacbio.txt. Second you need simply run Iso-seq.sh
./Iso-seq.sh
Then the results will be outputted to output directory. Among all of results, (prefix).hq.fasta.gz are isoforms with predicted accuracy ≥ 0.99, (prefix).lq.fasta.gz are isoforms with predicted accuracy < 0.99.
Align sub-reads to reference genome, two methods are provided, including minimap2 and blasr, all take sequenced bam files as input.
For output, if you choose blasr, ...alignments.bam file is aligned file. If you choose minimap2, ...sam file is aligned file.
To use it, first you need to edit map.sh module of the configuration file conf_pacbio.txt.
Second you need simply run map.sh
./map.sh
Then the results will be outputted to output directory
Methods | Description |
---|---|
Sortmerna | Removing rRNA sequences |
Hisat2 | Aligning reads to reference genome |
samtools | Converting file formats and auxiliary functionalities |
featureCounts | Counting number of reads on genes |
DESeq2, limma, edgeR | Differential gene analysis |
ClusterProfiler | Function enrichment |
DEXseq, StringTie, rMATS | Splicing site detection |
GATK, strelka2 | Variant calling |
CellRanger | Produce barcode-gene matrix based on scRNA fastq files |
Scrublet | Doublet detection in scRNA data |
Scanpy | filtering, normalization, clustering, dimension reduction, trajectory analysis (and so on) of scRNA data |
isoseq3 | Produce full-length, non-concatemer, unique transcripts based on PacBio raw bam read files |
blasr, minimap2 | Map PacBio reads to reference genome |
Data | Source |
---|---|
Human reference genome | Baidu_disk:7788 |
Human gene annotation file | Baidu_disk:2233 |
Test bulk fastq file | Baidu_disk:1122 |
Test scRNA fastq file | Download by "fastq-dump --gzip --split-files SRR5799774.sra" |
Test IsoSeq bam file | link |
DONG Xuemin, Institute of Zoology, CAS, dongxuemin18@mails.ucas.ac.cn