The Mime
package provides a user-friendly solution for constructing
machine learning-based integration models from transcriptomic data.
With the widespread use of high-throughput sequencing technologies, understanding biology and cancer heterogeneity has been revolutionized. Mime streamlines the process of developing predictive models with high accuracy, leveraging complex datasets to identify critical genes associated with disease progression, patient outcomes, and therapeutic response.
It offers four main applications:
- Establishing prognosis models using 10 machine learning algorithms.
- Building binary response models with 7 machine learning algorithms.
- Conducting core feature selection related to prognosis using 8 machine learning methods.
- Visualizing the performance of each model.
You can install the development version of Mime like so:
# options("repos"= c(CRAN="https://mirrors.tuna.tsinghua.edu.cn/CRAN/"))
# options(BioC_mirror="http://mirrors.tuna.tsinghua.edu.cn/bioconductor/")
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
depens<-c('GSEABase', 'GSVA', 'cancerclass', 'mixOmics', 'sparrow', 'sva' , 'ComplexHeatmap' )
for(i in 1:length(depens)){
depen<-depens[i]
if (!requireNamespace(depen, quietly = TRUE)) BiocManager::install(depen,update = FALSE)
}
if (!requireNamespace("CoxBoost", quietly = TRUE))
devtools::install_github("binderh/CoxBoost")
if (!requireNamespace("fastAdaboost", quietly = TRUE))
devtools::install_github("souravc83/fastAdaboost")
if (!requireNamespace("Mime", quietly = TRUE))
devtools::install_github("l-magnificence/Mime")
library(Mime1)
This is a basic example which shows you how to use Mime
:
Mime need multiple cohorts containing transcriptional sequencing data with information of survival or clinical response to therapy as well as a gene set as inputs. You can dowmload the example data in External data
file to check the format of input.
load("./Example.cohort.Rdata")
list_train_vali_Data[["Dataset1"]][1:5,1:5]
#> ID OS.time OS MT-CO1 MT-CO3
#> TCGA.DH.A66B.01 1281.65322 0 13.77340 13.67931
#> TCGA.HT.7607.01 96.19915 1 14.96535 14.31857
#> TCGA.DB.A64Q.01 182.37755 0 13.90659 13.65321
#> TCGA.DU.8167.01 471.97707 0 14.90695 14.59776
#> TCGA.HT.7610.01 1709.53901 0 15.22784 14.62756
The first column ID
is the sample ID, second to third column OS.time
and OS
are the survival time and status of patients, other columns are the gene expression level scaled with log2(x+1). Dataset1
is the training dataset, while other Datasets as validation.
load("./Example.ici.Rdata")
list_train_vali_Data[["training"]][1:5,1:5]
#> ID Var FTH1 EEF1A1 ACTB
#> SAMf2ce197162ce N 10.114846 4.817746 11.230180
#> ERR2208915 Y 2.044180 5.038854 3.977902
#> G138701_RCCBMS-00141-T_v1_RNA_OnPrem Y 5.406008 5.341635 5.366668
#> SAMe41b1e773582 N 9.215794 4.707360 11.412721
#> SAM5ffd7e4cd794 N 9.003710 3.908884 10.440559
The first column ID
is the sample ID, second column Var
is the theraputic response of patients (N
: No response; Y
: response), other columns are the gene expression level scaled with log2(x+1). training
is the training dataset, while other Datasets as validation.
load("./genelist.Rdata")
#> [1] "MYC" "CTNNB1" "JAG2" "NOTCH1" "DLL1" "AXIN2" "PSEN2" "FZD1" "NOTCH4" "LEF1" "AXIN1" "NKD1" "WNT5B"
#>[14] "CUL1" "JAG1" "MAML1" "KAT2A" "GNAI1" "WNT6" "PTCH1" "NCOR2" "DKK4" "HDAC2" "DKK1" "TCF7" "WNT1"
#>[27] "NUMB" "ADAM17" "DVL2" "PPARD" "NCSTN" "HDAC5" "CCND2" "FRAT1" "CSNK1E" "RBPJ" "FZD8" "TP53" "SKP2"
#>[40] "HEY2" "HEY1" "HDAC11"
This gene set is associated with Wnt/β-catenin signalling from MSigDB.
- We recommend training dataset with more than 100 samples and gene set with more than 50 genes.
library(Mime)
load("./Example.cohort.Rdata")
load("./genelist.Rdata")
res <- ML.Dev.Prog.Sig(train_data = list_train_vali_Data$Dataset1,
list_train_vali_Data = list_train_vali_Data,
unicox.filter.for.candi = T,
unicox_p_cutoff = 0.05,
candidate_genes = genelist,
mode = 'all',nodesize =5,seed = 5201314 )
ML.Dev.Prog.Sig()
provides three modes includingall
,single
, anddouble
.all
means using all ten algorithms and the combinations.single
means using only one of the ten algorithms.double
means using the combination with two algorithms. In most casees, we will generally useall
mode to analysis data.- If you set
unicox.filter.for.candi
asT
(default),Mime
will firstly perform univariable cox regression among provided genes in the training dataset to screen out the prognostic variables which are then used to construct models.
Plot C-index of each model:
cindex_dis_all(res,validate_set = names(list_train_vali_Data)[-1],order =names(list_train_vali_Data),width = 0.35)
We can find that StepCox[forward] + plsRcox
have the highest C-index.
Plot C-index of specific model among different datasets:
cindex_dis_select(res,
model="StepCox[forward] + plsRcox",
order= names(list_train_vali_Data))
- If input object
res
is from modeall
used inML.Dev.Prog.Sig()
, you should define model as specific model name, while define model asSOD
.
Plot survival curve of patients according to risk score calculated by specific model among different datasets:
survplot <- vector("list",2)
for (i in c(1:2)) {
print(survplot[[i]]<-rs_sur(res, model_name = "StepCox[forward] + plsRcox",dataset = names(list_train_vali_Data)[i],
#color=c("blue","green"),
median.line = "hv",
cutoff = 0.5,
conf.int = T,
xlab="Day",pval.coord=c(1000,0.9)))
}
aplot::plot_list(gglist=survplot,ncol=2)
all.auc.1y <- cal_AUC_ml_res(res.by.ML.Dev.Prog.Sig = res,train_data = list_train_vali_Data[["Dataset1"]],
inputmatrix.list = list_train_vali_Data,mode = 'all',AUC_time = 1,
auc_cal_method="KM")
all.auc.3y <- cal_AUC_ml_res(res.by.ML.Dev.Prog.Sig = res,train_data = list_train_vali_Data[["Dataset1"]],
inputmatrix.list = list_train_vali_Data,mode = 'all',AUC_time = 3,
auc_cal_method="KM")
all.auc.5y <- cal_AUC_ml_res(res.by.ML.Dev.Prog.Sig = res,train_data = list_train_vali_Data[["Dataset1"]],
inputmatrix.list = list_train_vali_Data,mode = 'all',AUC_time = 5,
auc_cal_method="KM")
cal_AUC_ml_res()
also provides three modes, which should be consistent with mode uesd byML.Dev.Prog.Sig()
.AUC_time
for 1 year, 2 years, 3 years......, We recommend using the shortest survival time among all datasets.
Here, we only plot 1-year AUC predicted by all models:
auc_dis_all(all.auc.1y,
dataset = names(list_train_vali_Data),
validate_set=names(list_train_vali_Data)[-1],
order= names(list_train_vali_Data),
width = 0.35,
year=1)
Plot ROC of specific model among different datasets:
roc_vis(all.auc.1y,
model_name = "StepCox[forward] + plsRcox",
dataset = names(list_train_vali_Data),
order= names(list_train_vali_Data),
anno_position=c(0.65,0.55),
year=1)
Plot 1, 3, and 5-year AUC of specific model among different datasets:
auc_dis_select(list(all.auc.1y,all.auc.3y,all.auc.5y),
model_name="StepCox[forward] + plsRcox",
dataset = names(list_train_vali_Data),
order= names(list_train_vali_Data),
year=c(1,3,5))
unicox.rs.res <- cal_unicox_ml_res(res.by.ML.Dev.Prog.Sig = res,optimal.model = "StepCox[forward] + plsRcox",type ='categorical')
metamodel <- cal_unicox_meta_ml_res(input = unicox.rs.res)
meta_unicox_vis(metamodel,
dataset = names(list_train_vali_Data))
type
includescategorical
andcontinuous
.categorical
means that the patients are divided into two subgroups based on the median of the risk score.continuous
means performing the univariable Cox regression bu using the continuous risk score.
rs.glioma.lgg.gbm <- cal_RS_pre.prog.sig(use_your_own_collected_sig = F,type.sig = c('LGG','GBM','Glioma'),
list_input_data = list_train_vali_Data)
cal_RS_pre.prog.sig()
will calculate the risk score based on the signatures from previous papers.- If
use_your_own_collected_sig
is set as T, you should provide a data frame containing the information of the signatures. The column names of the data frame are model, PMID, Cancer, Author, Coef and symbol. Themodel
consists of the first name of the first author and PMID of paper.Cancer
uses abbreviations like the format of TCGA.Author
is the first name of the first author.Coef
is the coefficient of each variable.symbol
is the gene name. Otherwise, we use our collected models of glioma.
Compare the HR of specific model with previously published models:
HR_com(rs.glioma.lgg.gbm,
res,
model_name="StepCox[forward] + plsRcox",
dataset=names(list_train_vali_Data),
type = "categorical")
cc.glioma.lgg.gbm <- cal_cindex_pre.prog.sig(use_your_own_collected_sig = F,type.sig = c('Glioma','LGG','GBM'),
list_input_data = list_train_vali_Data)
cal_cindex_pre.prog.sig()
will calculate the C-index based on the signatures from previous papers like the fuctioncal_RS_pre.prog.sig()
.
Compare the C-index of specific model with previously published models:
cindex_comp(cc.glioma.lgg.gbm,
res,
model_name="StepCox[forward] + plsRcox",
dataset=names(list_train_vali_Data))
auc.glioma.lgg.gbm.1 <- cal_auc_pre.prog.sig(use_your_own_collected_sig = F,
type.sig = c('Glioma','LGG','GBM'),
list_input_data = list_train_vali_Data,AUC_time = 1,
auc_cal_method = 'KM')
cal_auc_pre.prog.sig()
will calculate the AUC based on the signatures from previous papers like the fuctioncal_RS_pre.prog.sig()
.AUC_time
is like the requirement bycal_AUC_ml_res()
.
Compare the AUC of specific model with previously published models:
auc_comp(auc.glioma.lgg.gbm.1,
all.auc.1y,
model_name="StepCox[forward] + plsRcox",
dataset=names(list_train_vali_Data))
After completing the risk grouping, users can perform downstream analysis on the grouped data. Here, we combine Mime with the R package immunedeconv to assist users in quickly previewing immune infiltration. If users require a more precise immune infiltration analysis, they can fine-tune the parameters themselves.
devo <- TME_deconvolution_all(list_train_vali_Data)
- If you want to use this function, you should install package
immunedeconv
ahead. TME_deconvolution_all()
includes 10 deconvolution methods ("quantiseq", "xcell", "epic", "abis", "mcp_counter", "estimate", "cibersort", "cibersort_abs", "timer", "consensus_tme") fromimmunedeconv::deconvolution_methods
. By default, deconvolution methods are set as ("xcell", "epic", "abis", "estimate", "cibersort", "cibersort_abs").
Show the results:
immuno_heatmap(res,
devo,
model_name="StepCox[backward] + plsRcox",
dataset="Dataset1")
load("./Example.ici.Rdata")
load("./genelist.Rdata")
res.ici <- ML.Dev.Pred.Category.Sig(train_data = list_train_vali_Data$training,
list_train_vali_Data = list_train_vali_Data,
candidate_genes = genelist,
methods = c('nb','svmRadialWeights','rf','kknn','adaboost','LogitBoost','cancerclass'),
seed = 5201314,
cores_for_parallel = 60
)
ML.Dev.Pred.Category.Sig()
develop the predictive model for the binary variables with machine learning algorithms.
Plot AUC of different methods among different datasets:
auc_vis_category_all(res.ici,dataset = c("training","validation"),
order= c("training","validation"))
Plot ROC of specific method among different datasets:
plot_list<-list()
methods <- c('nb','svmRadialWeights','rf','kknn','adaboost','LogitBoost','cancerclass')
for (i in methods) {
plot_list[[i]]<-roc_vis_category(res.ici,model_name = i,dataset = c("training","validation"),
order= c("training","validation"),
anno_position=c(0.4,0.25))
}
aplot::plot_list(gglist=plot_list,ncol=3)
Compared AUC with other published models associated with immunotherapy response:
auc.other.pre <- cal_auc_previous_sig(list_train_vali_Data = list_train_vali_Data,seed = 5201314,
train_data = list_train_vali_Data$training,
cores_for_parallel = 32)
cal_auc_previous_sig()
will calculate the AUC based on the signatures from previous papers for immunotherapy response.cores_for_parallel
means the cores you can choose for parallel operation. If multi-cores condition is error, please setcores_for_parallel
as 1.
Plot comparison results of specific model:
auc_category_comp(res.ici,
auc.other.pre,
model_name="svmRadialWeights",
dataset=names(list_train_vali_Data))
load("./Example.cohort.Rdata")
load("./genelist.Rdata")
res.feature.all <- ML.Corefeature.Prog.Screen(InputMatrix = list_train_vali_Data$Dataset1,
candidate_genes = genelist,
mode = "all",nodesize =5,seed = 5201314 )
ML.Corefeature.Prog.Screen()
provides three modes includingall
,single
, andall_without_SVM
.all
mode means using all eight methods ("RSF", "Enet", "Boruta", "Xgboost", "SVM-REF", "Lasso", "CoxBoost', "StepCox") for selecting.single
mode means using only one method for running. If you usesingle
mode,single_ml
should be specified among eight methods. Since SVM takes too long time, we define other seven methods used for selecting asall_without_SVM
mode.- The output genes are closely associated with patient outcome and higher frequence of screening means more critical.
Upset plot of genes filtered by different methods:
core_feature_select(res.feature.all)
Plot the rank of genes filtered by different methods:
core_feature_rank(res.feature.all, top=20)
Here, we randomly select top two genes to analyze their correlation:
dataset_col<-c("#3182BDFF","#E6550DFF")
corplot <- list()
for (i in c(1:2)) {
print(corplot[[i]]<-cor_plot(list_train_vali_Data[[i]],
dataset=names(list_train_vali_Data)[i],
color = dataset_col[i],
feature1="PSEN2",
feature2="WNT5B",
method="pearson"))
}
aplot::plot_list(gglist=corplot,ncol=2)
Plot survival curve of patients according to median expression level of specific gene among different datasets:
survplot <- vector("list",2)
for (i in c(1:2)) {
print(survplot[[i]]<-core_feature_sur("PSEN2",
InputMatrix=list_train_vali_Data[[i]],
dataset = names(list_train_vali_Data)[i],
#color=c("blue","green"),
median.line = "hv",
cutoff = 0.5,
conf.int = T,
xlab="Day",pval.coord=c(1000,0.9)))
}
aplot::plot_list(gglist=survplot,ncol=2)
If you use Mime in the study, please cite the following publication:
- Liu H, Zhang W, Zhang Y, et al. Mime: A flexible machine-learning framework to construct and visualize models for clinical characteristics prediction and feature selection. Comput Struct Biotechnol J. 2024.
Any technical question please list in Issues section.