/IRLS

Machine learning-based integrative analysis develops an immune-derived lncRNA signature for improving clinical outcomes in colorectal cancer

Primary LanguageR

IRLS

Machine learning-based integrative analysis develops an immune-derived lncRNA signature for improving clinical outcomes in colorectal cancer

Consensus clustering

According to the infiltration profile of various immune cells, a resampling-based method termed consensus clustering was applied for cluster discovery. This process was performed by ConsensusClusterPlus package[1]. Subsample 80% of samples at each iteration and partition each subsample into up to k (max K = 9) groups by k-means algorithm upon Euclidean distance. This process was repeated for 1,000 repetitions. Subsequently, the consensus score matrix, cumulative distribution function (CDF) curve, proportion of ambiguous clustering (PAC) score, and Nbclust were synthetically used to determine the optimal number of clusters. A higher consensus score between two samples indicates they are more likely to be grouped into the same cluster in different iterations. The consensus values range from 0 (never clustered together) to 1 (always clustered together) marked by white to dark brown. In the CDF curve of a consensus matrix, the lower left portion represents sample pairs rarely clustered together, the upper right portion represents those almost always clustered together, whereas the middle segment represents those with ambiguous assignments in different clustering runs. The "proportion of ambiguous clustering" (PAC) measure quantifies this middle segment; and is defined as the fraction of sample pairs with consensus indices falling in the interval (u1, u2) ∈ [0, 1] where u1 is a value close to 0 and u2 is a value close to 1 (for instance u1=0.1 and u2=0.9). A low value of PAC indicates a flat middle segment, and a low rate of discordant assignments across permuted clustering runs. PAC for each K is CDFk(u2) - CDFk(u1)[2]. According to his criterion, we can therefore infer the optimal number of clusters by the K value having the lowest PAC. The Nbclust uses 26 mathematic criteria to select the optimal number.

Signature generated from machine learning based integrative approaches

To develop a consensus immune-related lncRNA signature (IRLS) with high accuracy and stability performance, we integrated 10 machine learning algorithms and 101 algorithm combinations. The integrative algorithms included random survival forest (RSF), elastic network (Enet), Lasso, Ridge, stepwise Cox, CoxBoost, partial least squares regression for Cox (plsRcox), supervised principal components (SuperPC), generalized boosted regression modeling (GBM), and survival support vector machine (survival-SVM). The RSF model was implemented via the randomForestSRC package. RSF had two parameters ntree and mtry, where ntree represented the number of trees in the forest and mtry was the number of randomly selected variables for splitting at each node. We used a grid-search on ntree and mtry using leave-one-out cross-validation (LOOCV) framework. All the pairs of (ntree, mtry) are formed and the one with the best C-index value is identified as the optimized parameters. The Enet, Lasso, and Ridge were implemented via the glmnet package. The regularization parameter, λ, was determined by LOOCV, whereas the L1-L2 trade-off parameter, α, was set to 0-1 (interval =0.1). The stepwise Cox model was implemented via survival package. A stepwise algorithm using the AIC (Akaike information criterion) was applied, and the direction mode of stepwise search was set to "both", "backward", and "forward", respectively. The CoxBoost model was implemented via CoxBoost package, which is used to fit a Cox proportional hazards model by componentwise likelihood-based boosting. For the CoxBoost model, we used LOOCV routine optimCoxBoostPenalty function to first determine the optimal penalty (amount of shrinkage). Once this parameter was determined, the other tuning parameter of the algorithm, namely, the number of boosting steps to perform, was selected via the function cv.CoxBoost. The dimension of the selected multivariate Cox model was finally set by the principal routine CoxBoost. The plsRcox model was implemented via plsRcox package. The cv.plsRcox function was used to determine the number of components requested, and the plsRcox function was applied to fit a partial least squares regression generalized linear model. The SuperPC model was implemented via superpc package, is a generalization of principal component analysis, which generates a linear combination of the features or variables of interest that capture the directions of largest variation in a dataset. The superpc.cv function used a form of LOOCV to estimate the optimal feature threshold in supervised principal components. To avoid problems with fitting Cox models to small validation datasets, it uses the "pre-validation" approach. The GBM model was implemented via superpc package. Using the LOOCV, the cv.gbm function selected index for number trees with minimum cross-validation error. The gbm function was used to fit the generalized boosted regression model. The survival-SVM model was implemented via survivalsvm package. The regression approach takes censoring into account when formulating the inequality constraints of the support vector problem.

Cells infiltration estimation

The single sample gene set enrichment analysis (ssGSEA) implemented in R package GSVA was employed to quantify the relative infiltration of 28 immune cells in CRC[22]. Six other algorithms including TIMER, quanTIseq, MCP-counter, xCell, EPIC, and ESTIMATE, were further performed to verify the stability and robustness of the ssGSEA results.

Weighted gene co-expression network analysis (WGCNA)

Co-expression lncRNA networks were generated using WGCNA package. An appropriate soft threshold β was calculated to meet the criteria for scale-free network. Further, the weighted adjacency matrix was converted into a topological overlap matrix (TOM), and the corresponding dissimilarity was generated (1-TOM). The dynamic tree cutting approach was employed to conduct the module identification. To recognize lncRNAs modules significantly corelated with immune cluster, the module that displayed the highest correlation was selected for further study. lncRNAs with both high gene significance (GS) and module membership (MM) were defined as immune-related lncRNAs.

ImmLnc analysis framework

ImmLnc is an integrated algorithm for identifying lncRNA modulators of immune-related pathways. First, the ESTIMATE algorithm was used to infer tumor purity. Second, we calculated the partial correlation coefficient (PCC) between a specific lncRNA and all mRNAs by adjusting the tumor purity as a covariable. Final, all mRNAs were ranked by the correlation coefficient with a specific lncRNA, and the ranked gene list was further subjected to GSEA procedure to investigate whether the immune genes were enriched in the top or bottom of the gene list. As recommended, lncRES scores >0.995 and FDR <0.05 were considered statistically significant[9, 19].