Perform k-fold cross-validation to tune the following parameters of a random forest model.
- ntree: 10
- mtry: 75
- maxnodes: 20
k_fold(k, './data/Archaeal_tfpssm.csv', 'performance.csv')
- Divide the data into k parts; the number of parts used by each data set is (training, validation, testing) = (k-2, 1, 1).
- The table below shows an example of 5-fold cross-validation.
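The fold assignment described above can be sketched as follows. This is a minimal illustration, not the required solution: the row count `n` and the rotation rule (the fold after the test fold serves as validation) are assumptions for demonstration.

```r
# Assign each row to one of k folds, then rotate which folds
# serve as validation and test on each iteration.
k <- 5
n <- 100                               # pretend we have 100 rows
fold_id <- rep(1:k, length.out = n)    # fold label for every row

for (i in 1:k) {
  test_fold  <- i
  valid_fold <- i %% k + 1             # the "next" fold is validation
  test_idx  <- which(fold_id == test_fold)
  valid_idx <- which(fold_id == valid_fold)
  train_idx <- which(!fold_id %in% c(test_fold, valid_fold))
  # train on k-2 folds, tune on 1, evaluate on 1
  stopifnot(length(train_idx) + length(valid_idx) + length(test_idx) == n)
}
```

Each fold appears exactly once as the test set across the k iterations, so every row is tested once.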
📁 Archaeal_tfpssm.csv download
This CSV doesn't contain a header. The columns are as follows:
- V2: labels of proteins
  - CP: Cytoplasmic
  - CW: Cell Wall
  - EC: Extracellular
  - IM: Inner membrane
- V3 ~ V5602: the gapped-dipeptide features of each protein
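Loading a header-less CSV with this layout looks like the sketch below. The file written here is a tiny fabricated stand-in with the same column structure (the real data is the download above); only two feature columns are used for brevity.

```r
# A toy stand-in for Archaeal_tfpssm.csv: no header,
# V2 = class label, remaining columns = numeric features.
writeLines(c("p1,CP,0.1,0.2",
             "p2,EC,0.3,0.4"), "toy_tfpssm.csv")

tmp <- read.csv("toy_tfpssm.csv", header = FALSE)
tmp$V2 <- as.factor(tmp$V2)   # labels: CP / CW / EC / IM
table(tmp$V2)                 # class distribution
dim(tmp)                      # rows x columns
```

With the real file, `read.csv('./data/Archaeal_tfpssm.csv', header = FALSE)` yields 5602 columns, with V3 through V5602 holding the gapped-dipeptide features.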
- accuracy = P/N, where P is the number of correctly predicted proteins and N is the total number of proteins; report the average over the k folds.
| set | training | validation | test |
|---|---|---|---|
| fold1 | 0.93 | 0.91 | 0.88 |
| fold2 | 0.92 | 0.91 | 0.89 |
| fold3 | 0.94 | 0.92 | 0.90 |
| fold4 | 0.91 | 0.89 | 0.87 |
| fold5 | 0.90 | 0.92 | 0.87 |
| ave. | 0.92 | 0.91 | 0.88 |
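The `ave.` row is simply the per-column mean of the five folds, which can be checked directly against the table above:

```r
# Accuracy per fold (training, validation, test), from the table above.
acc <- data.frame(
  training   = c(0.93, 0.92, 0.94, 0.91, 0.90),
  validation = c(0.91, 0.91, 0.92, 0.89, 0.92),
  test       = c(0.88, 0.89, 0.90, 0.87, 0.87)
)
round(colMeans(acc), 2)   # training 0.92, validation 0.91, test 0.88
```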
```r
library(randomForest)

k_fold <- function(fold, input_file, output_file){
  # read the header-less data; V2 = label, V3~V5602 = features
  tmp <- read.csv(input_file, header = FALSE)
  tmp$V2 <- as.factor(tmp$V2)
  # TODO: split the data into `fold` parts as described above
  # model using random forest & the tuned parameters
  model <- randomForest(V2 ~ ., data = tmp[, -1],
                        ntree = 10, mtry = 75, maxnodes = 20)
  # make the confusion-matrix table (OOB predictions)
  resultframe <- data.frame(truth = tmp$V2,
                            pred = predict(model, type = "class"))
  # output the confusion matrix
  write.csv(table(resultframe), output_file, row.names = FALSE)
  return(model)
}
```
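The three tuned parameters can be tried out on a small built-in dataset before running on the protein data. Here `iris` is a stand-in, and `mtry` is scaled down to 2 because iris has only 4 predictors (the task's `mtry = 75` assumes the 5600 gapped-dipeptide features):

```r
library(randomForest)

set.seed(42)  # reproducible forest
# Same three knobs the task tunes, scaled down for iris.
model <- randomForest(Species ~ ., data = iris,
                      ntree = 10, mtry = 2, maxnodes = 20)
pred <- predict(model, type = "class")  # out-of-bag class predictions
acc  <- mean(pred == iris$Species)      # OOB accuracy
```

On the real data, the equivalent call is `k_fold(5, './data/Archaeal_tfpssm.csv', 'performance.csv')` with the parameters from the task statement.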
Please list the code and its references.
If needed, explain the details, e.g., with a comment like: # ChatGPT, response to “your prompt,” February 16, 2023.
Data Set:
- Chang J-M, et al. (2013) Efficient and interpretable prediction of protein functional classes by correspondence analysis and compact set relations. PLoS ONE 8:e75542.
- Chang J-M, Su EC-Y, Lo A, Chiu H-S, Sung T-Y, & Hsu W-L (2008) PSLDoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins: Structure, Function, and Bioinformatics 72(2):693-710.
Code: