iteremoval

The package provides a flexible algorithm to iteratively screen features in consideration of overfitting and overall performance. Two distinct groups of observations are required. It is originally tailored for methylation locus screening of NGS data, and it can also be used as a generic method for feature selection. Each step of the algorithm provides a default function for simple implemention, and it is able to be replaced by a user defined method.

Install

To install this package, start R and enter:

# try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("iteremoval")

An example

We identified 299 genomic regions related to the methylation status of a disease. Then, we sequenced the cfDNA of 44 subjects. 22 were malignant, while others were normal. We built a statistical model to compute the probability of the disease for each region and subject. Thus, we use the probability data to demonstrate how the package works.

The input files are in either format:

  1. Two datasets, SWRG1(malignant individuals) and SWRG0(normal individuals). Each row of the two datasets indicates a genomic region, and the columns are the probabilities.

  2. A SummarizedExperiment::SummarizedExperiment object (SummarizedData) and a logical vector (SummarizedData$Group==0) to distinguish normal samples from all samples. Note: Name 'Group' is not required.

We expect that higher the value means higher the probability of having a disease. However, we discover that not all genes are related to the disease. Therefore, we use the package iteremoval to select the gene locus with high probability in malignant samples and low probability in normal samples.

iteremoval is oriented to find a dataset to classify two different groups, and the feature that removed in each iteration is comprehensively considered according to all observations. If the observations have subtypes, the process might generate a feature set favoring all subtypes with the default settings. You can also define the function that each step uses to meet your requirement.

Removing features

library(iteremoval)
# input two datasets
removal.stat <- feature_removal(SWRG1, SWRG0, cutoff1=0.95, cutoff0=0.95, 
								 offset=c(0.25, 0.5, 2, 4))
# input SummarizedExperiment object
removal.stat <- feature_removal(SummarizedData, SummarizedData$Group==0, 
								 cutoff1=0.95, cutoff0=0.95, 
								 offset=c(0.25, 0.5, 2, 4))

It is the core function of the package. The first four parameters are required. A vector of offset means doing the whole computational process with different offsets respectively, and you can also define only one numeric number to offset. To get a overall information of features, the iteration will not stop until there is no feature to remove.

  • You can also lower cutoff0 and higher cutoff1 to reduce overfitting. Why? You can type ?feature_removal to see the whole algorithm.

Ploting the iteration trace of removed features' scores

It is useful to visulize how each feature is removed in the iterative process. The package provides a way to plot the scores of features being removed in each iteration.

ggiteration_trace(removal.stat) + theme_bw()

ggiteration_trace returns a ggplot2 object, so + is used in the same way as ggplot2 package.

X-axis is the iteration index, and Y-axis is the score of the feature being removed, and the legend is the offset you passed to feature_removal. Normally, you can stop removing features from the index of which the scores fluctuate drastically.

Generating the feature list

Once you confirm the cutoff of iteration index, you can generate the feature list. Since using a vector of offset and the features are removed with different cutoff seperately, the remaining feature lists are not unique for multiple cutoffs. Thus, we compute the feature prevalence for the feature lists, using:

features <- feature_prevalence(removal.stat, index=255, hist.plot=TRUE)
features

Screening the features with prevalence

At last, you can see the prevalence histogram of the features because of multiple offsets. You can specify a cutoff for the prevalence, and output the final feature list.

feature_screen(features, prevalence=4)