iteremoval
The package provides a flexible algorithm to iteratively screen features in consideration of overfitting and overall performance. Two distinct groups of observations are required. It is originally tailored for methylation locus screening of NGS data, and it can also be used as a generic method for feature selection. Each step of the algorithm provides a default function for simple implemention, and it is able to be replaced by a user defined method.
Install
To install this package, start R and enter:
# try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("iteremoval")
An example
We identified 299 genomic regions related to the methylation status of a disease. Then, we sequenced the cfDNA of 44 subjects. 22 were malignant, while others were normal. We built a statistical model to compute the probability of the disease for each region and subject. Thus, we use the probability data to demonstrate how the package works.
The input files are in either format:
-
Two datasets,
SWRG1
(malignant individuals) andSWRG0
(normal individuals). Each row of the two datasets indicates a genomic region, and the columns are the probabilities. -
A
SummarizedExperiment::
SummarizedExperiment object (SummarizedData
) and a logical vector (SummarizedData$Group==0
) to distinguish normal samples from all samples. Note: Name 'Group' is not required.
We expect that higher the value means higher the probability of having a disease. However, we discover that not all genes are related to the disease. Therefore, we use the package iteremoval
to select the gene locus with high probability in malignant samples and low probability in normal samples.
iteremoval
is oriented to find a dataset to classify two different groups, and the feature that removed in each iteration is comprehensively considered according to all observations. If the observations have subtypes, the process might generate a feature set favoring all subtypes with the default settings. You can also define the function that each step uses to meet your requirement.
Removing features
library(iteremoval)
# input two datasets
removal.stat <- feature_removal(SWRG1, SWRG0, cutoff1=0.95, cutoff0=0.95,
offset=c(0.25, 0.5, 2, 4))
# input SummarizedExperiment object
removal.stat <- feature_removal(SummarizedData, SummarizedData$Group==0,
cutoff1=0.95, cutoff0=0.95,
offset=c(0.25, 0.5, 2, 4))
It is the core function of the package. The first four parameters are required. A vector of offset
means doing the whole computational process with different offsets respectively, and you can also define only one numeric number to offset
. To get a overall information of features, the iteration will not stop until there is no feature to remove.
- You can also lower
cutoff0
and highercutoff1
to reduce overfitting. Why? You can type?feature_removal
to see the whole algorithm.
Ploting the iteration trace of removed features' scores
It is useful to visulize how each feature is removed in the iterative process. The package provides a way to plot the scores of features being removed in each iteration.
ggiteration_trace(removal.stat) + theme_bw()
ggiteration_trace
returns a ggplot2 object, so +
is used in the same way as ggplot2
package.
X-axis is the iteration index, and Y-axis is the score of the feature being removed, and the legend is the offset you passed to feature_removal
. Normally, you can stop removing features from the index of which the scores fluctuate drastically.
Generating the feature list
Once you confirm the cutoff of iteration index, you can generate the feature list. Since using a vector of offset
and the features are removed with different cutoff
seperately, the remaining feature lists are not unique for multiple cutoffs. Thus, we compute the feature prevalence for the feature lists, using:
features <- feature_prevalence(removal.stat, index=255, hist.plot=TRUE)
features
Screening the features with prevalence
At last, you can see the prevalence histogram of the features because of multiple offsets. You can specify a cutoff for the prevalence, and output the final feature list.
feature_screen(features, prevalence=4)