EpistasisLab/scikit-mdr

Confusion with documentation and MDR feature construction output

jay-reynolds opened this issue · 4 comments

Hi, in the first example in the README, it states:

"For example, MDR can be used to construct a new feature composed from two existing features:"

but "GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1" used in the example has 21 columns, not 2.

The resulting output is a single column, i.e. a single feature. Is a single feature produced because all of those feature columns were collapsed into one, or because only 2 features from the DataFrame were selected and used to construct the new feature? Or is there another reason?

Thanks in advance! I will continue reading the MDR paper I found on PubMed in the meantime.

From the paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3500181/

"MDR pools genotypes into 'high-risk' and 'low-risk' or 'response' and 'non-response' groups in order to reduce multidimensional data into only one dimension."

And from the abstract (paper behind paywall): https://www.ncbi.nlm.nih.gov/pubmed/16457852

"To address this problem, we have previously developed a multifactor dimensionality reduction (MDR) method for collapsing high-dimensional genetic data into a single dimension (i.e. constructive induction) thus permitting interactions to be detected in relatively small sample sizes."

I suppose that answers my question.

Closing ticket.

Hi @jay-reynolds! I wanted to clarify this for you. In the example from the README:

from mdr import MDR
import pandas as pd

# Load the GAMETES example dataset (20 SNP features plus a 'class' label column)
genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-mdr/raw/development/data/GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz', sep='\t', compression='gzip')

features = genetic_data.drop('class', axis=1).values
labels = genetic_data['class'].values

# Fit MDR on all 20 features, then collapse them into one constructed feature
my_mdr = MDR()
my_mdr.fit(features, labels)
my_mdr.transform(features)
>>>array([[1],
>>>       [1],
>>>       [1],
>>>       ...,
>>>       [0],
>>>       [0],
>>>       [0]])

We are taking all of the features from the dataset (20 features in total) and constructing a single new feature from them. This is not a typical use of MDR, but it still works in this case because the example dataset is a fairly "easy" dataset for MDR.

Typically we use MDR in one of two ways:

  1. We know exactly which features we want to perform feature construction on, so we subset the DataFrame down to those features and provide only those features to MDR. The regression example in the README demonstrates this case; see also the first sketch after this list.

  2. We don't know which features we want to perform feature construction on, so we perform an exhaustive combinatorial search over all possible feature combinations (typically tuples of 2 or 3 features), provide each of those tuples to MDR separately, and choose the best tuple(s) according to some MDR quality metric (typically 10-fold CV accuracy). See the second sketch after this list.
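For case 1, a minimal sketch of what that subsetting could look like, reusing the same dataset as above. The column names 'SNP_A' and 'SNP_B' are placeholders for the two features you already know you care about, not actual column names in the GAMETES file:

from mdr import MDR
import pandas as pd

genetic_data = pd.read_csv('https://github.com/EpistasisLab/scikit-mdr/raw/development/data/GAMETES_Epistasis_2-Way_20atts_0.4H_EDM-1_1.tsv.gz', sep='\t', compression='gzip')

# Keep only the two columns we want to construct a feature from
# ('SNP_A' and 'SNP_B' are placeholder names -- check genetic_data.columns)
two_features = genetic_data[['SNP_A', 'SNP_B']].values
labels = genetic_data['class'].values

my_mdr = MDR()
my_mdr.fit(two_features, labels)
new_feature = my_mdr.transform(two_features)  # a single constructed column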
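For case 2, a rough sketch of the exhaustive pairwise search, again reusing the genetic_data DataFrame from the example above. Only MDR's fit/transform calls (shown in the README example) are assumed; the cv_accuracy helper and the scikit-learn cross-validation around it are just illustrative scaffolding, not part of scikit-mdr:

from itertools import combinations

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

from mdr import MDR

feature_names = [c for c in genetic_data.columns if c != 'class']
labels = genetic_data['class'].values

def cv_accuracy(pair, n_splits=10):
    """10-fold CV accuracy of the MDR feature constructed from one pair of columns."""
    X = genetic_data[list(pair)].values
    scores = []
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(X, labels):
        mdr = MDR()
        mdr.fit(X[train_idx], labels[train_idx])
        constructed = mdr.transform(X[test_idx]).ravel()
        scores.append(accuracy_score(labels[test_idx], constructed))
    return np.mean(scores)

# Exhaustive search over all 2-way combinations (190 pairs for 20 features)
best_pair = max(combinations(feature_names, 2), key=cv_accuracy)
print(best_pair, cv_accuracy(best_pair))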

Thank you for the explanation, very much appreciated!

I've got TPOT going, so I think I'll give TPOT-MDR a go and see what it comes up with.

Have you tried using, say, hyperopt for combinatorial search instead of brute force or evolutionary methods?

Have you tried using, say, hyperopt for combinatorial search instead of brute force or evolutionary methods?

We haven't tried that, but would be very curious to see a demo of it!
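For anyone curious, a rough sketch of what such a hyperopt-based search could look like: it simply swaps the exhaustive loop above for hyperopt's fmin/tpe over the same set of feature pairs, and cv_accuracy is the hypothetical scoring helper from the earlier sketch (not something the scikit-mdr authors have tried or endorse):

from itertools import combinations

from hyperopt import fmin, hp, tpe, space_eval

all_pairs = list(combinations(feature_names, 2))
space = hp.choice('pair', all_pairs)

def objective(pair):
    # hyperopt minimizes, so return 1 - accuracy as the loss
    return 1.0 - cv_accuracy(pair)

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(space_eval(space, best))  # the best pair found within the evaluation budget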