Dummy variable generation with fit/transform capabilities

Primary language: Python. License: MIT.

get_smarties

Like pd.get_dummies... but smarter.

The problem

When working with a categorical dataset, most people use the pandas.get_dummies function for easy dummy variable generation. This works well until you have to compare two subsets of your dataset (as in prediction). If a subset doesn't contain a row for each possible value of some feature, the resulting dummy-encoded datasets will have different shapes.

For example, say we have a dataset with a 'gender' column that has two possible values: Male and Female.

     ...  gender
1    ...  Male
2    ...  Female
3    ...  Male

The pd.get_dummies function would give you:

     ...  gender_Male  gender_Female
1    ...  1            0
2    ...  0            1
3    ...  1            0

But now, say we have another instance and do some machine learning voodoo to predict their gender. Say we predict a male. get_dummies would give:

     ...  gender_Male
1    ...  1

Since pandas never saw a Female in this subset, it only generates a column for Male. As a result, your new and original samples have different shapes, which causes all kinds of trouble when, for example, computing loss.
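The mismatch described above can be reproduced with pandas alone. This is a minimal sketch; the two small DataFrames are made-up illustration data, not part of this project.

```python
import pandas as pd

# Made-up illustration data: a "training" subset and a single new sample.
train = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
new = pd.DataFrame({"gender": ["Male"]})

train_dummies = pd.get_dummies(train)
new_dummies = pd.get_dummies(new)

# The training subset yields one column per observed value;
# the new sample only yields a column for the value it contains.
print(sorted(train_dummies.columns))
print(sorted(new_dummies.columns))
print(train_dummies.shape, new_dummies.shape)
```

Because each call only sees the values present in its own input, the column sets differ whenever a subset is missing a category.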

See more discussion of this issue at this thread.

The solution

get_smarties allows you to easily generate dummy variables while persisting the possible values under each category for you. You can use conventional fit_transform and transform methods and solve this problem with virtually no additional effort, like so:

from get_smarties import Smarties
gs = Smarties()

# generate dummies on original dataset, store values for later
X = gs.fit_transform(data)

# generate more dummies on new sample using previously stored values
Y = gs.transform(prediction)
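To illustrate the general idea of remembering categories between calls, here is a rough sketch using only pandas. It is not the get_smarties implementation: the class name RememberedDummies and the sample data are hypothetical, and the approach (storing the dummy columns at fit time and reindexing at transform time) is just one assumed way to get consistent shapes.

```python
import pandas as pd


class RememberedDummies:
    """Hypothetical helper for illustration; not part of get_smarties."""

    def fit_transform(self, df):
        dummies = pd.get_dummies(df, dtype=int)
        # Remember which dummy columns were generated on the original data.
        self.columns_ = dummies.columns
        return dummies

    def transform(self, df):
        dummies = pd.get_dummies(df, dtype=int)
        # Align to the remembered columns; unseen categories become all-zero columns.
        return dummies.reindex(columns=self.columns_, fill_value=0)


# Made-up illustration data.
data = pd.DataFrame({"gender": ["Male", "Female", "Male"]})
prediction = pd.DataFrame({"gender": ["Male"]})

rd = RememberedDummies()
X = rd.fit_transform(data)
Y = rd.transform(prediction)

print(list(X.columns) == list(Y.columns))
```

With the columns aligned this way, the new sample has the same width as the original data even though it never contained a Female row.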

Pipelines

Because get_smarties has fit/transform capabilities, you can even inject your dummy variable creation directly into sklearn pipelines:

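from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from get_smarties import Smarties

# dummy variable generation runs as the first pipeline step
training_pipeline = Pipeline([
    ('smarties', Smarties()),
    ('clf', MultinomialNB()),
])

training_pipeline.fit(data, labels)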

See a working example with k-fold cross validation at kfold-pipeline-demo.ipynb.

Setup

With pip, simply run:

pip install -e git+https://github.com/joeddav/get_smarties.git#egg=get_smarties