scikit-partial

Pipeline components that support partial_fit.

The goal of scikit-partial is to offer a pipeline that can run partial_fit. This allows for online learning across an entire pipeline.

Installation

You can install everything with pip:

python -m pip install --upgrade pip
python -m pip install scikit-partial

Usage

Assuming that you use a stateless featurizer in your pipeline, such as HashingVectorizer or language models from whatlies, you can pre-train your scikit-learn model beforehand and fine-tune it later using models that offer the .partial_fit() API. If you're unfamiliar with this API, you might appreciate this course on calmcode.

import pandas as pd
from sklearn.linear_model import SGDClassifier
from sklearn.feature_extraction.text import HashingVectorizer

from skpartial.pipeline import make_partial_pipeline

url = "https://raw.githubusercontent.com/koaning/icepickle/main/datasets/imdb_subset.csv"
df = pd.read_csv(url)
X, y = list(df['text']), df['label']

# Construct a pipeline with components that are `.partial_fit()` compatible
# Note: scikit-learn 1.1+ names the logistic loss "log_loss"; older releases use "log"
pipe = make_partial_pipeline(HashingVectorizer(), SGDClassifier(loss="log_loss"))

# Run the learning algorithm on batches of data
for i in range(10):
    # We could also do a whole bunch of data augmentation here!
    pipe.partial_fit(X, y, classes=[0, 1])

When is this pattern useful? Let's consider spelling errors. Suppose that we'd like our algorithm to be robust against typos. Then we can simulate typos on our X inside of our learning loop.
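As a rough illustration of that idea, the sketch below corrupts text by randomly swapping adjacent characters. The helper `add_typos` and its `rate` and `seed` parameters are hypothetical names made up for this example; they are not part of scikit-partial, and real typo simulation would likely need a richer noise model.

```python
import random


def add_typos(texts, rate=0.05, seed=None):
    """Return copies of `texts` with adjacent characters randomly swapped.

    Hypothetical helper for illustration; it is not part of scikit-partial.
    """
    rng = random.Random(seed)
    noisy = []
    for text in texts:
        chars = list(text)
        i = 0
        while i < len(chars) - 1:
            if rng.random() < rate:
                # Swap this character with its right-hand neighbour, then
                # skip ahead so the same character is not moved twice.
                chars[i], chars[i + 1] = chars[i + 1], chars[i]
                i += 2
            else:
                i += 1
        noisy.append("".join(chars))
    return noisy


reviews = ["a wonderful little film", "dull and far too long"]
noisy_reviews = add_typos(reviews, rate=0.2, seed=42)

# Inside the learning loop, each pass could then see a fresh corruption:
#     pipe.partial_fit(add_typos(X, rate=0.05), y, classes=[0, 1])
```

Because the featurizer is stateless, the noisy text can be passed straight to `partial_fit` without refitting anything upstream of the classifier.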

Supported Components

The library provides the following pipeline components.

from skpartial.pipeline import (
    PartialPipeline,
    PartialFeatureUnion,
    make_partial_pipeline,
    make_partial_union,
)

These tools allow you to declare pipelines that support .partial_fit(). Note that components used in these pipelines all need to have .partial_fit() implemented.
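One way to catch an incompatible component early is to check each step for a `partial_fit` method before building the pipeline. The sketch below is only an illustration: `missing_partial_fit` is a hypothetical helper, and the two small classes are stand-ins for real estimators so that the example runs without any dependencies.

```python
def missing_partial_fit(steps):
    """Return the class names of steps that lack a callable `partial_fit`."""
    return [
        type(step).__name__
        for step in steps
        if not callable(getattr(step, "partial_fit", None))
    ]


class IncrementalStep:
    """Stand-in for an estimator that can learn batch by batch."""

    def partial_fit(self, X, y=None, **kwargs):
        return self


class BatchOnlyStep:
    """Stand-in for an estimator that only offers a full `fit`."""

    def fit(self, X, y=None):
        return self


print(missing_partial_fit([IncrementalStep(), BatchOnlyStep()]))
# prints ['BatchOnlyStep']
```

In practice you would pass the actual scikit-learn components you intend to combine, and an empty list would suggest that they are all candidates for a partial pipeline.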