Consider TPOT your Data Science Assistant. TPOT is a Python tool that automatically creates and optimizes machine learning pipelines using genetic programming.
TPOT will automate the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
Figure: An example machine learning pipeline.
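To make the idea of an evolutionary pipeline search more concrete, here is a minimal sketch. It is not TPOT's implementation: it uses only mutation and keep-the-best-half selection, whereas genetic programming as used by TPOT works on richer pipeline structures. The operator names, the `toy_score` function, and every parameter below are hypothetical stand-ins. A real search would score each candidate by cross-validated accuracy on your data.

```python
import random

# Hypothetical building blocks; TPOT's real operator set is different and larger.
PREPROCESSORS = ["none", "scale", "pca", "select_k_best"]
MODELS = ["logistic_regression", "decision_tree", "random_forest", "knn"]


def random_pipeline(rng):
    """Draw a random (preprocessor, model, hyperparameter) candidate."""
    return (rng.choice(PREPROCESSORS), rng.choice(MODELS), rng.uniform(0.0, 1.0))


def mutate(pipeline, rng):
    """Re-draw or nudge one component of a candidate pipeline."""
    pre, model, param = pipeline
    which = rng.randrange(3)
    if which == 0:
        pre = rng.choice(PREPROCESSORS)
    elif which == 1:
        model = rng.choice(MODELS)
    else:
        # Nudge the hyperparameter and clamp it to [0, 1].
        param = min(1.0, max(0.0, param + rng.gauss(0.0, 0.1)))
    return (pre, model, param)


def toy_score(pipeline):
    """Made-up fitness standing in for cross-validated accuracy on real data."""
    pre, model, param = pipeline
    score = 0.5
    score += {"none": 0.0, "scale": 0.1, "pca": 0.05, "select_k_best": 0.02}[pre]
    score += {"logistic_regression": 0.2, "decision_tree": 0.1,
              "random_forest": 0.25, "knn": 0.15}[model]
    score += 0.1 * (1.0 - abs(param - 0.3))  # highest when param == 0.3
    return score


def evolve(score, generations=5, population_size=20, seed=0):
    """Run a tiny mutation-and-selection loop; return the best candidate found."""
    rng = random.Random(seed)
    population = [random_pipeline(rng) for _ in range(population_size)]
    history = []
    for _ in range(generations):
        # Selection: keep the better-scoring half of the population.
        population.sort(key=score, reverse=True)
        survivors = population[: population_size // 2]
        # Variation: refill the population with mutated copies of survivors.
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(population_size - len(survivors))]
        population = survivors + children
        history.append(max(score(p) for p in population))
    return max(population, key=score), history


best, history = evolve(toy_score)
print(best, history)
```

Because the survivors are carried over unchanged, the best score in `history` never decreases from one generation to the next, which is why running for more generations can only help the final result in this sketch.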
Once TPOT is finished searching (or you get tired of waiting), it provides you with the Python code for the best pipeline it found so you can tinker with the pipeline from there.
TPOT is built on top of scikit-learn, so all of the code it generates should look familiar... if you're familiar with scikit-learn, anyway.
TPOT is still under active development and we encourage you to check back on this repository regularly for updates.
For further information about TPOT, please see the project documentation.
Please see the repository license for the licensing and usage information for TPOT.
Generally, we have licensed TPOT to make it as widely usable as possible.
We maintain the TPOT installation instructions in the documentation. TPOT requires a working installation of Python.
TPOT can be used on the command line or with Python code. The documentation covers both modes of usage in more detail.
Below is a minimal working example using scikit-learn's built-in digits data set, a small MNIST-like practice data set.
from tpot import TPOT
from sklearn.datasets import load_digits
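# Note: scikit-learn 0.18+ provides train_test_split in sklearn.model_selection;
# the sklearn.cross_validation module was removed in scikit-learn 0.20.
from sklearn.cross_validation import train_test_split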
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)
tpot = TPOT(generations=5)
tpot.fit(X_train, y_train)
print(tpot.score(X_train, y_train, X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')
Running this code should discover a pipeline that achieves ~98% testing accuracy, and the corresponding Python code should be exported to the tpot_mnist_pipeline.py file and look similar to the following:
import numpy as np
import pandas as pd
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.linear_model import LogisticRegression
# NOTE: Make sure that the class is labeled 'class' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR')
training_indeces, testing_indeces = next(iter(StratifiedShuffleSplit(tpot_data['class'].values, n_iter=1, train_size=0.75, test_size=0.25)))
result1 = tpot_data.copy()
# Perform classification with a logistic regression classifier
lrc1 = LogisticRegression(C=2.8214285714285716)
lrc1.fit(result1.loc[training_indeces].drop('class', axis=1).values, result1.loc[training_indeces, 'class'].values)
result1['lrc1-classification'] = lrc1.predict(result1.drop('class', axis=1).values)
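Note that the exported script stores a prediction for every row, including the rows the classifier was trained on, so testing accuracy has to be measured on the held-out rows only. The following is a library-free sketch of that calculation. The helper name `holdout_accuracy` and the toy values are hypothetical; in the exported script the corresponding inputs would be the `class` column, the `lrc1-classification` column, and `testing_indeces`.

```python
def holdout_accuracy(true_labels, predicted_labels, testing_indices):
    """Fraction of held-out rows whose prediction matches the true label."""
    if not testing_indices:
        raise ValueError("testing_indices must not be empty")
    correct = sum(1 for i in testing_indices
                  if true_labels[i] == predicted_labels[i])
    return correct / len(testing_indices)


# Hypothetical toy values standing in for the 'class' column, the
# 'lrc1-classification' column, and the testing row indices.
true_labels = [0, 1, 1, 0, 2, 2, 1, 0]
predicted_labels = [0, 1, 0, 0, 2, 1, 1, 0]
testing_indices = [2, 5, 6, 7]

print(holdout_accuracy(true_labels, predicted_labels, testing_indices))  # 0.5
```

Scoring over all rows instead would mix training rows into the estimate and make the pipeline look better than it is on unseen data.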
We welcome you to check the existing issues for bugs or enhancements to work on. If you have an idea for an extension to TPOT, please file a new issue so we can discuss it.
Before submitting any contributions, please review our contribution guidelines.
Please check the existing open and closed issues to see if your issue has already been addressed. If it hasn't, file a new issue on this repository so we can review it.
If you use TPOT in a scientific publication, please consider citing at least one of the following papers:
R. S. Olson et al. Automating biomedical data science through tree-based pipeline optimization. In G. Squillero and P. Burelli, editors, Proceedings of the 18th European Conference on the Applications of Evolutionary and Bio-inspired Computation, Lecture Notes in Computer Science, Berlin, Germany, 2016. Springer-Verlag.
BibTeX entry:
@inproceedings{Olson2016EvoBIO,
author = {Olson, Randal S. and Urbanowicz, Ryan J. and Andrews, Peter C. and Lavender, Nicole A. and Kidd, La Creis and Moore, Jason H.},
title = {Automating biomedical data science through tree-based pipeline optimization},
booktitle = {Proceedings of the 18th European Conference on the Applications of Evolutionary and Bio-inspired Computation},
series = {Lecture Notes in Computer Science},
year = {2016},
location = {Porto, Portugal},
numpages = {16},
editor = {Squillero, G and Burelli, P},
publisher = {Springer-Verlag},
address = {Berlin, Germany}
}
TPOT was developed in the Computational Genetics Lab with funding from the NIH. We're incredibly grateful for their support during the development of this project.