
Fast, high-quality forecasts on relational and multivariate time-series data powered by new feature learning algorithms and automated ML.


getML - Automated Feature Engineering for Relational Data and Time Series

Introduction

getML is a tool for automating feature engineering on relational data and time series. It includes a database Engine that has been customized specifically for this purpose.

This results in a speedup of 60 to 1000 times over other open-source tools for automated feature engineering, such as featuretools and tsfresh (see Benchmarks). Also check out our demo notebooks for more comparisons.

Quick Start

As getML is available on PyPI, you can install it simply via

pip install getml
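
To quickly verify the installation, you can import the package and launch the Engine from Python. This is a minimal sketch; it assumes the package exposes a __version__ attribute, as is common for PyPI packages:

import getml

# Confirm the package is importable and print the installed version.
print(getml.__version__)

# Launch the getML Engine in the background.
getml.engine.launch()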

Check out the Example and the demo notebooks to get started with getML. A detailed walkthrough guide and the documentation will also help you on your way.

To learn how to build getML from source and contribute to the project, check out BUILD.md.

Key benefits for using getML

One key advantage over other tools like featuretools, tsfresh, and prophet is runtime performance. Our own implementation of propositionalization, FastProp (short for fast propositionalization), achieves run times that are about 60 to 1000 times faster (see specifically the FastProp benchmarks within our notebooks). This means faster iterations for data scientists, giving them more time to tweak variables and achieve even better results.

FastProp is not only faster, but can also provide increased accuracy.

For even better accuracy, getML provides advanced algorithms in its Professional and Enterprise feature sets, namely Multirel, Relboost, Fastboost, and RelMT.

The standard version includes preprocessors (like CategoryTrimmer, EmailDomain, Imputation, Mapping, Seasonal, Substring, TextFieldSplitter), predictors (like LinearRegression, LogisticRegression, XGBoostClassifier, XGBoostRegressor), and hyperparameter optimizers (like RandomSearch, LatinHypercubeSearch, GaussianHyperparameterSearch).
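
These components are instantiated in Python and combined in a pipeline, as in the following minimal sketch (based on the full Example further below; the constructor defaults used here are assumptions and may differ between versions):

import getml

# Instantiate some of the bundled components with their defaults.
seasonal = getml.preprocessors.Seasonal()
imputation = getml.preprocessors.Imputation()
fast_prop = getml.feature_learning.FastProp()
predictor = getml.predictors.XGBoostRegressor()

# Combine them in a pipeline; see the Example below for how the
# data model and the training data are defined.
pipe = getml.pipeline.Pipeline(
    preprocessors=[seasonal, imputation],
    feature_learners=[fast_prop],
    predictors=[predictor],
)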

It also gives access to the getML Monitor, which provides valuable information about projects, pipelines, features, important columns, accuracies, performance, and more. This information provides insights and helps you understand and improve your results.

getML can import data from various sources like CSV, pandas, JSON, SQLite, MySQL, MariaDB, PostgreSQL, Greenplum, and ODBC.
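
For example, data can be loaded into the Engine from a CSV file or from an in-memory pandas DataFrame. This is a minimal sketch: the file name and columns are placeholders, and the exact keyword arguments are assumptions based on the getML Python API:

import pandas as pd

import getml

getml.engine.launch()
getml.set_project("imports")

# Read a CSV file from disk into the getML Engine.
traffic_csv = getml.data.DataFrame.from_csv("traffic.csv", name="traffic_csv")

# Load a pandas DataFrame that is already in memory.
pandas_df = pd.DataFrame({"ds": ["2018-03-15 00:00:00"], "traffic_volume": [1439]})
traffic_pd = getml.data.DataFrame.from_pandas(pandas_df, name="traffic_pd")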

While the standard version is open source, can be run on your local machine, and comes with basic support via email and this repository, it must not be used for production purposes. The Professional and Enterprise versions, in contrast, allow production use, come with additional support via phone and chat, and offer training sessions, on-premise and cloud hosting, and export and deployment features. Get in contact via email or directly schedule a meeting.

Features generated by getML

getML generates features for relational data and time series. These include, but are not limited to:

  • Various aggregations, e.g. average, sum, minimum, maximum, quantiles, exponentially weighted moving average, trend, exponentially weighted trends, ...
  • Aggregations within a certain time frame, e.g. maximum in the last 30 days, minimum in the last 7 days
  • Seasonal factors from time stamps, such as month, day of the week, hour, ...
  • Seasonal aggregations, e.g. maximum for the same weekday as the prediction point, minimum for the same hour as the prediction point, ...

In other words, it generates the kind of features you would normally build manually, as in the sketch below. The difference is that getML automatically generates thousands of such features and then automatically picks the best, saving you a lot of manual work.
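
To make this concrete, here is how two of these features might be built by hand with pandas (a purely illustrative sketch on a hypothetical hourly data set; getML generates and selects such features automatically):

import pandas as pd

# Hypothetical hourly data with a time stamp index and one numerical column.
df = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=24 * 60, freq="h"),
    "traffic_volume": range(24 * 60),
}).set_index("ds")

# Aggregation within a time frame: minimum over the last 7 days.
min_last_7d = df["traffic_volume"].rolling("7D").min()

# Seasonal aggregation: mean for the same hour of the day.
mean_same_hour = df.groupby(df.index.hour)["traffic_volume"].transform("mean")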

Documentation

Check out the full documentation at https://getml.com/latest.

Benchmarks

We evaluated the performance of getML's FastProp algorithm against five other open-source tools for automated feature engineering on relational data and time series: tsflex, featuretools, tsfel, tsfresh and kats. The datasets used include:

  1. Air Pollution
    • Hourly data on air pollution and weather in Beijing, China.
  2. Interstate94
    • Hourly data on traffic volume on the Interstate 94 from Minneapolis to St. Paul.
  3. Dodgers
    • Five-minute measurements of traffic near Los Angeles, affected by games hosted by the LA Dodgers.
  4. Energy
    • Ten-minute measurements of the electricity consumption of a single household.
  5. Tetouan
    • Ten-minute measurements of the electricity consumption of three different zones in Tetouan City, northern Morocco.

The plots shown below show the runtime per feature, relative to the runtime per feature of the fastest approach. The fastest approach turns out to be getML's FastProp, so it is assigned the value 1.

We observe that, for all datasets, the features produced by the different tools are quite similar, but getML is 60 to 1000 times faster than the other open-source tools.

In fact, the speed-up is so large that a logarithmic scale is needed to even see the bar for getML.

To reproduce those results, refer to the benchmarks folder in this repository.

Demo notebooks

To experience getML in action, the following example notebooks are provided in the demo-notebooks directory:

| Notebook | Prediction Type | Population Size | Data Type | Target | Domain | Difficulty | Comments |
|---|---|---|---|---|---|---|---|
| adventure_works.ipynb | Classification | 19,704 | Relational | Churn | Customer loyalty | Hard | Good reference for a complex data model |
| formula1.ipynb | Classification | 31,578 | Relational | Win | Sports | Medium | |
| interstate94.ipynb | Regression | 24,096 | Time Series | Traffic | Transportation | Easy | Good notebook to get started on time series |
| loans.ipynb | Classification | 682 | Relational | Default | Finance | Easy | Good notebook to get started on relational data |
| robot.ipynb | Regression | 15,001 | Time Series | Force | Robotics | Medium | |
| seznam.ipynb | Regression | 1,462,078 | Relational | Volume | E-commerce | Medium | |

For an extensive list of demo and benchmark notebooks, have a look at the examples section of our docs or the notebook source repository.

Example

Here is how you can build a complete data science pipeline for a time series problem with seasonalities in just a few lines of code:

import getml

# Launch the getML Engine and create a project.
getml.engine.launch()
getml.set_project("interstate94")

# Load the data.
traffic = getml.datasets.load_interstate94(roles=False, units=False)

# Set the roles, so getML knows what you want to predict
# and which columns you want to use.
traffic.set_role("ds", getml.data.roles.time_stamp)
traffic.set_role("holiday", getml.data.roles.categorical)
traffic.set_role("traffic_volume", getml.data.roles.target)

# Generate a train/test split using 2018/03/15 as the cutoff date.
split = getml.data.split.time(traffic, "ds", test=getml.data.time.datetime(2018, 3, 15))

# Set up the data:
# - We want to predict the traffic volume for the next hour.
# - We want to use data from the seven days before the reference date.
# - We want to use lagged targets (autocorrelated features are allowed).
time_series = getml.data.TimeSeries(
    population=traffic,
    split=split,
    time_stamps="ds",
    horizon=getml.data.time.hours(1),
    memory=getml.data.time.days(7),
    lagged_targets=True,
)

# The Seasonal preprocessor extracts seasonal
# components from the time stamps.
seasonal = getml.preprocessors.Seasonal()

# FastProp extracts features from the time series.
fast_prop = getml.feature_learning.FastProp(
    loss_function=getml.feature_learning.loss_functions.SquareLoss,
    num_threads=1,
    num_features=20,
)

# Use XGBoost for the predictions (it comes out-of-the-box).
predictor = getml.predictors.XGBoostRegressor()

# Combine them all in a pipeline.
pipe = getml.pipeline.Pipeline(
    tags=["memory: 7d", "horizon: 1h", "fast_prop"],
    data_model=time_series.data_model,
    preprocessors=[seasonal],
    feature_learners=[fast_prop],
    predictors=[predictor],
)

# Fit on the train set and evaluate on the test set.
pipe.fit(time_series.train)
pipe.score(time_series.test)
predictions = pipe.predict(time_series.test)

To see the full example, check out the Interstate94 notebook (interstate94.ipynb).
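
After fitting, you can also inspect what the pipeline has learned. The following sketch assumes that the pipeline exposes a features container with a to_sql() method, as described in the getML documentation:

# Overview of the learned features (assumes the features container
# is available in your getML version).
print(pipe.features)

# The features transpiled to SQL for inspection or manual deployment.
print(pipe.features.to_sql())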

Release Notes

See CHANGELOG.md for release notes.