SDSJ AutoML is an AutoML (automatic machine learning) competition aimed at developing machine learning systems for processing banking datasets: transactions, time series, and classic tabular data from real banking operations. The system handles processing automatically, including model selection, architecture, hyper-parameters, etc.
Dmitriy Kulagin, Yauheni Kachan, Nastassia Smolskaya, Vadim Yermakov
- Drop constant columns
- Add time-shifted columns
- Features from datetime columns (year, weekday, month, day)
- Smoothed target encoding (Semenov encoding) for `string` and `id` columns and, if the dataset has more than 1000 rows, also for `numeric` columns with fewer than 31 unique values
- If the dataset is larger than 250 MB, convert the data to the `np.float32` data type
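The smoothed target encoding step in the list above can be sketched as follows. This is a minimal pure-Python illustration, not the team's code: the function name `smoothed_target_encode` and the smoothing weight `alpha` are assumptions, and the formula is the commonly used additive smoothing `(category_sum + alpha * global_mean) / (category_count + alpha)`.

```python
from collections import defaultdict


def smoothed_target_encode(categories, targets, alpha=10.0):
    """Map each category to a smoothed mean of the target.

    encoding = (sum_of_targets_in_category + alpha * global_mean)
               / (count_in_category + alpha)

    `alpha` is a hypothetical smoothing weight: larger values pull rare
    categories closer to the global mean.
    """
    if len(categories) != len(targets):
        raise ValueError("categories and targets must have the same length")
    if len(targets) == 0:
        return {}

    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    return {
        category: (sums[category] + alpha * global_mean) / (counts[category] + alpha)
        for category in counts
    }


# Toy example: "a" is frequent, "b" appears only once.
cats = ["a", "a", "a", "b"]
ys = [1.0, 1.0, 0.0, 0.0]
encoding = smoothed_target_encode(cats, ys, alpha=1.0)
encoded_column = [encoding[c] for c in cats]
```

In practice the mapping would be fitted on training rows only and then applied to validation and test rows, so that the encoded feature does not leak the target.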
If the dataset has fewer than 1000 rows (e.g. the first dataset) and the problem is regression, train a linear model; otherwise train gradient boosting model(s).
- Fill missing values with mean
- Transform data with QuantileTransformer
- Train Lasso with regularization term alpha = 0.1
- Search for the best alpha for Lasso and Ridge and select the better of the two:
  - Cross-validation: `time_series_split.TimeSeriesCV` with `min(6, number_of_rows / 30)` folds if `datetime_0` is in the dataset, otherwise KFold with `3` folds
  - Grid search alpha for Lasso in the range `np.logspace(-2, 0, n_points)`, where `n_points` is the minimum of 35 and the estimated number of times the model can be trained on all folds
  - Grid search alpha for Ridge in the range `np.logspace(-2, 2, n_points)` if, by that estimate, the model can be trained more than 2 times on all folds
- If the grid search succeeds, select the better of Lasso and Ridge; otherwise use the Lasso with alpha = 0.1 trained earlier
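The linear-model steps above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the team's implementation: scikit-learn's `TimeSeriesSplit` stands in for the project's own `time_series_split.TimeSeriesCV`, the fold count is floored at 2 so the splitter is always valid, `n_points` is passed in directly instead of being estimated from the remaining time, and the function name, `n_quantiles`, `max_iter` and the MSE scoring are all hypothetical choices.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, KFold, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer


def search_linear_model(X, y, has_datetime=False, n_points=5, n_quantiles=20):
    """Grid-search alpha for Lasso and Ridge and return the better model."""
    n_rows = X.shape[0]
    if has_datetime:
        # Stand-in for time_series_split.TimeSeriesCV (assumption).
        n_folds = max(2, min(6, n_rows // 30))
        cv = TimeSeriesSplit(n_splits=n_folds)
    else:
        cv = KFold(n_splits=3)

    candidates = {
        "lasso": (Lasso(max_iter=10000), np.logspace(-2, 0, n_points)),
        "ridge": (Ridge(), np.logspace(-2, 2, n_points)),
    }

    best_name, best_search = None, None
    for name, (model, alphas) in candidates.items():
        pipeline = Pipeline([
            ("impute", SimpleImputer(strategy="mean")),  # fill missing values with the mean
            ("quantile", QuantileTransformer(n_quantiles=n_quantiles)),
            ("model", model),
        ])
        search = GridSearchCV(
            pipeline,
            {"model__alpha": alphas},
            cv=cv,
            scoring="neg_mean_squared_error",  # assumed metric
        )
        search.fit(X, y)
        if best_search is None or search.best_score_ > best_search.best_score_:
            best_name, best_search = name, search
    return best_name, best_search


# Small synthetic regression problem with a few missing values.
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 4))
X[::10, 0] = np.nan
y = 2.0 * np.nan_to_num(X[:, 0]) - X[:, 1] + rng.normal(scale=0.1, size=90)

best_name, best_search = search_linear_model(X, y)
predictions = best_search.predict(X)
```

Keeping the imputer and the quantile transform inside the pipeline means they are refitted on each training fold, so the cross-validation scores are not affected by statistics from the held-out fold.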
- Train a few iterations of XGBoost with 700 trees (`early_stopping_rounds=20`), continuing while there is enough time for the next iteration or until early stopping is reached
- If XGBoost training takes less than two-thirds of the available time, train LightGBM with 5000 trees (`early_stopping_rounds=20`)
- If both XGBoost and LightGBM were trained successfully, stack them with Logistic Regression or Ridge, depending on the prediction problem
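The time-budgeted training loop described in the gradient boosting steps can be sketched as below. It shows the control flow only and makes no XGBoost or LightGBM calls: `train_within_budget`, `train_step`, `FakeClock` and the rule "stop when another iteration as long as the longest one so far would not fit" are all assumptions made for illustration.

```python
import time


def train_within_budget(train_step, time_budget_s, max_iterations=10,
                        clock=time.monotonic):
    """Call `train_step(state)` repeatedly while the budget allows.

    `train_step` is a hypothetical callable that continues training from
    `state` (None on the first call) and returns (new_state, stopped_early).
    Returns the final state, the number of iterations and the time used.
    """
    start = clock()
    state = None
    longest = 0.0
    iterations = 0
    while iterations < max_iterations:
        elapsed = clock() - start
        # Stop if another iteration as long as the longest one would not fit.
        if iterations > 0 and elapsed + longest > time_budget_s:
            break
        step_start = clock()
        state, stopped_early = train_step(state)
        longest = max(longest, clock() - step_start)
        iterations += 1
        if stopped_early:
            break
    return state, iterations, clock() - start


class FakeClock:
    """Simulated clock so the example runs instantly and deterministically."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


fake_clock = FakeClock()


def fake_boosting_step(n_trees):
    # Pretend each iteration adds 700 trees and takes 3 seconds.
    fake_clock.now += 3.0
    return (n_trees or 0) + 700, False


budget_s = 10.0
n_trees, n_iterations, used_s = train_within_budget(
    fake_boosting_step, budget_s, clock=fake_clock
)
# Second model is trained only if the first used less than two-thirds of the budget.
train_second_model = used_s < (2.0 / 3.0) * budget_s
```

In a real run `train_step` would continue boosting from the previous model and report whether early stopping fired, and the default `time.monotonic` clock would be used instead of the simulated one.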
Public datasets for local validation: sdsj2018_automl_check_datasets.zip
docker pull rekcahd/sdsj2018