
Datitos - TP2 with steroids

Best results

  • Private Score: 0.90000
  • Public Score: 0.90038
  • File: study16-predict-2022-01-23_07-44-35.csv

Training Workflow

An Apache Airflow DAG already exists to automate the complete training process (training, report generation, Kaggle file generation). The DAG is a data workflow that runs N parallel training processes and then runs the report generation and Kaggle result generation steps.



DAG Script

Required Airflow global variables

  • project_path: /path/to/datitos/project
  • report_folds: 10
  • report_seeds_count: 30
  • train_cuda_process_memory_fraction: 0.1
  • train_device: gpu / cpu
  • train_folds: 5
  • train_optuna_db_url: mysql://root:1234@localhost/example (database used by Optuna to persist the study state).
  • train_optuna_study: study16 (Optuna study name).
  • train_optuna_timeout: 8000 (maximum time, in seconds, spent on hyperparameter optimization).
  • train_optuna_trials: 300
  • train_workers_count: 4

Parallel Training

Training can be distributed across N workers. Each worker acts as a trial executor: every trial trains a model with one specific set of hyperparameters. All hyperparameter/score pairs are stored in a MariaDB database, so the Optuna study can later be loaded to retrieve the hyperparameters with the highest score.

  • Each training process is one bin/train_model.py execution.
  • The optimization report step runs the bin/optmimization_report.py script.
  • The test model step runs the bin/test_model.py script.
  • See below for what each script does.

Run a worker as follows:


$ conda activate datitos 
$ python bin/train_model.py --device gpu \
                            --study study3 \
                            --cuda-process-memory-fraction 0.1 \
                            --folds 5 \
                            --trials 300 \
                            --db-url mysql://root:1234@localhost/example \
                            --timeout 5000

To run 10 workers, repeat the previous command in 10 separate shell sessions (bash/zsh).
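
The sessions cooperate because they all write to the same database. The sketch below illustrates that idea only: it is not Optuna's API or schema, but a stand-in built on the standard-library sqlite3 module, with hypothetical hyperparameter names (lr, hidden_units) and a fake scoring function in place of real cross-validated accuracy. Unlike this sketch, which samples at random, Optuna's samplers also use the stored trials to propose new hyperparameters.

```python
import json
import math
import random
import sqlite3
from typing import Dict, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the shared trial store (stand-in for the MariaDB/MySQL database)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trials ("
        "id INTEGER PRIMARY KEY, worker TEXT, params TEXT, score REAL)"
    )
    return conn


def fake_score(params: Dict[str, float]) -> float:
    """Hypothetical stand-in for the accuracy a real training run would return."""
    return 1.0 - abs(math.log10(params["lr"]) + 2.5) / 10.0


def run_worker(conn: sqlite3.Connection, worker: str, trials: int, seed: int) -> None:
    """One worker: sample hyperparameters, score them, store every pair."""
    rng = random.Random(seed)
    for _ in range(trials):
        params = {
            "lr": 10 ** rng.uniform(-4, -1),
            "hidden_units": rng.choice([32, 64, 128]),
        }
        conn.execute(
            "INSERT INTO trials (worker, params, score) VALUES (?, ?, ?)",
            (worker, json.dumps(params), fake_score(params)),
        )
    conn.commit()


def best_trial(conn: sqlite3.Connection) -> Tuple[str, Dict[str, float], float]:
    """Return (worker, hyperparameters, score) of the highest-scoring trial."""
    worker, params, score = conn.execute(
        "SELECT worker, params, score FROM trials ORDER BY score DESC LIMIT 1"
    ).fetchone()
    return worker, json.loads(params), score
```

Calling run_worker several times against the same store and then best_trial mirrors the workflow described here: many workers contribute trials, and the best hyperparameters are read back from one place.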

Workers can run on either the CPU or the GPU. A good configuration is usually N GPU workers plus at most one CPU worker, because CPU workers are CPU-intensive. The feasible number of workers depends on the CPU and GPU models and on the available GPU memory and RAM. CPU workers parallelize k-fold cross-validation to reduce response time; GPU workers cannot parallelize cross-validation.


$ conda activate datitos 
$ python bin/train.py --device cpu \
                      --study study3 \
                      --folds 5 \
                      --trials 300 \
                      --db-url mysql://root:1234@localhost/example \
                      --timeout 5000
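
Instead of opening one shell session per worker, the same commands can be started from a single Python script. The following is a minimal sketch under a few assumptions: it is run from the project root inside the activated datitos environment, the helper names (worker_command, plan_workers, launch, wait_all) are hypothetical, and the script path is a parameter because the commands in this document use both bin/train_model.py and bin/train.py.

```python
import subprocess
import sys
from typing import List, Optional, Sequence


def worker_command(device: str, study: str, db_url: str, folds: int = 5,
                   trials: int = 300, timeout: int = 5000,
                   cuda_fraction: float = 0.1,
                   script: str = "bin/train_model.py") -> List[str]:
    """Build the command line for one worker, mirroring the commands above."""
    cmd = [sys.executable, script,
           "--device", device,
           "--study", study,
           "--folds", str(folds),
           "--trials", str(trials),
           "--db-url", db_url,
           "--timeout", str(timeout)]
    if device == "gpu":
        # Only GPU workers receive the CUDA memory fraction flag.
        cmd += ["--cuda-process-memory-fraction", str(cuda_fraction)]
    return cmd


def plan_workers(gpu_workers: int, cpu_workers: int, **kwargs) -> List[List[str]]:
    """N GPU workers followed by the requested number of CPU workers."""
    return ([worker_command("gpu", **kwargs) for _ in range(gpu_workers)]
            + [worker_command("cpu", **kwargs) for _ in range(cpu_workers)])


def launch(commands: Sequence[Sequence[str]],
           cwd: Optional[str] = None) -> List[subprocess.Popen]:
    """Start every command as its own process without waiting for it."""
    return [subprocess.Popen(list(cmd), cwd=cwd) for cmd in commands]


def wait_all(processes: Sequence[subprocess.Popen]) -> List[int]:
    """Block until all workers finish and return their exit codes."""
    return [process.wait() for process in processes]
```

For example, wait_all(launch(plan_workers(3, 1, study="study3", db_url="mysql://root:1234@localhost/example"))) would start three GPU workers and one CPU worker against the same study and wait for all of them.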

To monitor workers, you can use any of the following tools:

See script help:

$ python bin/train.py --help

Usage: train.py [OPTIONS]

  --device TEXT                   Device used to train and optimize model.
                                  Values: gpu, cpu.
  --study TEXT                    The study name.
  --trials INTEGER                Max trials count.
  --timeout INTEGER               maximum time spent optimizing hyper
                                  parameters in seconds.
  --db-url TEXT                   Mariadb/MySQL connection url.
  --cuda-process-memory-fraction FLOAT
                                  Setup max memory user per CUDA procees.
                                  Percentage expressed between 0 and 1
  --folds INTEGER                 Number of train dataset splits to apply
                                  cross validation.
  --help                          Show this message and exit.

Optimization report

Generates plots for a specified Optuna study. The report for study16 (best accuracy) contains the following plots:

  • Validation accuracy distribution
  • Optimization contour diagram
  • Optimization EDF
  • Optimization history
  • Optimization parallel coordinates
  • Feature importance
  • Optimization trials accuracy distribution

$ conda activate datitos
$ python bin/optmimization_report.py \
    --study study6 \
    --db-url mysql://root:1234@localhost/example \
    --device gpu \
    --seeds-count 3 \
    --folds 2
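
One plausible reading of the --seeds-count and --folds options is that the report re-evaluates the best hyperparameters once per seed and fold and plots the resulting accuracies as a distribution. The sketch below illustrates that reading only; the evaluate callback, fake_evaluate and the summary fields are hypothetical, and the real script may compute the distribution differently.

```python
import random
import statistics
from typing import Callable, Dict, List


def accuracy_samples(evaluate: Callable[[int, int], float],
                     seeds_count: int, folds: int) -> List[float]:
    """Collect one validation accuracy per (seed, fold) combination."""
    return [evaluate(seed, fold)
            for seed in range(seeds_count)
            for fold in range(folds)]


def summarize(samples: List[float]) -> Dict[str, float]:
    """Basic statistics describing the accuracy distribution."""
    return {
        "n": len(samples),
        "mean": statistics.mean(samples),
        # stdev needs at least two samples.
        "stdev": statistics.stdev(samples) if len(samples) > 1 else 0.0,
        "min": min(samples),
        "max": max(samples),
    }


def fake_evaluate(seed: int, fold: int) -> float:
    """Hypothetical stand-in for training and validating one model."""
    rng = random.Random(seed * 1000 + fold)
    return 0.88 + rng.uniform(-0.02, 0.02)
```

With --seeds-count 3 and --folds 2, as in the command above, this reading yields 3 x 2 = 6 accuracy samples.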

See script help:

$ python bin/optmimization_report.py --help

Usage: optmimization_report.py [OPTIONS]

  --device TEXT          Device used to train and optimize model. Values: gpu,
                         cpu.
  --study TEXT           The study name.
  --db-url TEXT          Mariadb/MySQL connection url.
  --report-path TEXT     Path where save optimization plots.
  --seeds-count INTEGER  seeds count used calculate acuracy distribution
  --folds INTEGER        Number of train dataset splits to apply cross
                         validation.
  --help                 Show this message and exit.

Test model

This script runs N model training instances using the hyperparameters of the optimization trial with the best accuracy. It then takes the model with the highest accuracy and uses it to predict on the Kaggle test file. Finally, it generates the Kaggle file to upload.

$ conda activate datitos
$ python bin/test_model.py \
    --study study6 \
    --db-url mysql://root:1234@localhost/example \
    --device gpu
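
The selection and file generation steps can be sketched as follows. The file name pattern is inferred from the best result listed at the top of this document (study16-predict-2022-01-23_07-44-35.csv); everything else is an assumption made for illustration, in particular the CSV column names (id, target) and the helper names.

```python
import csv
import io
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional, Sequence, Tuple


def pick_best(runs: Sequence[Tuple[float, object]]) -> Tuple[float, object]:
    """Among (accuracy, model) pairs, keep the one with the highest accuracy."""
    return max(runs, key=lambda run: run[0])


def submission_filename(study: str, when: datetime) -> str:
    """File name in the form <study>-predict-<YYYY-MM-DD_HH-MM-SS>.csv."""
    return f"{study}-predict-{when:%Y-%m-%d_%H-%M-%S}.csv"


def render_submission(ids: Iterable, predictions: Iterable,
                      header: Tuple[str, str] = ("id", "target")) -> str:
    """Render predictions as CSV text (column names are hypothetical)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(zip(ids, predictions))
    return buffer.getvalue()


def save_submission(directory: str, study: str, ids: Iterable,
                    predictions: Iterable,
                    when: Optional[datetime] = None) -> Path:
    """Write the submission file into `directory` and return its path."""
    when = when or datetime.now()
    path = Path(directory) / submission_filename(study, when)
    with open(path, "w", newline="") as handle:
        handle.write(render_submission(ids, predictions))
    return path
```

Here pick_best stands for choosing the best of the N trained instances, and save_submission for producing the file that is uploaded to Kaggle.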

See script help:

$ python bin/test_model.py --help

Usage: test_model.py [OPTIONS]

  --device TEXT       Device used to train and optimize model. Values: gpu,
                      cpu.
  --study TEXT        The study name.
  --db-url TEXT       Mariadb/MySQL connection url.
  --result-path TEXT  path where test predictions are saved.
  --help              Show this message and exit.