This repository contains reproducible `python` source code for the final project of the team "Do not miss your value" at the Skoltech Machine Learning course. Team members:
- Vladislav Molodtsov
- Irina Shushpannikova
- Stepan Vasilev
- Kelvin Kutsukutsa
- Zhadyraiym Akunova

All the experiments are presented as largely self-explanatory Jupyter notebooks. For convenience, the raw and processed datasets used in the experiments are stored in this repository as well. The repository structure should be preserved so that all code in the notebooks runs without changing the relative paths to the files. The obtained results are included in the repository as `.csv` files and `.png` graphs and diagrams. For proper display of the pictures in this README, we recommend switching to the Light theme in GitHub settings.

- `Data preprocessing.ipynb` -- preprocessing of raw datasets: drop missing values and useless columns, rename target as `Target`;
- `Experiments.ipynb` -- main code for the pipelines of experiments; contains all implemented functions for adding noise, introducing missing values, imputing missing values, evaluating models and so on;
- `Results processing.ipynb` -- notebook for processing obtained results and building graphs and diagrams.
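
The preprocessing steps listed for `Data preprocessing.ipynb` can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function name `preprocess`, its parameters, and the example columns are assumptions made for the sketch.

```python
import pandas as pd


def preprocess(df, target_column, useless_columns=()):
    """Drop unwanted columns and incomplete rows, then rename the target."""
    # Drop the columns first, so that gaps in them do not remove rows.
    df = df.drop(columns=list(useless_columns))
    # Keep only rows without missing values.
    df = df.dropna(axis=0, how="any")
    # Use a uniform target name so that later code can rely on it.
    df = df.rename(columns={target_column: "Target"})
    return df.reset_index(drop=True)


# Hypothetical example data; these column names are not from the repository.
raw = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "sensor": [0.5, None, 0.7, 0.9],
        "temperature": [20.1, 21.3, 19.8, 22.4],
    }
)
clean = preprocess(raw, target_column="temperature", useless_columns=["id"])
print(clean)
```

A single target name keeps the experiment code independent of the dataset it is applied to.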

We used `Python 3.7.9` and the following versions of the libraries:

- `numpy` 1.21.5
- `pandas` 1.3.3
- `scipy` 1.7.3
- `sklearn` 1.0.2
- `matplotlib` 3.3.2
- `lightgbm` 3.3.2
- `miceforest` 5.3.0
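
To compare a local environment against the versions above, a small check along these lines may help. It is only a sketch: the helper names `installed_versions` and `find_mismatches` are made up here, and it assumes that each library exposes a `__version__` attribute (a library without one is reported as `None`).

```python
import importlib

# Versions listed above; keys are import names, not necessarily PyPI names.
EXPECTED_VERSIONS = {
    "numpy": "1.21.5",
    "pandas": "1.3.3",
    "scipy": "1.7.3",
    "sklearn": "1.0.2",
    "matplotlib": "3.3.2",
    "lightgbm": "3.3.2",
    "miceforest": "5.3.0",
}


def installed_versions(module_names):
    """Return {module name: version string, or None if unavailable}."""
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = None
            continue
        # Assumption: the version is exposed as `__version__`.
        versions[name] = getattr(module, "__version__", None)
    return versions


def find_mismatches(expected, installed):
    """Return {name: (expected, installed)} for every differing version."""
    return {
        name: (version, installed.get(name))
        for name, version in expected.items()
        if installed.get(name) != version
    }


# Illustration with made-up installed versions.
example_installed = {"numpy": "1.21.5", "pandas": "1.5.0"}
print(find_mismatches({"numpy": "1.21.5", "pandas": "1.3.3"}, example_installed))
```

In the project environment, `find_mismatches(EXPECTED_VERSIONS, installed_versions(EXPECTED_VERSIONS))` would return an empty dictionary when every library matches.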

- `/Raw datasets` -- raw datasets for both regression and classification problems with links to the sources;
- `/Datasets` -- processed datasets for both regression and classification problems; for convenience, the directory is divided into two parts: `Classification` with datasets for the classification problem and `Regression` with datasets for the regression problem; in total, there are 3 datasets for regression and 2 datasets for classification.

# | Dataset name | Problem | Description | Shape | Target |
---|---|---|---|---|---|
1 | Air temperatures | Regression | Predict air temperature by external data | (7588, 23) | Min value 17.4 Max value 38.9 |
2 | Air quality | Regression | Identify air quality by sensors data | (827, 13) | Min value 0.4 Max value 1.5 |
3 | Parkinson disease | Regression | Predict Parkinson disease by voice measurements | (5875, 22) | Min value 7 Max value 55 |
4 | Wine quality | Classification | Identify wine quality by physicochemical tests | (4898, 12) | 7 classes, from 5 to 2198 elements |
5 | Robot's sensors | Classification | Predict action by sensors data | (5455, 25) | 4 classes, from 328 to 2205 elements |

- `/Results` -- `.csv` files with the results of the experiments carried out; contains 3 files: `results_noise_only.csv` for experiments with noise only, `results_drop_only.csv` for experiments with missing values only, and `results_noise_and_drop.csv` for experiments with both noise and missing values;
- `/Graphs` -- `.png` files with graphs and diagrams reflecting the results of the experiments; there are 4 families of pictures:
  - `noise_reg_T_dataset_N_M.png` -- dependency of model scores on the noise level in the dataset, where `T` is `True` for regression and `False` for classification, `N` is the dataset number, `M` is `SNR` for Additive White Gaussian Noise (AWGN) added to the dataset or `p` for randomly changing every value in the dataset to another one;
  - `drop4model_reg_T_dataset_N_drop_L_model_K.png` -- dependency of distortion metrics on the rate of introduced missing values, where `L` is `1`, `2`, or `3` for different missing scenarios (Missing Completely At Random (MCAR), Missing At Random (MAR), Not Missing At Random (NMAR), respectively), `K` is the ML model number;
  - `drop_diagram_reg_T_dataset_N.png` -- radar diagrams for comparing different imputation methods in different missing scenarios;
  - `noisy_drop_diagram_reg_T_dataset_N_noise_Z_S.png` -- radar diagrams for comparing different imputation methods in different missing scenarios with different levels of noise, where `Z` is `1` for AWGN and `2` for random changing, `S` is the level of noise in decibels or in dropping probability, respectively.
- `Report.pdf` -- written report;
- `Presentation.pdf` -- presentation.
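
The two noise types and the MCAR scenario named in the graph descriptions above can be illustrated with NumPy. This is a hedged sketch rather than the code from `Experiments.ipynb`: the function names are hypothetical, signal power is assumed to be the per-column mean square, and "random changing" is assumed to mean replacing a value, with probability `p`, by a value drawn from the same column.

```python
import numpy as np


def add_awgn(X, snr_db, rng):
    """Add white Gaussian noise to each column at the given SNR in decibels."""
    X = np.asarray(X, dtype=float)
    # Assumption: signal power is the per-column mean squared value.
    signal_power = np.mean(X ** 2, axis=0)
    # SNR_dB = 10 * log10(P_signal / P_noise), solved for P_noise.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, 1.0, size=X.shape) * np.sqrt(noise_power)
    return X + noise


def random_change(X, p, rng):
    """With probability p, replace each value by a random value from its column."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n_rows, n_cols = X.shape
    mask = rng.random(X.shape) < p
    for j in range(n_cols):
        # Donor rows are sampled with replacement from the same column.
        donors = rng.integers(0, n_rows, size=n_rows)
        out[mask[:, j], j] = X[donors[mask[:, j]], j]
    return out


def drop_mcar(X, rate, rng):
    """Set each entry to NaN independently with probability `rate` (MCAR)."""
    X = np.asarray(X, dtype=float)
    out = X.copy()
    out[rng.random(X.shape) < rate] = np.nan
    return out


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=(1000, 3))
noisy = add_awgn(X, snr_db=10, rng=rng)
changed = random_change(X, p=0.2, rng=rng)
with_gaps = drop_mcar(X, rate=0.3, rng=rng)
print(np.isnan(with_gaps).mean())
```

The MAR and NMAR scenarios differ from this sketch in that the probability of a gap depends on other observed values or on the missing value itself, so they need a dataset-specific masking rule.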

# | Dataset name | Problem | Score name | Linear | DT | RF | LightGBM |
---|---|---|---|---|---|---|---|
1 | Air temperatures | Regression | MAPE | 0.04 | 0.05 | 0.04 | 0.04 |
2 | Air quality | Regression | MAPE | 0.06 | 0.10 | 0.08 | 0.06 |
3 | Parkinson disease | Regression | MAPE | 0.11 | 0.12 | 0.11 | 0.11 |
4 | Wine quality | Classification | F1-micro | 0.52 | 0.51 | 0.53 | 0.52 |
5 | Robot's sensors | Classification | F1-micro | 0.68 | 0.98 | 0.99 | 0.99 |
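
For reference, the two scores in the table can be computed as in the sketch below. It assumes that MAPE is reported as a fraction (so 0.04 corresponds to 4%) and that F1-micro is computed for single-label multiclass predictions, where it equals accuracy. The function names are illustrative; scikit-learn provides equivalents as `mean_absolute_percentage_error` and `f1_score(..., average="micro")`.

```python
import numpy as np


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.04 means 4%)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Assumes that no true value is zero.
    return float(np.mean(np.abs((y_true - y_pred) / y_true)))


def f1_micro(y_true, y_pred):
    """Micro-averaged F1: counts are pooled over all classes."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = fp = fn = 0
    for label in np.union1d(y_true, y_pred):
        tp += int(np.sum((y_pred == label) & (y_true == label)))
        fp += int(np.sum((y_pred == label) & (y_true != label)))
        fn += int(np.sum((y_pred != label) & (y_true == label)))
    return 2 * tp / (2 * tp + fp + fn)


# Made-up values for illustration only.
regression_score = mape([20.0, 25.0, 40.0], [21.0, 24.0, 38.0])
classification_score = f1_micro([0, 1, 2, 2], [0, 2, 2, 2])
print(regression_score, classification_score)
```

Lower is better for MAPE, higher is better for F1-micro.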

Figure: radar diagrams for comparing different imputation methods in different missing scenarios with different levels of noise.

- LightGBM - Gradient Boosting implementation;
- miceforest - MICE implementation;
- adasegroup - ML course at Skoltech;
- Wasserstein2GenerativeNetworks - used as an example of a good repository structure.