Improvements-to-ML-algorithms

Application of anomaly detection and imputation algorithms to improve the LightGBM regressor.

This is the repository for the project "Methods of detecting anomalies for improving the quality of machine learning algorithms."

Since anomaly detection algorithms cannot work with categorical values or NaNs, encoding and imputation models are applied first. These preprocessing steps are a prerequisite for running the anomaly detection algorithms, and studying their effect is the purpose of this research.
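A minimal sketch of this prerequisite preprocessing, assuming the category_encoders package for target encoding (the repository's exact code may differ):

```python
import pandas as pd
from category_encoders import TargetEncoder
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer

def preprocess(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    # 1. Encode categorical features: anomaly detectors cannot handle strings.
    X_enc = TargetEncoder().fit_transform(X, y)
    # 2. Impute NaNs: anomaly detectors cannot handle missing values either.
    imputed = SimpleImputer(strategy="median").fit_transform(X_enc)
    return pd.DataFrame(imputed, index=X_enc.index, columns=X_enc.columns)

# Only after both steps can a detector such as IsolationForest be fit:
# IsolationForest().fit(preprocess(X, y))
```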


Pipeline

Total number of combinations of encoding, imputation and anomaly detection models: 111.
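The grid is built as a Cartesian product of the three model families. The lists below are placeholders to illustrate the construction; they do not reproduce the exact grid behind the 111 combinations:

```python
from itertools import product

encoders = ["target", "one-hot", "ordinal"]
imputers = ["median", "mode", "knn", "mice"]
detectors = ["none", "elliptic-envelope", "one-class-svm",
             "isolation-forest", "local-outlier-factor"]

# every (encoder, imputer, detector) triple is one pipeline to evaluate
for combo in product(encoders, imputers, detectors):
    print(combo)
```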


Performance evaluation & dataset description

The performance of each method is measured by its effect on the final RMSE when solving the regression problem with LightGBM. The mean and standard deviation of the RMSE over 10-fold cross-validation are reported.
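A minimal sketch of this evaluation protocol, assuming a default LGBMRegressor (the fold seeding and hyperparameters here are illustrative):

```python
from lightgbm import LGBMRegressor
from sklearn.model_selection import KFold, cross_val_score

def rmse_cv(X, y, n_splits=10, seed=42):
    scores = cross_val_score(
        LGBMRegressor(random_state=seed), X, y,
        scoring="neg_root_mean_squared_error",
        cv=KFold(n_splits=n_splits, shuffle=True, random_state=seed),
    )
    return -scores.mean(), scores.std()  # (mean RMSE, std of RMSE)
```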

Evaluation metrics were calculated on the "house-prices-advanced-regression-techniques" dataset. Missing values: 6965 / 115340 (6.039%). The following ratios of missing ("NaN") data are used:

  1. 6% of missed data (Initial dataset)
  2. 30% of missed data
  3. 50% of missed data

(Every row of the "house-prices-advanced-regression-techniques" dataset contains missing values, so Complete-Case Analysis is not applicable. A sketch of one way to raise the missing-data ratio follows below.)
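One plausible way to inject extra missingness up to a target ratio such as 0.30 or 0.50 (this is an assumption, not necessarily the repository's exact procedure):

```python
import numpy as np
import pandas as pd

def add_missing(X: pd.DataFrame, target_ratio: float, seed: int = 42) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    mask = X.isna().to_numpy().ravel()          # existing NaN positions
    n_target = int(target_ratio * mask.size)    # desired number of NaN cells
    n_extra = max(0, n_target - mask.sum())     # how many more cells to blank
    extra = rng.choice(np.flatnonzero(~mask), size=n_extra, replace=False)
    mask[extra] = True
    cond = pd.DataFrame(mask.reshape(X.shape), index=X.index, columns=X.columns)
    return X.mask(cond)                         # True cells become NaN
```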

The results of the experiments are available in the file: house-prices-scheme-experiments.xlsx


Experiments

Best combinations for anomaly detection algorithms (the signed "importance of the anomaly indicator" is illustrated in the sketch after the tables):

6% of missing data (initial dataset)

| # | Combination | RMSE (std) | Importance of the anomaly indicator (std) |
| --- | --- | --- | --- |
| 1 | Default model | 27582.549 (5511.916) | - |
| 2 | Target Encoding + Median imputation | 27535.075 (5445.082) | - |
| 3 | Elliptic Envelope (Target Encoding + Median imputation) | 27517.042 (5544.918) | 0.0000518 (0.0000541) |
| 4 | One-Class SVM (Target Encoding + Median imputation) | 27467.854 (5394.370) | -0.0000723 (0.0001001) |
| 5 | Isolation Forest (Target Encoding + Mode imputation) | 27451.540 (5365.327) | -0.0016849 (0.0059135) |
| 6 | Local Outlier Factor (Target Encoding + K-NN imputation) | 27537.623 (5437.554) | 0.0000096 (0.0000337) |
| 7 | MICE (best) | 27391.502 (5490.600) | - |

30% of missing data

| # | Combination | RMSE (std) | Importance of the anomaly indicator (std) |
| --- | --- | --- | --- |
| 1 | Default model | 33165.202 (4617.468) | - |
| 2 | Target Encoding + MICE | 30188.023 (5813.186) | - |
| 3 | Elliptic Envelope (Target Encoding + MICE) | 30094.619 (5741.747) | -0.0003155 (0.0000924) |
| 4 | One-Class SVM (Target Encoding + MICE) | 30142.952 (5766.397) | -0.0004070 (0.0003780) |
| 5 | Isolation Forest (Target Encoding + MICE) (best) | 30037.433 (5642.734) | -0.0095051 (0.0035107) |
| 6 | Local Outlier Factor (Target Encoding + MICE) | 30108.693 (5764.039) | -0.0000030 (0.0000933) |

50% of missing data

| # | Combination | RMSE (std) | Importance of the anomaly indicator (std) |
| --- | --- | --- | --- |
| 1 | Default model | 40571.481 (5936.180) | - |
| 2 | Target Encoding + MICE | 35892.779 (5715.790) | - |
| 3 | Elliptic Envelope (Target Encoding + MICE) | 35725.265 (5618.712) | 0.0001082 (0.0001855) |
| 4 | One-Class SVM (Target Encoding + MICE) (best) | 35629.558 (5657.682) | -0.0003512 (0.0002492) |
| 5 | Isolation Forest (Target Encoding + MICE) | 35720.589 (5642.171) | -0.0181765 (0.0067545) |
| 6 | Local Outlier Factor (Target Encoding + MICE) | 35775.763 (5683.544) | 0.0001991 (0.0001781) |
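The importance of the anomaly indicator is signed in the tables, which suggests a permutation-style importance (gain-based importances cannot be negative). A minimal sketch, assuming the detector's prediction is appended as an extra feature and scored with scikit-learn's permutation_importance; the split and hyperparameters are illustrative:

```python
from lightgbm import LGBMRegressor
from sklearn.ensemble import IsolationForest
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def anomaly_indicator_importance(X, y, seed=42):
    """Append an anomaly flag as a feature and measure its permutation importance."""
    X = X.copy()
    # fit_predict returns -1 for anomalies and 1 for inliers;
    # X must already be encoded and imputed at this point.
    X["anomaly"] = IsolationForest(random_state=seed).fit_predict(X)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = LGBMRegressor(random_state=seed).fit(X_tr, y_tr)
    result = permutation_importance(model, X_te, y_te, random_state=seed)
    idx = list(X.columns).index("anomaly")
    return result.importances_mean[idx], result.importances_std[idx]
```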

Results

  1. Using the pipeline described above is beneficial: the RMSE improves in all cases.
  2. Best preprocessing combination: Target Encoding + MICE (a sketch follows this list).
  3. Anomaly detection methods reduce the RMSE compared to both the default data and encoding & imputation alone.
  4. The previous and the current ways of processing categorical features lead to different results.
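
A sketch of the winning preprocessing from item 2, assuming scikit-learn's IterativeImputer as the MICE implementation and the category_encoders TargetEncoder (both are assumptions about the exact tooling):

```python
from category_encoders import TargetEncoder
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.pipeline import make_pipeline

# Target Encoding followed by MICE-style iterative imputation
preprocessing = make_pipeline(TargetEncoder(), IterativeImputer(random_state=42))
# X_clean = preprocessing.fit_transform(X, y)
```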

(Open question) Even when data are dropped, the standard deviation of the RMSE does not increase. Does this mean that the dataset contains enough data and the feature distributions are well described?