Machine-Learning: A Jupyter Notebook repository from jay-pm

Machine-Learning

1. Missing data
Why bother about missing data?

How you handle missing values can introduce bias, so handling them appropriately reduces that risk.
Most ML algorithms require complete data (otherwise an error is raised).
Note: the best approach is to try all of the techniques below for filling missing values and adopt the one that has the least impact on the variance of the dataset.
Omission (removing rows or columns can discard too much data)
- remove rows >> .dropna(axis=0)
- remove columns >> .dropna(axis=1)
Imputation
- fill with zero >> SimpleImputer(strategy='constant', fill_value=0) [can bias results downwards]
- fill with mean >> SimpleImputer(strategy='mean') [sensitive to outliers]
- fill with median >> SimpleImputer(strategy='median') [more robust when outliers are present]
- fill with mode >> SimpleImputer(strategy='most_frequent') [varying degrees of helpfulness]
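
The omission and imputation options above can be tried side by side. The sketch below is only an illustration: the toy DataFrame, its column names, and the choice to compare sample variance on a single column are assumptions, not something taken from the notebooks in this repository.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: a numeric column with two missing values and one outlier.
df = pd.DataFrame({
    "age": [22.0, 25.0, np.nan, 31.0, 95.0, np.nan],
    "city": ["A", "B", "B", "A", "B", "A"],
})

# Omission: drop rows (axis=0) or columns (axis=1) that contain any NaN.
rows_dropped = df.dropna(axis=0)
cols_dropped = df.dropna(axis=1)
print("rows kept:", len(rows_dropped), "| columns kept:", list(cols_dropped.columns))

# Imputation: one imputer per strategy listed above.
strategies = {
    "zero": SimpleImputer(strategy="constant", fill_value=0),
    "mean": SimpleImputer(strategy="mean"),
    "median": SimpleImputer(strategy="median"),
    "mode": SimpleImputer(strategy="most_frequent"),
}

# Variance of the observed values only (pandas skips NaN by default).
original_var = df["age"].var()

# Fill the column with each strategy and record the resulting sample variance.
variance_by_strategy = {}
for name, imputer in strategies.items():
    filled = imputer.fit_transform(df[["age"]]).ravel()
    variance_by_strategy[name] = float(np.var(filled, ddof=1))

for name, var in variance_by_strategy.items():
    print(f"{name:>6}: variance {var:10.2f} (observed-only variance {original_var:.2f})")
```

On real data, the same loop can be run per column, and the strategy whose variance stays closest to the observed-only variance is a reasonable first candidate.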

2. Feature selection
Selecting the right features is important because it:

reduces overfitting (by removing unimportant features that contribute noise but no information)
improves accuracy (since potentially misleading data is removed)
increases interpretability (because the model is less complex)
reduces training time (less data takes less time to train on)
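
The text above does not name a specific selection method, so the following is a minimal sketch under assumptions: it uses scikit-learn's `SelectKBest` with the ANOVA F-score (`f_classif`) on synthetic data, and the choice of `k=5` is arbitrary. It only shows how dropping uninformative columns shrinks the feature matrix.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data (an assumption for illustration): 20 features, only 5 informative.
X, y = make_classification(
    n_samples=500,
    n_features=20,
    n_informative=5,
    n_redundant=0,
    shuffle=False,
    random_state=0,
)

# Keep the 5 features with the highest ANOVA F-score against the target.
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Indices of the columns that survived selection.
kept = selector.get_support(indices=True)
print("kept feature indices:", kept)
print("shape before:", X.shape, "| shape after:", X_selected.shape)
```

Univariate scoring like this is only one option; model-based or wrapper methods may suit a given dataset better.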

, Regularization, Feature engineering

Cluster algorithm selection, Feature extraction, Dimension reduction

Model generalization and evaluation, Model selection XXX