This is the first of three projects required for fulfilment of the Machine Learning Engineer with Microsoft Azure Nanodegree from Udacity. In this project, we build and optimize an Azure ML pipeline using the Python SDK and a provided Scikit-learn model. This model is then compared to an Azure AutoML run.
You can find more information about Azure AutoML here:
The data used in this project is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y). The dataset consists of 20 input variables (columns) and 32,950 rows, with 3,692 positive classes and 29,258 negative classes.
The data used in this project can be found here:
Detailed description of the dataset can be found here:
The pipeline architecture covers the data, hyperparameter tuning, and the classification algorithm. We use the Logistic Regression algorithm from the Scikit-learn framework in conjunction with HyperDrive for hyperparameter tuning.
The pipeline consists of the following steps:
- Data collection
- Data cleaning
- Data splitting
- Hyperparameter sampling
- Model training
- Model testing
- Early stopping policy evaluation
- Saving the model
We use a script, train.py, to govern steps 1-3, 5, 6, and 8, whereas steps 4 and 7 are governed by HyperDrive. The overall execution of the pipeline is managed by HyperDrive. A brief description of each step is provided below.
Data collection
The dataset is collected from the link provided earlier, using TabularDatasetFactory.
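A minimal sketch of this step, assuming the public CSV link referenced above (the `DATA_URL` value below is a placeholder, not the actual link):

```python
from azureml.data.dataset_factory import TabularDatasetFactory

# Placeholder for the bank marketing CSV link referenced above.
DATA_URL = "<bank-marketing-csv-url>"

# Create a TabularDataset directly from the delimited file and
# materialize it as a pandas DataFrame for cleaning.
ds = TabularDatasetFactory.from_delimited_files(path=DATA_URL)
df = ds.to_pandas_dataframe()
```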
Data cleaning
This process involves dropping rows with empty values and one-hot encoding the categorical columns.
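The exact cleaning logic lives in train.py; the snippet below is only an illustrative sketch of the two operations described (dropping empty rows and one-hot encoding), assuming the label column is named y as in the dataset description:

```python
import pandas as pd

def clean_data(df: pd.DataFrame) -> pd.DataFrame:
    # Drop rows that contain empty values.
    df = df.dropna()
    # One-hot encode the categorical (object-typed) columns,
    # leaving the label column "y" untouched.
    categorical_cols = df.select_dtypes(include="object").columns.drop("y", errors="ignore")
    return pd.get_dummies(df, columns=list(categorical_cols))
```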
Data splitting
As a standard practice, the dataset is split into train and test sets. This split is helpful to validate/tune our model. For this experiment we split 70/30: 70% for training and 30% for testing.
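A sketch of the 70/30 split, assuming the cleaned DataFrame from the previous step and that the raw labels are the strings "yes"/"no" as in the original dataset:

```python
from sklearn.model_selection import train_test_split

# Separate the binary target from the one-hot encoded features.
y = df.pop("y").map({"yes": 1, "no": 0})  # assumed label encoding
x = df

# 70% for training, 30% for testing (random_state is illustrative).
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42
)
```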
Hyperparameter selection
Hyperparameters are adjustable parameters that let you control the model training process. This is a recurring step for each iteration of model training, controlled by HyperDrive.
There are two hyperparameters for this experiment, C and max_iter. C is the inverse of the regularization strength, whereas max_iter is the maximum number of iterations allowed for the Scikit-learn Logistic Regression solver to converge.
We have used random parameter sampling to sample over a discrete set of values. Random parameter sampling is great for discovery and getting hyperparameter combinations that you would not have guessed intuitively, although it often requires more time to execute.
The parameter search space used for C is [1, 2, 3, 4, 5] and for max_iter is [80, 100, 120, 150, 170, 200].
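A sketch of how this discrete search space can be declared with the HyperDrive SDK; the argument names `--C` and `--max_iter` are assumptions about how train.py exposes its parameters:

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice

# Random sampling over the discrete search space listed above.
param_sampling = RandomParameterSampling(
    {
        "--C": choice(1, 2, 3, 4, 5),
        "--max_iter": choice(80, 100, 120, 150, 170, 200),
    }
)
```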
Model training
Once the train and test datasets are available and the hyperparameters for a particular iteration have been selected, we are all set to train our model. This process is also called model fitting.
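The core of train.py for this step reduces to fitting a Scikit-learn Logistic Regression with the sampled hyperparameters. The sketch below assumes the `--C`/`--max_iter` argument names from the sampling sketch and the `x_train`/`y_train` split from earlier:

```python
import argparse
from sklearn.linear_model import LogisticRegression

# HyperDrive passes the sampled hyperparameters as script arguments.
parser = argparse.ArgumentParser()
parser.add_argument("--C", type=float, default=1.0)
parser.add_argument("--max_iter", type=int, default=100)
args = parser.parse_args()

# Fit the model with the hyperparameters sampled for this iteration.
model = LogisticRegression(C=args.C, max_iter=args.max_iter)
model.fit(x_train, y_train)
```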
Model testing
The test dataset from the previous split is used to test the trained model; metrics are generated and logged, and these metrics are then used to benchmark the model. In our case we use accuracy as the model performance benchmark.
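A sketch of the testing and logging step; the metric name "Accuracy" is an assumption and must match the primary_metric_name used in the HyperDrive configuration:

```python
from azureml.core.run import Run

# Evaluate on the held-out 30% split and log the benchmark metric
# so HyperDrive can compare iterations.
run = Run.get_context()
accuracy = model.score(x_test, y_test)
run.log("Accuracy", float(accuracy))
```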
Early stopping policy evaluation
The benchmark metric from model testing is then evaluated using hyperDrive early stopping policy. Execution of the pipeline is stopped if conditions specified by the policy are met.
We have used the BanditPolicy. This policy is based on a slack factor/slack amount and an evaluation interval. Bandit terminates runs where the primary metric is not within the specified slack factor/slack amount compared to the best performing run. This helps to improve computational efficiency.
For this experiment the configuration used is evaluation_interval=1, slack_factor=0.2, and delay_evaluation=5. This configuration means that policy evaluation is delayed for the first 5 intervals and then applied at every interval; if 1.2 times the benchmark metric of the current iteration is still smaller than the best metric value so far, the run is cancelled.
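Putting the pieces together, a sketch of the early stopping policy and the HyperDrive configuration; the compute target, environment, and run limits are illustrative assumptions, not the exact values used:

```python
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal

# Early stopping policy with the configuration described above.
early_termination_policy = BanditPolicy(
    evaluation_interval=1, slack_factor=0.2, delay_evaluation=5
)

# ScriptRunConfig pointing at train.py; compute target and environment
# are assumed to already exist in the workspace.
src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    compute_target=compute_target,
    environment=sklearn_env,
)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,      # from the sampling sketch above
    policy=early_termination_policy,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,                           # illustrative limit
    max_concurrent_runs=4,
)
```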
Saving the model
The trained model is then saved; this is important if we want to deploy the model or use it in other experiments.
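A sketch of how the fitted estimator can be persisted from train.py; Azure ML automatically uploads everything written to the ./outputs folder with the run, and the file name below is an assumption:

```python
import os
import joblib

# The ./outputs folder is uploaded with the run artifacts by Azure ML.
os.makedirs("outputs", exist_ok=True)
joblib.dump(model, "outputs/hyperdrive_logit_model.joblib")
```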
AutoML uses the provided dataset to fit a wide variety of algorithms. It supports classification, regression, and time-series forecasting problems. An exit criterion is specified to stop the training, which ensures that resources are not used further once the objectives are met; this also saves cost.
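A sketch of an AutoML configuration consistent with this description; the timeout, cross-validation count, and compute target are illustrative assumptions rather than the exact values used:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    primary_metric="accuracy",
    training_data=ds,               # TabularDataset created earlier
    label_column_name="y",
    n_cross_validations=5,          # illustrative value
    experiment_timeout_minutes=30,  # exit criterion: stop after 30 minutes
    compute_target=compute_target,  # assumed to exist in the workspace
)
```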
In our experiment we found VotingEnsemble to be the best model based on the accuracy metric. The accuracy score for this model was 0.9169044006069802.
The VotingEnsemble consisted of six algorithms; the algorithms, their corresponding weightages, and a few of the individual parameters, including learning_rate, n_estimators, and random_state, are summarized in the table below. Further details of each individual algorithm can be found in the corresponding Jupyter Notebook.
Algorithm | Weightage | learning_rate | n_estimators | random_state |
---|---|---|---|---|
xgboostclassifier with maxabsscaler | 0.06666666666666667 | 0.1 | 100 | 0 |
lightgbmclassifier with maxabsscaler | 0.4666666666666667 | 0.1 | 100 | None |
xgboostclassifier with sparsenormalizer | 0.2 | 0.1 | 25 | 0 |
sgdclassifierwrapper with minmaxscaler | 0.06666666666666667 | constant | - | None |
sgdclassifierwrapper with standardscalerwrapper | 0.06666666666666667 | constant | - | None |
sgdclassifierwrapper with standardscalerwrapper | 0.13333333333333333 | balanced | - | None |
A voting ensemble is an ensemble machine learning model that combines the predictions from multiple other models.
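AutoML builds this ensemble internally as a pre-fitted soft-voting classifier; the short Scikit-learn sketch below only illustrates the general idea of weighted soft voting and is not the AutoML implementation (the member estimators and weights are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Weighted soft voting: each member's predicted class probabilities are
# averaged with the given weights, and the class with the highest
# weighted average probability is predicted.
ensemble = VotingClassifier(
    estimators=[
        ("logit", LogisticRegression(max_iter=200)),
        ("forest", RandomForestClassifier(n_estimators=100)),
        ("tree", DecisionTreeClassifier(max_depth=5)),
    ],
    voting="soft",
    weights=[0.5, 0.3, 0.2],  # analogous to the weightages in the table above
)
ensemble.fit(x_train, y_train)  # x_train/y_train from the split sketch above
```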
The model generated by AutoML had slightly higher accuracy than the HyperDrive model: 0.9169044006069802 for AutoML versus 0.912797167425392 for HyperDrive.
The architectures are different: HyperDrive was restricted to Logistic Regression from Scikit-learn, whereas AutoML has access to a wide variety of algorithms.
In some scenarios a certain model may not be the best fit, which puts HyperDrive at a disadvantage, since model selection is in the hands of the user; this is not the case with AutoML. Hence the difference in accuracy is explainable.
Improvements for hyperDrive
- Use Bayesian Parameter Sampling instead of Random; Bayesian sampling tries to intelligently pick the next sample of hyperparameters, based on how the previous samples performed, such that the new sample improves the reported primary metric (see the sketch after this list).
- We could use a different primary metric, as accuracy alone sometimes doesn't give a true picture of the model's performance.
- Increase max total runs to try many more combinations of hyperparameters; this would have an impact on cost too.
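A minimal sketch of swapping in Bayesian sampling over the same search space; note that Bayesian sampling in Azure ML does not support an early termination policy, so the BanditPolicy would be omitted from the HyperDrive configuration:

```python
from azureml.train.hyperdrive import BayesianParameterSampling, choice

# Same discrete search space as before, explored with Bayesian sampling,
# which picks the next candidates based on how previous runs performed.
bayesian_sampling = BayesianParameterSampling(
    {
        "--C": choice(1, 2, 3, 4, 5),
        "--max_iter": choice(80, 100, 120, 150, 170, 200),
    }
)
```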
Improvements for autoML
- Change the experiment timeout; this would allow for more model experimentation, but the longer runs may cost more.
- We could use a different primary metric, as accuracy alone sometimes doesn't give a true picture of the model's performance.
- Increasing the number of cross validations may reduce the bias in the model.
- Address the class imbalance: there are 3,692 positive classes versus 29,258 negative classes. This will reduce the model bias.