This is the final project of the Udacity Azure ML Nanodegree.
In this project, we will create two models on the 'IBM HR Analytics Employee Attrition & Performance' dataset: one using Automated ML (denoted AutoML from now on) and one customized model whose hyperparameters are tuned using HyperDrive. We will then compare the performance of both models and deploy the best performing one.
Attrition has always been a major concern in any organization. The IBM HR Attrition Case Study is a fictional dataset that aims to identify the important factors influencing which employees might leave the firm and which might stay. The dataset contains the following columns:
- Age
- Attrition
- BusinessTravel
- DailyRate
- Department
- DistanceFromHome
- Education
- EducationField
- EmployeeCount
- EmployeeNumber
- EnvironmentSatisfaction
- Gender
- HourlyRate
- JobInvolvement
- JobLevel
- JobRole
- JobSatisfaction
- MaritalStatus
- MonthlyIncome
- MonthlyRate
- NumCompaniesWorked
- Over18
- OverTime
- PercentSalaryHike
- PerformanceRating
- RelationshipSatisfaction
- StandardHours
- StockOptionLevel
- TotalWorkingYears
- TrainingTimesLastYear
- WorkLifeBalance
- YearsAtCompany
- YearsInCurrentRole
- YearsSinceLastPromotion
- YearsWithCurrManager
The dataset consists of 35 columns, from which we aim to predict whether or not an employee will leave the job. This is a binary classification problem, where the outcome 'Attrition' is either 'true' or 'false'. In this experiment, we will use HyperDrive and AutoML to find the best prediction for the given dataset. We will then deploy the model with the best prediction and interact with the deployment.
The dataset is available on Kaggle as the 'IBM HR Analytics Employee Attrition & Performance' dataset, but for this project the dataset has been uploaded to GitHub and is accessed through the following URI: 'https://raw.githubusercontent.com/manas-v/Capstone-Project-Azure-Machine-Learning-Engineer/main/WA_Fn-UseC_-HR-Employee-Attrition.csv'
We then use TabularDatasetFactory's Dataset.Tabular.from_delimited_files() to read the data from the URL, and save it to the datastore using dataset.register().
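A minimal sketch of these two steps, assuming a workspace config file is available locally and using an illustrative dataset name:

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
url = "https://raw.githubusercontent.com/manas-v/Capstone-Project-Azure-Machine-Learning-Engineer/main/WA_Fn-UseC_-HR-Employee-Attrition.csv"

# Read the CSV into a TabularDataset, then register it in the workspace
dataset = Dataset.Tabular.from_delimited_files(path=url)
dataset = dataset.register(workspace=ws, name="hr-employee-attrition")
```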
AutoML, or Automated ML, is the process of automating the task of machine learning model development. Using this feature, you can identify the best ML model, and the hyperparameters, suited to your problem statement.
This is a binary classification problem with the label column 'Attrition' taking the values 'true' or 'false'. The experiment timeout is 20 minutes, a maximum of 5 iterations run concurrently, and the primary metric for the run is AUC_weighted.
The AutoML configurations used for this experiment are:
Configuration | Value | Explanation |
---|---|---|
experiment_timeout_minutes | 20 | Maximum amount of time in minutes that all iterations combined can take before the experiment terminates |
max_concurrent_iterations | 5 | Represents the maximum number of iterations that would be executed in parallel |
primary_metric | AUC_weighted | The metric that Automated Machine Learning will optimize for model selection |
compute_target | cpu_cluster(created) | The Azure Machine Learning compute target to run the Automated Machine Learning experiment on |
task | classification | The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve |
training_data | dataset(imported) | The training data to be used within the experiment |
label_column_name | Attrition | The name of the label column |
path | ./capstone-project | The full path to the Azure Machine Learning project folder |
enable_early_stopping | True | Whether to enable early termination if the score is not improving in the short term |
featurization | auto | Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used |
debug_log | automl_errors.log | The log file to write debug information to |
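A sketch of how these settings map onto an AutoMLConfig, assuming `dataset` and `cpu_cluster` were created in the earlier steps:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    training_data=dataset,            # registered TabularDataset from above
    label_column_name="Attrition",
    primary_metric="AUC_weighted",
    compute_target=cpu_cluster,       # compute cluster created earlier
    experiment_timeout_minutes=20,
    max_concurrent_iterations=5,
    enable_early_stopping=True,
    featurization="auto",
    path="./capstone-project",
    debug_log="automl_errors.log",
)
```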
After running the AutoML pipeline, the best performing model is found to be VotingEnsemble, with an AUC_weighted value of 0.83328615. VotingEnsemble combines conceptually different machine learning classifiers and uses a majority vote (hard vote) or the average of the predicted probabilities (soft vote) to predict the class labels. This method balances out the individual weaknesses of the considered classifiers.
The AutoML voting classifier for this run is made up of a combination of 11 classifiers with different hyperparameter values and normalization/scaling techniques. Ten of the 11 estimators carry a weight of 1/13 (≈0.0769) each, and one carries a weight of 3/13 (≈0.2308).
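To illustrate how soft voting works, here is a scikit-learn equivalent; this is not the actual AutoML pipeline, and the estimators and weights are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Soft voting averages the predicted class probabilities, weighted per
# estimator, and picks the class with the highest average probability.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=10)),
        ("dt", DecisionTreeClassifier(max_depth=4)),
    ],
    voting="soft",
    weights=[1, 3, 1],  # mirrors AutoML giving one estimator a higher weight
)
```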
Consider the RandomForestClassifier, the estimator with the highest weight (3/13 ≈ 0.2308). The hyperparameters generated for this model are:
```
14 - maxabsscaler
{'copy': True}

14 - randomforestclassifier
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'log2',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 0.01,
 'min_samples_split': 0.01,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
```
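For reference, a scikit-learn pipeline rebuilt from the values in the dump above would look roughly like this; a sketch, not the exact object AutoML produces:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Scaling step followed by the forest, with hyperparameters from the dump
rf_pipeline = Pipeline([
    ("maxabsscaler", MaxAbsScaler(copy=True)),
    ("randomforestclassifier", RandomForestClassifier(
        criterion="gini",
        max_features="log2",
        min_samples_leaf=0.01,   # fraction of samples required per leaf
        min_samples_split=0.01,  # fraction of samples required to split
        n_estimators=10,
        n_jobs=1,
    )),
])
```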
*Screenshots: output of the RunDetails() widget, visualization of the results, and the best performing model.*
Possible improvements for the AutoML experiment:
- Cross-validation - Change the number of cross-validation folds in the AutoML run.
- Primary metric - Try other primary metrics as well, in case they are better suited to the model.
- AutoML configurations - Use different AutoML configurations (experiment timeout, max concurrent iterations, etc.) and observe the change in results.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
Since the problem statement is a binary classification problem, the model used is the DecisionTreeClassifier. Decision trees are simple to understand and interpret, easy to visualize, and require little data preparation.
The HyperDrive configuration used for this experiment is as follows:
In this experiment, the early stopping policy used is the Bandit policy. The Bandit policy is based on the difference in performance from the current best run, called 'slack': any run whose primary metric is not within the specified slack factor/slack amount of the best performing run is terminated.

```python
from azureml.train.hyperdrive import BanditPolicy

early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=3)
```
In this experiment, the parameter sampler used is random sampling. In random sampling, hyperparameter values are randomly selected from the given search space. Random sampling is chosen because it supports discrete and continuous hyperparameters as well as early termination of low-performance runs.

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice

param_sampling = RandomParameterSampling({
    "--criterion": choice("gini", "entropy"),
    "--splitter": choice("best", "random"),
    "--max_depth": choice(3, 4, 5, 6, 7, 8, 9, 10),
})
```
In this experiment, the estimator used is SKLearn; it is invoked with the sampled hyperparameters.

```python
from azureml.train.sklearn import SKLearn

estimator = SKLearn(source_directory=".", compute_target=cpu_cluster, entry_script="train.py")
```
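The sampled values are passed to train.py as command-line arguments. The script itself is not reproduced here; a minimal sketch of how it might consume them, assuming the training split X_train, y_train is prepared inside the script:

```python
# train.py (sketch)
import argparse

from azureml.core.run import Run
from sklearn.tree import DecisionTreeClassifier

parser = argparse.ArgumentParser()
parser.add_argument("--criterion", type=str, default="gini")
parser.add_argument("--splitter", type=str, default="best")
parser.add_argument("--max_depth", type=int, default=None)
args = parser.parse_args()

run = Run.get_context()
model = DecisionTreeClassifier(
    criterion=args.criterion, splitter=args.splitter, max_depth=args.max_depth
)
# ... fit on the training split and log the primary metric, e.g.:
# model.fit(X_train, y_train)
# run.log("AUC_weighted", auc)
```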
HyperDrive configuration
Configuration | Value | Explanation |
---|---|---|
hyperparameter_sampling | param_sampling | The hyperparameter sampling space defined above |
policy | early_termination_policy | The early termination policy to use |
primary_metric_name | AUC_weighted | The name of the primary metric reported by the experiment runs |
primary_metric_goal | PrimaryMetricGoal.MAXIMIZE | One of maximize / minimize. It determines if the primary metric has to be minimized/maximized in the experiment runs' evaluation |
max_total_runs | 12 | Maximum number of runs. This is the upper bound |
max_concurrent_runs | 4 | Maximum number of runs to run concurrently. |
estimator | estimator (SKLearn) | An estimator that will be called with sampled hyperparameters |
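A sketch of these settings assembled into a HyperDriveConfig, using the policy, sampler, and estimator defined above:

```python
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

hyperdrive_config = HyperDriveConfig(
    estimator=estimator,
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="AUC_weighted",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=12,
    max_concurrent_runs=4,
)
```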
The Hyperparameters for the Decision Tree are:
Hyperparameter | Value | Explanation |
---|---|---|
criterion | choice("gini", "entropy") | The function to measure the quality of a split. |
splitter | choice("best", "random") | The strategy used to choose the split at each node. |
max_depth | choice(3,4,5,6,7,8,9,10) | The maximum depth of the tree. |
The best result using HyperDrive was a Decision Tree with the parameter values criterion = 'gini', max_depth = 4, and splitter = 'best'. The AUC_weighted of the best run is 0.7214713617767388.
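A sketch of how this best run and its metrics can be retrieved, assuming `hyperdrive_run` is the submitted HyperDrive run:

```python
# Retrieve the best child run by the primary metric (AUC_weighted)
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.id)
print(best_run.get_metrics())
```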
*Screenshots: output of the RunDetails() widget, visualization of the results, and the best performing model.*
Possible improvements for the HyperDrive experiment:
- Model selection - Apply a different classification algorithm.
- Sampling - Implement other parameter sampling methods over the hyperparameter space, e.g. grid sampling or Bayesian sampling.
- Early termination - Use other early termination policies, such as the median stopping policy or the truncation selection policy, to keep the run time- and cost-efficient while still achieving the best results.
- Resource allocation - Allocate resources differently in terms of max_total_runs, max_duration_minutes, or max_concurrent_runs in the HyperDrive configuration.
The AutoML model achieved an AUC_weighted of 0.83328615, while the HyperDrive model achieved 0.7214713617767388. Since AutoML produced the better result, we will deploy its model.
The workflow for deploying a model in Azure Machine Learning is as follows (a minimal code sketch follows the list):
- Register the model - A registered model is a logical container for one or more files that make up your model. Here we use the registered AutoML model.
- Prepare an inference configuration (unless using no-code deployment) - An inference configuration describes how to set up the web-service containing your model.
- Prepare an entry script (unless using no-code deployment) - The entry script receives data submitted to a deployed web service and passes it to the model. It then takes the response returned by the model and returns that to the client.
- Choose a compute target - The compute target you use to host your model will affect the cost and availability of your deployed endpoint.
- Deploy the model to the compute target - Before deploying your model, you must define the deployment configuration, which is specific to the compute target that will host the web service. The model can then be deployed.
- Test the resulting web service - You can test the model by querying the endpoint with sample input data and receiving a JSON response.
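A minimal sketch of these steps with the Python SDK, assuming the model was registered as 'best-automl-model', a score.py entry script exists, and Azure Container Instances (ACI) is the deployment target; all names are illustrative:

```python
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

# 1. The registered AutoML model (name is illustrative)
model = Model(ws, name="best-automl-model")

# 2./3. Inference configuration: entry script plus a curated environment
env = Environment.get(ws, name="AzureML-AutoML")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# 4./5. Define the deployment configuration and deploy to ACI
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Model.deploy(ws, "attrition-endpoint", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```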
*Screenshot: the deployed endpoint in a Healthy state.*
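Once the deployment is healthy, the endpoint can be tested by posting sample input to the scoring URI. A sketch, reusing `service` from the deployment step above; the feature names follow the dataset columns, and the values and truncated payload are illustrative:

```python
import json

import requests

# One record with the dataset's feature columns; a real request must
# include every feature column the model was trained on.
data = {"data": [{
    "Age": 35,
    "BusinessTravel": "Travel_Rarely",
    "DailyRate": 800,
    "Department": "Research & Development",
    # ... remaining columns omitted for brevity
}]}

response = requests.post(
    service.scoring_uri,
    data=json.dumps(data),
    headers={"Content-Type": "application/json"},
)
print(response.json())
```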
Application Insights is used to detect anomalies, visualize performance, etc. It can be enabled before or after a deployment is created. For this experiment, we enable logging for the deployed model by running the logs.py script.
*Screenshot: logging enabled for the endpoint.*
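The contents of logs.py are not reproduced here; a minimal sketch of what enabling Application Insights and retrieving logs might look like (the service name is illustrative):

```python
from azureml.core import Workspace
from azureml.core.webservice import Webservice

ws = Workspace.from_config()
service = Webservice(ws, name="attrition-endpoint")

# Enable Application Insights on the deployed endpoint
service.update(enable_app_insights=True)

# Print the web service logs
print(service.get_logs())
```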
Link to Screen Recording: https://youtu.be/zp1xjkhsK9k