This is the final project of the Udacity Azure ML Nanodegree.
In this project, we will create two models on the 'IBM HR Analytics Employee Attrition & Performance' dataset: one using Automated ML (denoted AutoML from now on) and one customized model whose hyperparameters are tuned using HyperDrive. We will then compare the performance of both models and deploy the best performing one.
Attrition has always been a major concern in any organization. The IBM HR Attrition Case Study is a fictional dataset that aims to identify the important factors influencing which employees might leave the firm and which might stay. The dataset contains the following columns:
- Age
- Attrition
- BusinessTravel
- DailyRate
- Department
- DistanceFromHome
- Education
- EducationField
- EmployeeCount
- EmployeeNumber
- EnvironmentSatisfaction
- Gender
- HourlyRate
- JobInvolvement
- JobLevel
- JobRole
- JobSatisfaction
- MaritalStatus
- MonthlyIncome
- MonthlyRate
- NumCompaniesWorked
- Over18
- OverTime
- PercentSalaryHike
- PerformanceRating
- RelationshipSatisfaction
- StandardHours
- StockOptionLevel
- TotalWorkingYears
- TrainingTimesLastYear
- WorkLifeBalance
- YearsAtCompany
- YearsInCurrentRole
- YearsSinceLastPromotion
- YearsWithCurrManager
The dataset consists of 35 columns, from which we aim to predict whether or not an employee will leave the job. This is a binary classification problem, where the outcome 'Attrition' is either 'true' or 'false'. In this experiment, we will use HyperDrive and AutoML to find the best prediction for the given dataset. We will then deploy the model with the best prediction and interact with the deployment.
The dataset is available on Kaggle as the 'IBM HR Analytics Employee Attrition & Performance' dataset, but for this project the dataset has been uploaded to GitHub and is accessed through the following URI: 'https://raw.githubusercontent.com/manas-v/Capstone-Project-Azure-Machine-Learning-Engineer/main/WA_Fn-UseC_-HR-Employee-Attrition.csv'
We then use TabularDatasetFactory's Dataset.Tabular.from_delimited_files() to read the data from the URL, and save it to the datastore using dataset.register().
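A minimal sketch of these two steps, assuming a workspace config file is available locally and using an illustrative dataset name:

```python
from azureml.core import Dataset, Workspace

ws = Workspace.from_config()
url = "https://raw.githubusercontent.com/manas-v/Capstone-Project-Azure-Machine-Learning-Engineer/main/WA_Fn-UseC_-HR-Employee-Attrition.csv"

# Read the CSV into a TabularDataset, then register it in the workspace
dataset = Dataset.Tabular.from_delimited_files(path=url)
dataset = dataset.register(workspace=ws, name="hr-employee-attrition")
```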
AutoML, or Automated ML, is the process of automating the task of machine learning model development. Using this feature, you can identify the best ML model, and the hyperparameters, suited to your problem statement.
This is a binary classification problem with the label column 'Attrition' taking the values 'true' or 'false'. The experiment timeout is 20 minutes, a maximum of 5 iterations run concurrently, and the primary metric for the run is AUC_weighted.
The AutoML configurations used for this experiment are:
Configuration | Value | Explanation |
---|---|---|
experiment_timeout_minutes | 20 | Maximum amount of time in minutes that all iterations combined can take before the experiment terminates |
max_concurrent_iterations | 5 | Represents the maximum number of iterations that would be executed in parallel |
primary_metric | AUC_weighted | The metric that Automated Machine Learning will optimize for model selection |
compute_target | cpu_cluster(created) | The Azure Machine Learning compute target to run the Automated Machine Learning experiment on |
task | classification | The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve |
training_data | dataset(imported) | The training data to be used within the experiment |
label_column_name | Attrition | The name of the label column |
path | ./capstone-project | The full path to the Azure Machine Learning project folder |
enable_early_stopping | True | Whether to enable early termination if the score is not improving in the short term |
featurization | auto | Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used |
debug_log | automl_errors.log | The log file to write debug information to |
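A sketch of how these settings map onto an AutoMLConfig, assuming `dataset` and `cpu_cluster` were created in the earlier steps:

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(
    task="classification",
    training_data=dataset,            # registered TabularDataset from above
    label_column_name="Attrition",
    primary_metric="AUC_weighted",
    compute_target=cpu_cluster,       # compute cluster created earlier
    experiment_timeout_minutes=20,
    max_concurrent_iterations=5,
    enable_early_stopping=True,
    featurization="auto",
    path="./capstone-project",
    debug_log="automl_errors.log",
)
```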
After running the AutoML pipeline, the best performing model is found to be VotingEnsemble, with an AUC_weighted value of 0.83328615. VotingEnsemble combines conceptually different machine learning classifiers and uses a majority vote (hard vote) or the average of the predicted probabilities (soft vote) to predict the class labels. This method balances out the individual weaknesses of the considered classifiers.
The AutoML voting classifier for this run is made up of a combination of 11 classifiers with different hyperparameter values and normalization/scaling techniques. Ten of the 11 estimators carry a weight of 1/13 (≈0.0769) each, and one carries a weight of 3/13 (≈0.2308).
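To illustrate how soft voting works, here is a scikit-learn equivalent; this is not the actual AutoML pipeline, and the estimators and weights are illustrative:

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Soft voting averages the predicted class probabilities, weighted per
# estimator, and picks the class with the highest average probability.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=10)),
        ("dt", DecisionTreeClassifier(max_depth=4)),
    ],
    voting="soft",
    weights=[1, 3, 1],  # mirrors AutoML giving one estimator a higher weight
)
```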
Consider the RandomForestClassifier, the estimator with the highest weight (3/13 ≈ 0.2308). The hyperparameters generated for this model are:
```
14 - maxabsscaler
{'copy': True}

14 - randomforestclassifier
{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'log2',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 0.01,
 'min_samples_split': 0.01,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 10,
 'n_jobs': 1,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
```
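For reference, a scikit-learn pipeline rebuilt from the values in the dump above would look roughly like this; a sketch, not the exact object AutoML produces:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Scaling step followed by the forest, with hyperparameters from the dump
rf_pipeline = Pipeline([
    ("maxabsscaler", MaxAbsScaler(copy=True)),
    ("randomforestclassifier", RandomForestClassifier(
        criterion="gini",
        max_features="log2",
        min_samples_leaf=0.01,   # fraction of samples required per leaf
        min_samples_split=0.01,  # fraction of samples required to split
        n_estimators=10,
        n_jobs=1,
    )),
])
```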
*Screenshots: output of the RunDetails() widget, visualization of the results, and the best performing model.*
Possible improvements for the AutoML experiment:
- Cross-validation - Change the number of cross-validation folds in the AutoML run.
- Primary metric - Try other primary metrics as well, in case they are better suited to the model.
- AutoML configurations - Use different AutoML configurations (experiment timeout, max concurrent iterations, etc.) and observe the change in results.
Decision Trees (DTs) are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. A tree can be seen as a piecewise constant approximation.
Since the problem statement is a binary classification problem, the model used is the DecisionTreeClassifier. Decision trees are simple to understand and interpret, easy to visualize, and require little data preparation.
The HyperDrive configuration used for this experiment is as follows:
In this experiment, the early stopping policy used is the Bandit policy. The Bandit policy is based on the difference in performance from the current best run, called 'slack': any run whose primary metric is not within the specified slack factor/slack amount of the best performing run is terminated.

```python
from azureml.train.hyperdrive import BanditPolicy

early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=3)
```
In this experiment, the parameter sampler used is random sampling. In random sampling, hyperparameter values are randomly selected from the given search space. Random sampling is chosen because it supports discrete and continuous hyperparameters as well as early termination of low-performance runs.

```python
from azureml.train.hyperdrive import RandomParameterSampling, choice

param_sampling = RandomParameterSampling({
    "--criterion": choice("gini", "entropy"),
    "--splitter": choice("best", "random"),
    "--max_depth": choice(3, 4, 5, 6, 7, 8, 9, 10),
})
```
In this experiment, the estimator used is SKLearn; it is invoked with the sampled hyperparameters.

```python
from azureml.train.sklearn import SKLearn

estimator = SKLearn(source_directory=".", compute_target=cpu_cluster, entry_script="train.py")
```
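The sampled values are passed to train.py as command-line arguments. The script itself is not reproduced here; a minimal sketch of how it might consume them, assuming the training split X_train, y_train is prepared inside the script:

```python
# train.py (sketch)
import argparse

from azureml.core.run import Run
from sklearn.tree import DecisionTreeClassifier

parser = argparse.ArgumentParser()
parser.add_argument("--criterion", type=str, default="gini")
parser.add_argument("--splitter", type=str, default="best")
parser.add_argument("--max_depth", type=int, default=None)
args = parser.parse_args()

run = Run.get_context()
model = DecisionTreeClassifier(
    criterion=args.criterion, splitter=args.splitter, max_depth=args.max_depth
)
# ... fit on the training split and log the primary metric, e.g.:
# model.fit(X_train, y_train)
# run.log("AUC_weighted", auc)
```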
HyperDrive configuration
Configuration | Value | Explanation |
---|---|---|
hyperparameter_sampling | param_sampling | The hyperparameter sampling space defined above |
policy | early_termination_policy | The early termination policy to use |
primary_metric_name | AUC_weighted | The name of the primary metric reported by the experiment runs |
primary_metric_goal | PrimaryMetricGoal.MAXIMIZE | One of maximize / minimize. It determines if the primary metric has to be minimized/maximized in the experiment runs' evaluation |
max_total_runs | 12 | Maximum number of runs. This is the upper bound |
max_concurrent_runs | 4 | Maximum number of runs to run concurrently. |
estimator | estimator (SKLearn) | An estimator that will be called with sampled hyperparameters |
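A sketch of these settings assembled into a HyperDriveConfig, using the policy, sampler, and estimator defined above:

```python
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal

hyperdrive_config = HyperDriveConfig(
    estimator=estimator,
    hyperparameter_sampling=param_sampling,
    policy=early_termination_policy,
    primary_metric_name="AUC_weighted",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=12,
    max_concurrent_runs=4,
)
```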
The Hyperparameters for the Decision Tree are:
Hyperparameter | Value | Explanation |
---|---|---|
criterion | choice("gini", "entropy") | The function to measure the quality of a split. |
splitter | choice("best", "random") | The strategy used to choose the split at each node. |
max_depth | choice(3,4,5,6,7,8,9,10) | The maximum depth of the tree. |
The best result using HyperDrive was a Decision Tree with the parameter values criterion = 'gini', max_depth = 4, and splitter = 'best'. The AUC_weighted of the best run is 0.7214713617767388.
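A sketch of how this best run and its metrics can be retrieved, assuming `hyperdrive_run` is the submitted HyperDrive run:

```python
# Retrieve the best child run by the primary metric (AUC_weighted)
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.id)
print(best_run.get_metrics())
```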
*Screenshots: output of the RunDetails() widget, visualization of the results, and the best performing model.*
Possible improvements for the HyperDrive experiment:
- Model selection - Apply a different classification algorithm.
- Sampling - Implement other parameter sampling methods over the hyperparameter space, e.g. grid sampling or Bayesian sampling.
- Early termination - Use other early termination policies, such as the median stopping policy or the truncation selection policy, to keep the run time- and cost-efficient while still achieving the best results.
- Resource allocation - Allocate resources differently in terms of max_total_runs, max_duration_minutes, or max_concurrent_runs in the HyperDrive configuration.
The AutoML model achieved an AUC_weighted of 0.83328615, while the HyperDrive model achieved 0.7214713617767388. Since AutoML produced the better result, we will deploy its model.
The workflow for deploying a model in Azure Machine Learning is as follows (a minimal code sketch follows the list):
- Register the model - A registered model is a logical container for one or more files that make up your model. Here we use the registered AutoML model.
- Prepare an inference configuration (unless using no-code deployment) - An inference configuration describes how to set up the web-service containing your model.
- Prepare an entry script (unless using no-code deployment) - The entry script receives data submitted to a deployed web service and passes it to the model. It then takes the response returned by the model and returns that to the client.
- Choose a compute target - The compute target you use to host your model will affect the cost and availability of your deployed endpoint.
- Deploy the model to the compute target - Before deploying your model, you must define the deployment configuration, which is specific to the compute target that will host the web service. The model can then be deployed.
- Test the resulting web service - You can test the model by querying the endpoint with sample input data and receiving a JSON response.
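A minimal sketch of these steps with the Python SDK, assuming the model was registered as 'best-automl-model', a score.py entry script exists, and Azure Container Instances (ACI) is the deployment target; all names are illustrative:

```python
from azureml.core import Environment, Workspace
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

ws = Workspace.from_config()

# 1. The registered AutoML model (name is illustrative)
model = Model(ws, name="best-automl-model")

# 2./3. Inference configuration: entry script plus a curated environment
env = Environment.get(ws, name="AzureML-AutoML")
inference_config = InferenceConfig(entry_script="score.py", environment=env)

# 4./5. Define the deployment configuration and deploy to ACI
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
service = Model.deploy(ws, "attrition-endpoint", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```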
*Screenshot: the deployed endpoint in a Healthy state.*
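Once the deployment is healthy, the endpoint can be tested by posting sample input to the scoring URI. A sketch, reusing `service` from the deployment step above; the feature names follow the dataset columns, and the values and truncated payload are illustrative:

```python
import json

import requests

# One record with the dataset's feature columns; a real request must
# include every feature column the model was trained on.
data = {"data": [{
    "Age": 35,
    "BusinessTravel": "Travel_Rarely",
    "DailyRate": 800,
    "Department": "Research & Development",
    # ... remaining columns omitted for brevity
}]}

response = requests.post(
    service.scoring_uri,
    data=json.dumps(data),
    headers={"Content-Type": "application/json"},
)
print(response.json())
```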
Application Insights is used to detect anomalies, visualize performance, etc. It can be enabled before or after a deployment is created. For this experiment, we enable logging for the deployed model by running the logs.py script.
*Screenshot: logging enabled for the endpoint.*
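The contents of logs.py are not reproduced here; a minimal sketch of what enabling Application Insights and retrieving logs might look like (the service name is illustrative):

```python
from azureml.core import Workspace
from azureml.core.webservice import Webservice

ws = Workspace.from_config()
service = Webservice(ws, name="attrition-endpoint")

# Enable Application Insights on the deployed endpoint
service.update(enable_app_insights=True)

# Print the web service logs
print(service.get_logs())
```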
Link to Screen Recording: https://youtu.be/zp1xjkhsK9k