Udacity Final Project
Here we perform binary classification on the Stroke Prediction dataset. The goal is to train models on the dataset, find the best one, deploy it as a web service, and run inference against the endpoint.

First, we need to choose a dataset from an external source; in my case I have chosen the dataset from Kaggle, loaded it with the code written in the notebooks, and registered it. Next, we are going to create the two best models, one from HyperDrive with customised hyperparameters and one chosen by AutoML, and, depending on the optimal parameters and fit, we are going to deploy the best model as a web service. The web service can then be accessed through its REST API to consume and test the endpoints.
Since I was using the workspace provided by Udacity, the workspace environment and compute cluster were already created for me. One needs an Azure subscription and credentials in order to access the Azure portal.
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to get a stroke based on input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient. The dataset is from Kaggle.
The task we are going to perform here is to find the optimal model through HyperDrive and AutoML, deploy that best model for binary classification, and consume the endpoints.
Attribute | Description |
---|---|
id | Unique identifier |
gender | Male, Female or Other |
age | Age of the patient |
hypertension | 0 indicates no hypertension, 1 indicates hypertension |
heart_disease | 0 is for No, 1 is for Yes |
ever_married | Yes or No |
work_type | Govt_job, Never_worked, Private or Self-employed |
Residence_type | Rural or Urban |
avg_glucose_level | Indicates the average glucose level in the blood |
bmi | Body mass index |
smoking_status | smokes, Unknown, formerly smoked, never smoked |
stroke | 1 if the patient had a stroke, else 0 |
```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

found = False
key = 'strokeDataset'
description_text = "Prediction of Stroke"

if key in ws.datasets.keys():
    found = True
    dataset = ws.datasets[key]

if not found:
    # Load the CSV stored in my GitHub repo as a tabular dataset
    example = 'https://raw.githubusercontent.com/123manju900/Capstone-AzureML/main/stroke-prediction-dataset.csv'
    dataset = Dataset.Tabular.from_delimited_files(example)
```
For accessing the dataset, we can run the command above. I have stored the dataset in my GitHub repo and accessed it from there.
For registering the dataset:
```python
dataset = dataset.register(workspace=ws,
                           name=key,
                           description=description_text)
```
`hd-experiment` is the experiment submitted through HyperDrive and `Auto-stoke` is the experiment submitted through AutoML.
- `experiment_timeout_minutes`: Here I have given 30 minutes of time to run all the algorithms.
- `max_concurrent_iterations`: Given according to the maximum nodes allocated to the compute cluster.
- `n_cross_validations`: Number of splits of the data while training the model.
- `task`: Set to `classification`, since we are performing binary classification here.
- `label_column_name`: `stroke`, as we are trying to predict whether a person has suffered a stroke or not.
- `enable_early_stopping`: Enabled in order to avoid unnecessary usage of compute.
- `featurization`: Set to `auto`, where it will automatically identify the type of featurization according to the data.
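A minimal sketch of how these settings could come together in an `AutoMLConfig`; the compute target variable, the split count, and the `primary_metric` choice are my assumptions, so the project's exact values may differ:

```python
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,   # assumed: matches the max nodes of the cluster
    "n_cross_validations": 5,         # assumed split count
    "primary_metric": "accuracy",     # assumed primary metric
    "enable_early_stopping": True,
    "featurization": "auto",
}

automl_config = AutoMLConfig(task="classification",
                             training_data=dataset,          # the registered stroke dataset
                             label_column_name="stroke",
                             compute_target=compute_target,  # assumed: an existing compute cluster
                             **automl_settings)
```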
The AutoML settings are submitted and the RunDetails widget is run:
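A hedged sketch of that step, reusing the experiment name from the section above:

```python
from azureml.core import Experiment
from azureml.widgets import RunDetails

# Experiment name taken from the section above
automl_experiment = Experiment(ws, "Auto-stoke")
automl_run = automl_experiment.submit(automl_config, show_output=True)

# Live view of the child runs inside the notebook
RunDetails(automl_run).show()
automl_run.wait_for_completion()
```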
Widget showing all the successful runs of AutoML
Here we can see the screenshot of the AutoML run
The best model I got is VotingEnsemble
List of other algorithms along with best model
Parameters of voting ensemble
Graphs representing accuracy and other metrics of the voting ensemble
Graph showing accuracy of voting ensemble
Other metrics about the best model
For the HyperDrive part, I have run HyperDrive along with the train.py file. Since it is a binary classification problem, I have chosen logistic regression, as it works well for binary classification.
Train.py
In this file I have specified the dataset URL, which I have stored on my GitHub, and done some featurization. Columns like Residence_type, gender, ever_married and work_type were categorical in nature, so I have encoded them into numeric types, since logistic regression doesn't support categorical variables.
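As a sketch of that featurization step; the helper name `clean_data` and the exact encoding are my assumptions, so the real train.py may differ:

```python
import pandas as pd

def clean_data(df: pd.DataFrame):
    """Encode categorical columns so logistic regression gets numeric inputs only."""
    df = df.dropna()
    # Columns listed in the dataset table above; smoking_status is categorical too
    for col in ["gender", "ever_married", "work_type",
                "Residence_type", "smoking_status"]:
        df[col] = df[col].astype("category").cat.codes
    x = df.drop(columns=["id", "stroke"])  # drop the identifier and the label
    y = df["stroke"]
    return x, y
```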
Parameters
RandomParameterSampling:

This sampling can be used for both discrete and continuous hyperparameters.

Parameters I have taken in RandomParameterSampling:
C: This indicates the inverse of regularization strength. Regularization penalizes model complexity to reduce over-fitting; a smaller C value indicates stronger regularization.
max_iter: This indicates the maximum number of iterations the solver performs while fitting the model.
The policy I have used is BanditPolicy. Bandit policy is based on a slack factor/slack amount and an evaluation interval; it ends runs when the primary metric isn't within the specified slack factor/slack amount of the most successful run.

slack_factor: defines the slack allowed with respect to the best performing training run.
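A sketch of this sampling and policy; the search ranges and the policy values are my assumptions, not necessarily the project's exact settings:

```python
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, uniform, choice

# Search space for the two hyperparameters described above (assumed ranges)
param_sampling = RandomParameterSampling({
    "--C": uniform(0.1, 1.0),                 # inverse regularization strength
    "--max_iter": choice(50, 100, 150, 200),  # iteration budget for the solver
})

# Stop runs whose primary metric falls outside the slack of the best run so far
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
```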
The SKLearn estimator creates an estimator for training in scikit-learn experiments.
The max_total_runs I have used here is 30, for better model training, and max_concurrent_runs is set according to the maximum number of nodes allocated to the compute cluster.
After passing the required parameters to the HyperDriveConfig (sketched below), I have submitted the run; here are the screenshots of the HyperDrive experiment.
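A minimal sketch of that configuration under assumptions: `source_directory`, the logged metric name `"Accuracy"`, and `max_concurrent_runs=4` are mine, while the experiment name follows the section above:

```python
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal
from azureml.widgets import RunDetails

# Estimator pointing at the training script; source_directory is an assumption
estimator = SKLearn(source_directory=".",
                    entry_script="train.py",
                    compute_target=compute_target)

hyperdrive_config = HyperDriveConfig(estimator=estimator,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination_policy,
                                     primary_metric_name="Accuracy",  # assumed: metric logged by train.py
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=30,
                                     max_concurrent_runs=4)           # assumed: matches cluster nodes

hd_run = Experiment(ws, "hd-experiment").submit(hyperdrive_config)
RunDetails(hd_run).show()
```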
Widget showing successful runs
Screenshot showing the completed status of HyperDrive experiment
Graphs related to the runs
Here we can see the accuracy is 0.94
Here we can see the best parameters: C is 0.89 and max_iter is 150.
Here one may feel that the HyperDrive run has performed better than AutoML, but there is something we need to consider. Looking at the regularization factor, C is 0.89, which indicates relatively weak regularization for this algorithm; although it may have given better accuracy on this dataset, it is likely to fail on similar data. The high maximum number of iterations also suggests the model is over-fitted. So, to come to a conclusion, VotingEnsemble is the optimal and best algorithm here. A voting ensemble combines more than one algorithm for prediction and predicts using a voting count, which means it has low variance with respect to the dataset.
For deploying a model, we first register the best model. For registering the best model we can run this code:
```python
automodel = best_run.register_model(model_name='automl_model',
                                    model_path='outputs/model.pkl',
                                    tags={'Method': 'AutoML'})
print(automodel)
```
Once the model is registered, we also need the score.py and env.yml files for deployment:

- score.py: contains all the required scoring configuration for the deployed model
- env.yml: contains the environment and supporting libraries needed to run the model

We can download them using the following code:
```python
# Download scoring file
best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'score.py')

# Download environment file
best_run.download_file('outputs/conda_env_v_1_0_0.yml', 'env.yml')
```
Now we are going to pass these files to the InferenceConfig, which holds all the configuration required to deploy the model on the cloud, and finally deploy it using AciWebservice:
```python
from azureml.core import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# Build the environment from the env.yml downloaded above
env = Environment.from_conda_specification(name='automl-env', file_path='env.yml')
script_file = 'score.py'

inference_config = InferenceConfig(entry_script=script_file, environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                memory_gb=1,
                                                enable_app_insights=True,
                                                auth_enabled=True)
aci_service_name = 'automl-webservice1'
```
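A hedged sketch of the deployment call itself, using the registered model and the configs defined above:

```python
from azureml.core.model import Model

# Deploy the registered AutoML model as an ACI web service
service = Model.deploy(workspace=ws,
                       name=aci_service_name,
                       models=[automodel],
                       inference_config=inference_config,
                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```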
It takes a few minutes to deploy the web service, and we can see the web service URL in the above picture.
Displaying service Token
Display of the deployed service in the Endpoints section
Webservice showing it is in healthy state
While deploying the service, I have enabled `enable_app_insights = True`, which gives valuable information regarding the deployed model.
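With Application Insights enabled, operational logs can also be pulled straight from the service; a minimal check, assuming the `service` object from the deployment step:

```python
# Pull operational logs from the deployed web service
print(service.get_logs())
```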
Consuming the REST API
TEST
For sending an inference request and testing the model, we can run the following code:
```python
import requests

# `sample_json` holds the input loaded earlier; `key` is the service's primary key
headers = {'Content-Type': 'application/json'}
headers['Authorization'] = f'Bearer {key}'

response = requests.post(service.scoring_uri, sample_json, headers=headers)

# For viewing the results
print(response.text)
```
The JSON data that was loaded in the earlier step is sent for inferencing using the requests library. Since I have enabled key-based authentication, I first provide the primary key for authentication and then query the endpoint with the sample input.
Service delete
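Once testing is complete, the web service can be removed to free up resources; a one-line sketch assuming the `service` object from above:

```python
# Delete the deployed web service
service.delete()
```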
Future improvements:

- Enable ONNX conversion and deploy the model
- Allow more time for AutoML training and check for accuracy
- Use SMOTE on the dataset before HyperDrive and check the metrics
- Train on more data and test the model
- Deploy the model on Azure IoT Edge