Udacity Final Project
Here we perform binary classification on the Stroke Prediction dataset. The goal is to train models on the dataset, find the best one, deploy it as a web service, and run inference against the endpoint.

First, we need to choose a dataset from an external source; in my case I have chosen the dataset from Kaggle, loaded it with the code written in the notebooks, and registered it. Next, we are going to create the two best models, one from HyperDrive with customised hyperparameters and one chosen by AutoML, and, depending on the optimal parameters and fit, we are going to deploy the best model as a web service. The web service can then be accessed through its REST API to consume and test the endpoints.
Since I was using the workspace provided by Udacity, the workspace environment and compute cluster were already created for me. One needs an Azure subscription and credentials in order to access the Azure portal.
According to the World Health Organization (WHO), stroke is the 2nd leading cause of death globally, responsible for approximately 11% of total deaths.
This dataset is used to predict whether a patient is likely to get a stroke based on input parameters like gender, age, various diseases, and smoking status. Each row in the data provides relevant information about a patient. The dataset is from Kaggle.
The task we are going to perform here is to find the optimal model through HyperDrive and AutoML, deploy that best model for binary classification, and consume the endpoints.
Attribute | Description |
---|---|
id | Unique identifier |
gender | Male, Female or Other |
age | Age of the patient |
hypertension | 0 indicates no hypertension, 1 indicates hypertension |
heart_disease | 0 is for No, 1 is for Yes |
ever_married | Yes or No |
work_type | Govt_job, Never_worked, Private or Self-employed |
Residence_type | Rural or Urban |
avg_glucose_level | Indicates the average glucose level in the blood |
bmi | Body mass index |
smoking_status | smokes, Unknown, formerly smoked, never smoked |
stroke | 1 if the patient had a stroke, else 0 |
```python
from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

found = False
key = 'strokeDataset'
description_text = "Prediction of Stroke"

if key in ws.datasets.keys():
    found = True
    dataset = ws.datasets[key]

if not found:
    # Load the CSV stored in my GitHub repo as a tabular dataset
    example = 'https://raw.githubusercontent.com/123manju900/Capstone-AzureML/main/stroke-prediction-dataset.csv'
    dataset = Dataset.Tabular.from_delimited_files(example)
```
For accessing the dataset, we can run the command above. I have stored the dataset in my GitHub repo and accessed it from there.
For registering the dataset:
```python
dataset = dataset.register(workspace=ws,
                           name=key,
                           description=description_text)
```
`hd-experiment` is the experiment submitted through HyperDrive and `Auto-stoke` is the experiment submitted through AutoML.
- `experiment_timeout_minutes`: Here I have given 30 minutes of time to run all the algorithms.
- `max_concurrent_iterations`: Given according to the maximum nodes allocated to the compute cluster.
- `n_cross_validations`: Number of splits of the data while training the model.
- `task`: Set to `classification`, since we are performing binary classification here.
- `label_column_name`: `stroke`, as we are trying to predict whether a person has suffered a stroke or not.
- `enable_early_stopping`: Enabled in order to avoid unnecessary usage of compute.
- `featurization`: Set to `auto`, where it will automatically identify the type of featurization according to the data.
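A minimal sketch of how these settings could come together in an `AutoMLConfig`; the compute target variable, the split count, and the `primary_metric` choice are my assumptions, so the project's exact values may differ:

```python
from azureml.train.automl import AutoMLConfig

automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 4,   # assumed: matches the max nodes of the cluster
    "n_cross_validations": 5,         # assumed split count
    "primary_metric": "accuracy",     # assumed primary metric
    "enable_early_stopping": True,
    "featurization": "auto",
}

automl_config = AutoMLConfig(task="classification",
                             training_data=dataset,          # the registered stroke dataset
                             label_column_name="stroke",
                             compute_target=compute_target,  # assumed: an existing compute cluster
                             **automl_settings)
```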
The AutoML settings are submitted and the RunDetails widget is run:
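A hedged sketch of that step, reusing the experiment name from the section above:

```python
from azureml.core import Experiment
from azureml.widgets import RunDetails

# Experiment name taken from the section above
automl_experiment = Experiment(ws, "Auto-stoke")
automl_run = automl_experiment.submit(automl_config, show_output=True)

# Live view of the child runs inside the notebook
RunDetails(automl_run).show()
automl_run.wait_for_completion()
```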
Widget showing all the successful runs of AutoML
Here we can see the screenshot of the AutoML run
The best model I got is VotingEnsemble
List of other algorithms along with best model
Parameters of voting ensemble
Graphs representing accuracy and other metrics of the voting ensemble
Graph showing accuracy of voting ensemble
Other metrics about the best model
For the HyperDrive part, I have run HyperDrive along with the train.py file. Since it is a binary classification problem, I have chosen logistic regression, as it works well for binary classification.
Train.py
In this file I have specified the dataset URL, which I have stored on my GitHub, and done some featurization. Columns like Residence_type, gender, ever_married and work_type were categorical in nature, so I have encoded them into numeric types, since logistic regression doesn't support categorical variables.
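As a sketch of that featurization step; the helper name `clean_data` and the exact encoding are my assumptions, so the real train.py may differ:

```python
import pandas as pd

def clean_data(df: pd.DataFrame):
    """Encode categorical columns so logistic regression gets numeric inputs only."""
    df = df.dropna()
    # Columns listed in the dataset table above; smoking_status is categorical too
    for col in ["gender", "ever_married", "work_type",
                "Residence_type", "smoking_status"]:
        df[col] = df[col].astype("category").cat.codes
    x = df.drop(columns=["id", "stroke"])  # drop the identifier and the label
    y = df["stroke"]
    return x, y
```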
Parameters
RandomParameterSampling:

This sampling can be used for both discrete and continuous hyperparameters.

Parameters I have taken in RandomParameterSampling:
C: This indicates the inverse of regularization strength. Regularization penalizes model complexity to reduce over-fitting; a smaller C value indicates stronger regularization.
max_iter: This indicates the maximum number of iterations the solver performs while fitting the model.
The policy I have used is BanditPolicy. Bandit policy is based on a slack factor/slack amount and an evaluation interval; it ends runs when the primary metric isn't within the specified slack factor/slack amount of the most successful run.

slack_factor: defines the slack allowed with respect to the best performing training run.
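A sketch of this sampling and policy; the search ranges and the policy values are my assumptions, not necessarily the project's exact settings:

```python
from azureml.train.hyperdrive import RandomParameterSampling, BanditPolicy, uniform, choice

# Search space for the two hyperparameters described above (assumed ranges)
param_sampling = RandomParameterSampling({
    "--C": uniform(0.1, 1.0),                 # inverse regularization strength
    "--max_iter": choice(50, 100, 150, 200),  # iteration budget for the solver
})

# Stop runs whose primary metric falls outside the slack of the best run so far
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)
```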
The SKLearn estimator creates an estimator for training in scikit-learn experiments.
The max_total_runs I have used here is 30, for better model training, and max_concurrent_runs is set according to the maximum number of nodes allocated to the compute cluster.
After passing the required parameters to the HyperDriveConfig (sketched below), I have submitted the run; here are the screenshots of the HyperDrive experiment.
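A minimal sketch of that configuration under assumptions: `source_directory`, the logged metric name `"Accuracy"`, and `max_concurrent_runs=4` are mine, while the experiment name follows the section above:

```python
from azureml.core import Experiment
from azureml.train.sklearn import SKLearn
from azureml.train.hyperdrive import HyperDriveConfig, PrimaryMetricGoal
from azureml.widgets import RunDetails

# Estimator pointing at the training script; source_directory is an assumption
estimator = SKLearn(source_directory=".",
                    entry_script="train.py",
                    compute_target=compute_target)

hyperdrive_config = HyperDriveConfig(estimator=estimator,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination_policy,
                                     primary_metric_name="Accuracy",  # assumed: metric logged by train.py
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=30,
                                     max_concurrent_runs=4)           # assumed: matches cluster nodes

hd_run = Experiment(ws, "hd-experiment").submit(hyperdrive_config)
RunDetails(hd_run).show()
```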
Widget showing successful runs
Screenshot showing the completed status of HyperDrive experiment
Graphs related to the runs
Here we can see the accuracy is 0.94
Here we can see the best parameters: C is 0.89 and max_iter is 150.
Here one may feel that the HyperDrive run has performed better than AutoML, but there is something we need to consider. Looking at the regularization factor, C is 0.89, which indicates relatively weak regularization for this algorithm; although it may have given better accuracy on this dataset, it is likely to fail on similar data. The high maximum number of iterations also suggests the model is over-fitted. So, to come to a conclusion, VotingEnsemble is the optimal and best algorithm here. A voting ensemble combines more than one algorithm for prediction and predicts using a voting count, which means it has low variance with respect to the dataset.
For deploying a model, we first register the best model. For registering the best model we can run this code:
```python
automodel = best_run.register_model(model_name='automl_model',
                                    model_path='outputs/model.pkl',
                                    tags={'Method': 'AutoML'})
print(automodel)
```
Once the model is registered, we also need the score.py and env.yml files for deployment:

- score.py: contains all the required scoring configuration for the deployed model
- env.yml: contains the environment and supporting libraries needed to run the model

We can download them using the following code:
```python
# Download scoring file
best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'score.py')

# Download environment file
best_run.download_file('outputs/conda_env_v_1_0_0.yml', 'env.yml')
```
Now we are going to pass these files to the InferenceConfig, which holds all the configuration required to deploy the model on the cloud, and finally deploy it using AciWebservice:
```python
from azureml.core import Environment
from azureml.core.model import InferenceConfig
from azureml.core.webservice import AciWebservice

# Build the environment from the env.yml downloaded above
env = Environment.from_conda_specification(name='automl-env', file_path='env.yml')
script_file = 'score.py'

inference_config = InferenceConfig(entry_script=script_file, environment=env)
aci_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                memory_gb=1,
                                                enable_app_insights=True,
                                                auth_enabled=True)
aci_service_name = 'automl-webservice1'
```
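A hedged sketch of the deployment call itself, using the registered model and the configs defined above:

```python
from azureml.core.model import Model

# Deploy the registered AutoML model as an ACI web service
service = Model.deploy(workspace=ws,
                       name=aci_service_name,
                       models=[automodel],
                       inference_config=inference_config,
                       deployment_config=aci_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```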
It takes a few minutes to deploy the web service, and we can see the web service URL in the above picture.
Displaying service Token
Display of the deployed service in the Endpoints section
Webservice showing it is in healthy state
While deploying the service, I have enabled `enable_app_insights = True`, which gives valuable information regarding the deployed model.
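With Application Insights enabled, operational logs can also be pulled straight from the service; a minimal check, assuming the `service` object from the deployment step:

```python
# Pull operational logs from the deployed web service
print(service.get_logs())
```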
Consuming the REST API
TEST
For sending an inference request and testing the model, we can run the following code:
```python
import requests

# `sample_json` holds the input loaded earlier; `key` is the service's primary key
headers = {'Content-Type': 'application/json'}
headers['Authorization'] = f'Bearer {key}'

response = requests.post(service.scoring_uri, sample_json, headers=headers)

# For viewing the results
print(response.text)
```
The JSON data that was loaded in the earlier step is sent for inferencing using the requests library. Since I have enabled key-based authentication, I first provide the primary key for authentication and then query the endpoint with the sample input.
Service delete
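Once testing is complete, the web service can be removed to free up resources; a one-line sketch assuming the `service` object from above:

```python
# Delete the deployed web service
service.delete()
```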
Future improvements:

- Enable ONNX conversion and deploy the model
- Allow more time for AutoML training and check for accuracy
- Use SMOTE on the dataset before HyperDrive and check the metrics
- Train on more data and test the model
- Deploy the model on Azure IoT Edge