- Capstone Project - Azure Machine Learning Engineer Nanodegree - Kechagias Konstantinos
In this capstone project, I will use the knowledge that I obtained during the Machine Learning Engineer with Microsoft Azure Nanodegree Program to solve the Kaggle Titanic Challenge.
Some passengers of the Titanic were more likely to survive than others. The dataset from Kaggle gives information for 871 passengers, and a column indicates whether they survived or not. My target is to build a model that will predict which passengers survived the Titanic.
Here we do this in two different ways:
- Using AutoML.
- Using a custom model and tuning its hyperparameters with HyperDrive.
Then, I will compare the performance of both models and deploy the best performing one. The deployment is done using the Azure Python SDK and creates an endpoint that can be accessed through a REST API. This step allows any new data to be easily evaluated by the model through the service.
The dataset chosen for this project is the Kaggle Titanic Challenge dataset.
Some passengers of the Titanic were more likely to survive than others. The dataset from Kaggle gives information about 871 passengers, including a column that states whether they survived or not. My target is to build a model that will predict which passengers survived the Titanic.
I will use only the "training" data because this dataset has the "Survived" label, which is necessary for the supervised learning algorithms used in this capstone project.
Find below the data dictionary:
Variable | Definition | Type |
---|---|---|
Survived | Survival | integer |
Pclass | Ticket class | integer |
Name | Name of the passenger | string |
Sex | Passenger Sex | string |
Age | Age | float |
SibSp | # of siblings / spouses aboard the Titanic | integer |
Parch | # of parents / children aboard the Titanic | integer |
Ticket | Ticket number | string |
Fare | Passenger fare | float |
Cabin | Cabin number | string |
Embarked | Port of Embarkation | string |
The data has been uploaded to this repository. To access it in Azure notebooks, we need to download it from an external link into the Azure workspace.
For that, we can use the `Dataset` class, which allows importing tabular data from files on the web.
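As a minimal sketch, assuming the CSV is fetched from a raw GitHub URL and registered under an illustrative name, the loading step could look like this:

```python
from azureml.core import Workspace, Dataset

# Hypothetical raw URL to the training CSV stored in this repository
data_url = "https://raw.githubusercontent.com/<user>/<repo>/master/train.csv"

ws = Workspace.from_config()

# Create a TabularDataset directly from the file on the web
dataset = Dataset.Tabular.from_delimited_files(path=data_url)

# Optionally register it in the workspace so it can be reused across experiments
dataset = dataset.register(workspace=ws, name="titanic-train", create_new_version=True)
```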
I aim to create a model, using Accuracy as the metric, to classify whether a passenger survived the Titanic. I examine two solutions:
- Automated ML: I provided the dataset to AutoML, which automatically performed featurization, tried different algorithms, and tested the performance of the resulting models.
- HyperDrive: I tested a single algorithm and created different models by providing different hyperparameters. The chosen algorithm is Logistic Regression, using the SKLearn framework. Hyperparameter selection was made using HyperDrive.
In both cases, the best performing model created during the runs was saved.
The features that I used in this experiment were the ones described in the data dictionary above. However, in the case of HyperDrive, we manually remove the columns "Name", "Ticket", "Cabin", "Sex", and "Embarked", whose text values the Logistic Regression classifier cannot use without additional encoding, as sketched below.
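A rough sketch of this column removal, assuming the training script works on a pandas DataFrame obtained from the registered dataset:

```python
# Convert the TabularDataset to a pandas DataFrame and drop the text columns
# that the Logistic Regression training script does not encode (illustrative step)
df = dataset.to_pandas_dataframe()
df = df.drop(columns=["Name", "Ticket", "Cabin", "Sex", "Embarked"])
```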
For the AutoML run, I created a compute cluster to run the experiment.
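A hedged sketch of the cluster provisioning, with an assumed cluster name and VM size:

```python
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = "automl-cluster"  # assumed name

try:
    # Reuse the cluster if it already exists in the workspace
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D2_V2", max_nodes=4)
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```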
The constructor of the `AutoMLConfig` class takes the following parameters:
- `task`: type of ML problem to solve, set as `classification`;
- `compute_target`: cluster where the experiment jobs will run;
- `experiment_timeout_minutes`: 20;
- `training_data`: the loaded dataset;
- `label_column_name`: the column that should be predicted, which is "Survived";
- `enable_early_stopping`: makes it possible for AutoML to stop jobs that are not performing well after a minimum number of iterations;
- `path`: the full path to the Azure Machine Learning project folder;
- `featurization`: indicator that the featurization step should be done automatically;
- `debug_log`: the log file to write debug information to;
- `automl_settings`: other settings passed as a dictionary:
  - `max_concurrent_iterations`: the maximum number of iterations executed in parallel, set to 9;
  - `primary_metric`: the metric that Automated Machine Learning will optimize for model selection; we chose to optimize for `Accuracy`.
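A minimal sketch of how this configuration could look in the Python SDK (the dataset, compute target, and file names come from earlier steps and are assumptions):

```python
from azureml.train.automl import AutoMLConfig

# Additional settings passed as a dictionary, as described above
automl_settings = {
    "max_concurrent_iterations": 9,
    "primary_metric": "accuracy",
}

automl_config = AutoMLConfig(
    task="classification",
    compute_target=compute_target,      # the cluster created earlier
    experiment_timeout_minutes=20,
    training_data=dataset,              # the TabularDataset loaded from the repository
    label_column_name="Survived",
    enable_early_stopping=True,
    path=".",                           # project folder (assumed)
    featurization="auto",
    debug_log="automl_errors.log",      # assumed log file name
    **automl_settings,
)
```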
Because AutoML is an automated process that might take a long time, it is a good idea to enable early stopping. This helps minimize cost: with early stopping enabled, AutoML can kill jobs that are not performing well, leading to better resource usage.
Among the many models tried by AutoML, the best one had an accuracy of 83.50%.
The best model was a Voting Ensemble, which uses multiple models as inner estimators, each with its own hyperparameters.
The model created by AutoML was deployed to an endpoint.
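As a rough sketch (the run object `best_run`, the service name, and the scoring script are assumptions), the deployment with the Python SDK could look like this:

```python
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# Register the best AutoML model (names here are illustrative)
model = best_run.register_model(model_name="titanic-automl", model_path="outputs/model.pkl")

# Reuse the environment of the best run for scoring
env = best_run.get_environment()
inference_config = InferenceConfig(entry_script="score.py", environment=env)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, "titanic-automl-service", [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)
```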
The expected input type is a JSON with the following format:
"data":
[
{
"PassengerId": integer,
"Pclass": integer,
"Age": float,
"Sex": string,
"SibSp": integer,
"Parch": integer,
"Fare": float,
"Embarked": string
}
]
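As an illustrative example of calling the endpoint (the scoring URI and passenger values are placeholders), a request could be sent like this:

```python
import json
import requests

scoring_uri = service.scoring_uri  # or paste the URI shown in the studio
headers = {"Content-Type": "application/json"}

payload = json.dumps({
    "data": [
        {
            "PassengerId": 892,
            "Pclass": 3,
            "Age": 34.5,
            "Sex": "male",
            "SibSp": 0,
            "Parch": 0,
            "Fare": 7.83,
            "Embarked": "Q",
        }
    ]
})

response = requests.post(scoring_uri, data=payload, headers=headers)
print(response.json())
```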
Service of the AutoML model with an "Active" deployment state, scoring URI, and Swagger URI. A response from the server is also included.
Logging was enabled on my deployed web service. I used Application Insights to monitor and collect data from the ML web service endpoint. Logging was enabled programmatically, and the code can be found in the Jupyter notebook.
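A minimal sketch of enabling Application Insights programmatically (the service name is assumed):

```python
from azureml.core.webservice import Webservice

# Retrieve the deployed service and turn on Application Insights
service = Webservice(workspace=ws, name="titanic-automl-service")
service.update(enable_app_insights=True)

# Inspect recent logs emitted by the service
print(service.get_logs())
```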
I used the Hyperparameter Tuning tool (HyperDrive) with a Logistic Regression model from the SKLearn framework in order to classify whether a passenger would survive the Titanic. Logistic Regression assumes a linear relationship between the input features and the log-odds of the output. I selected Logistic Regression because it allows me to experiment quickly in the Azure ML environment.
HyperDrive is used to sample different values for two algorithm hyperparameters:
- `C`: inverse of regularization strength
- `max_iter`: maximum number of iterations taken for the solvers to converge
I sample the values using Random Sampling, where hyperparameter values are randomly selected from the defined search space. `C` is drawn from a uniform distribution between 0.001 and 1.0, and `max_iter` is sampled from one of three values: 1000, 10000, and 100000.
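A minimal sketch of this sampling and the HyperDrive configuration (the training script name, environment, and run counts are assumptions):

```python
from azureml.core import ScriptRunConfig
from azureml.train.hyperdrive import (
    HyperDriveConfig, PrimaryMetricGoal, RandomParameterSampling, choice, uniform,
)

# Random sampling over the two hyperparameters described above
param_sampling = RandomParameterSampling({
    "--C": uniform(0.001, 1.0),
    "--max_iter": choice(1000, 10000, 100000),
})

# train.py is the assumed training script; env is an Environment with scikit-learn installed
src = ScriptRunConfig(source_directory=".", script="train.py",
                      compute_target=compute_target, environment=env)

hyperdrive_config = HyperDriveConfig(
    run_config=src,
    hyperparameter_sampling=param_sampling,
    primary_metric_name="Accuracy",
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4,
)
```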
The HyperDrive accuracy was 74.43%, which is not as good as the AutoML run.
The parameters used by this classifier are the following:
- C = 0.9758520032406058
- Max iterations = 100000
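As an illustrative sketch (the run object name `hyperdrive_run` is an assumption), the best run and its hyperparameters can be retrieved like this:

```python
# Pick the run with the highest Accuracy among all HyperDrive child runs
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_metrics())                                  # the logged Accuracy
print(best_run.get_details()["runDefinition"]["arguments"])    # the sampled --C and --max_iter values
```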
I recorded my screen in full screen mode at 1080p and 16:9 aspect ratio. I used OBS for the recording.
There are several ways to improve our AutoML and HyperDrive runs.
Firstly, in both runs we could change the performance metric from `Accuracy` to `AUC_weighted`, for example, which could produce better results.
An improvement for the AutoML run is to choose the best 3-5 algorithms and create another AutoML run with only those algorithms. I could also look at the data that was wrongly classified by the best model and try to identify a pattern that could lead to transformations on it. That can be done by creating a pipeline with a first step that transforms the data and a second one that executes the AutoML run.
An improvement for the HyperDrive run is to test different classifier algorithms in our training script, such as Random Forests and Decision Trees. For each of those algorithms, a different set of hyperparameters can be chosen using either Random Sampling or other sampling methods. Deep Learning algorithms could also be applied to solve this problem, and it would be interesting to look into them.