This project demonstrates the practical application of Google Cloud's Vertex AI to build, train, and deploy a machine learning model aimed at predicting loan repayment risks using a tabular dataset.
- Project Overview
- Objectives
- Technologies Used
- Project Steps
- Key Terminologies
- Conclusion
- Future Work
- Acknowledgements
- License
This project uses Vertex AI, Google Cloud's unified machine learning platform, to predict loan repayment risks. Vertex AI simplifies the process of training machine learning models by automating feature engineering, model selection, and hyperparameter tuning. The primary goal is to build a classification model that accurately predicts whether a loan applicant will repay their loan or default based on historical data.
- Upload a Dataset to Vertex AI: Prepare and import tabular data into Vertex AI for model training.
- Train a Machine Learning Model with AutoML: Utilize Vertex AI's capabilities to build a classification model.
- Evaluate Model Performance: Understand key evaluation metrics to assess model effectiveness.
- Deploy the Model: (Demonstration Only) Learn the steps to deploy a trained model to an endpoint for serving predictions.
- Authenticate and Make Predictions: Use a Bearer Token to securely interact with the deployed model and obtain predictions.
Objective: Create a dataset in Vertex AI named LoanRisk to store and manage the training data.
Objective: Import the loan risk dataset into Vertex AI from Google Cloud Storage.
Objective: Generate descriptive statistics for each column to better understand the dataset. This helps in understanding the data distribution and identifying anomalies.
- Each column showed detailed analytical charts.
Objective: Train a classification model that predicts whether a customer will repay a loan, using Vertex AI's AutoML.
- Selected **Classification** as the objective, since the model predicts a categorical outcome (0 for repay, 1 for default).
- Explored Advanced Options:
  - Configured how to split the data into training and testing sets.
  - Specified encryption settings.
  - Explored additional optimization objectives and feature engineering options as needed.
- Added Features:
  - Customized which columns to include in training.
- Excluded Irrelevant Features:
  - For instance, ClientID is irrelevant for predicting loan risk, so it was excluded when training the model.
- Set Budget:
  - Entered 1 to allocate 1 node hour for training.
- Enabled Early Stopping:
  - Allowed the training process to halt early once it converged, saving compute resources.
Objective: Understand how to evaluate model performance using Vertex AI's evaluation metrics.
- Precision/Recall Curve:
  - Precision: Measures the accuracy of positive predictions.
  - Recall: Measures the ability to find all positive instances.
  - Trade-off: Adjusting the confidence threshold affects precision and recall. A higher threshold increases precision but decreases recall, and vice versa.
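This trade-off can be made concrete with a short sketch. The scores and labels below are toy values, not the lab's data; raising the confidence threshold increases precision while lowering recall.

```python
# Toy illustration of the precision/recall trade-off.
# Scores are hypothetical probabilities of default (class 1).
def precision_recall(scores, labels, threshold):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.95, 0.80, 0.60, 0.40, 0.20]
labels = [1, 1, 0, 1, 0]
# At threshold 0.5, precision and recall are both 2/3.
print(precision_recall(scores, labels, 0.5))
# At threshold 0.9, precision rises to 1.0 but recall drops to 1/3.
print(precision_recall(scores, labels, 0.9))
```

Sweeping the threshold over all score values traces out the precision/recall curve that Vertex AI displays.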
- Confusion Matrix:
  - True Positives (TP): Correctly predicted positives.
  - True Negatives (TN): Correctly predicted negatives.
  - False Positives (FP): Incorrectly predicted positives.
  - False Negatives (FN): Incorrectly predicted negatives.
  - Usage: Helps visualize the performance of the classification model.
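A minimal sketch of how these four counts are tallied for a binary classifier (toy labels, not the lab's actual results):

```python
# Tally a binary confusion matrix: 1 = default, 0 = repay.
def confusion_matrix(actual, predicted):
    m = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for y, p in zip(actual, predicted):
        if y == 1 and p == 1:
            m["TP"] += 1          # predicted default, actually defaulted
        elif y == 0 and p == 0:
            m["TN"] += 1          # predicted repay, actually repaid
        elif y == 0 and p == 1:
            m["FP"] += 1          # false alarm
        else:
            m["FN"] += 1          # missed default
    return m

actual    = [0, 0, 1, 1, 0, 1]
predicted = [0, 1, 1, 0, 0, 1]
print(confusion_matrix(actual, predicted))  # {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 1}
```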
- Feature Importance:
  - Description: Displays how much each feature contributes to the model's predictions.
  - Visualization: Typically shown as a bar chart where longer bars indicate higher importance.
  - Application: Useful for feature selection and understanding model behavior.
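The bar-chart ordering comes from sorting the attribution values. The values below are hypothetical; the real numbers come from the model's evaluation tab in Vertex AI.

```python
# Hypothetical feature-attribution scores (made-up values for illustration).
attributions = {"income": 0.41, "loan": 0.35, "age": 0.24}

# Sort descending to reproduce the bar-chart order.
ranked = sorted(attributions.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    # Crude text bar: one block per 0.05 of attribution.
    print(f"{name:<8} {'#' * int(score * 20)} {score:.2f}")
```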
Objective: Understand the steps required to deploy your trained model to an endpoint for serving predictions.
- Initiated Deployment.
- Configured Endpoint Details:
  - Endpoint Name: LoanRisk
  - Model Settings:
    - Traffic Splitting: Left as default unless you have specific requirements.
    - Machine Type: Chose e2-standard-8 (8 vCPUs, 32 GiB memory) for robust performance.
    - Explainability Options: Enabled feature attribution to understand feature contributions.
- Ready for Predictions:
  - Once deployed, the model was ready to serve predictions through the endpoint.
Objective: Obtain a Bearer Token to authenticate and authorize requests to the deployed model endpoint.
Objective: Use the Shared Machine Learning (SML) service to make predictions with your trained model.
Steps:
- Open Cloud Shell.
- Set AUTH_TOKEN:
  - Replace INSERT_SML_BEARER_TOKEN with the token you copied earlier:

    ```shell
    export AUTH_TOKEN="INSERT_SML_BEARER_TOKEN"
    ```

- Download and Extract Lab Assets.
- Set the ENDPOINT Variable:
  - Define the endpoint for predictions:

    ```shell
    export ENDPOINT="https://sml-api-vertex-kjyo252taq-uc.a.run.app/vertex/predict/tabular_classification"
    ```

- Set the INPUT_DATA_FILE Variable:
  - Define the input data file:

    ```shell
    export INPUT_DATA_FILE="INPUT-JSON"
    ```

- Review Lab Assets:
  - INPUT-JSON: Contains the data for making predictions.
  - smlproxy: Application used to communicate with the backend.
Steps:
- Understand the INPUT-JSON Structure:
  - The INPUT-JSON file contains the following columns:
    - age: Age of the client.
    - ClientID: Unique identifier for the client.
    - income: Annual income of the client.
    - loan: Loan amount requested.
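The input file can also be generated programmatically rather than edited by hand. A sketch using Python's csv module, reusing the example client row from this walkthrough:

```python
import csv
import io

# Columns expected by the lab's INPUT-JSON file; the client row matches
# the example scenario used later in this walkthrough.
fields = ["age", "ClientID", "income", "loan"]
client = {"age": 30.00, "ClientID": 998, "income": 50000.00, "loan": 20000.00}

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=fields)
writer.writeheader()
writer.writerow(client)
print(buf.getvalue())
```

Writing to a real file instead of a StringIO buffer produces the INPUT-JSON file the smlproxy tool consumes.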
- Initial Prediction Request:
  - Execute the following command to make a prediction:

    ```shell
    ./smlproxy tabular \
      -a $AUTH_TOKEN \
      -e $ENDPOINT \
      -d $INPUT_DATA_FILE
    ```

  - Response:

    ```
    SML Tabular HTTP Response: 2022/01/10 15:04:45 {"model_class":"0","model_score":0.9999981}
    ```

  - Interpretation:
    - model_class: 0 indicates the predicted class (0 for repay, 1 for default).
    - model_score: Confidence score of the prediction.
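The JSON payload at the end of the response line can be parsed directly. A small sketch using the example response above, assuming (as described) that model_class "0" means the client repays:

```python
import json

# Example smlproxy response line from the walkthrough: a log prefix and
# timestamp followed by a JSON payload.
response = ('SML Tabular HTTP Response: 2022/01/10 15:04:45 '
            '{"model_class":"0","model_score":0.9999981}')

# Everything from the first "{" onward is valid JSON.
payload = json.loads(response[response.index("{"):])
label = "repay" if payload["model_class"] == "0" else "default"
print(label, payload["model_score"])  # repay 0.9999981
```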
- Modify INPUT-JSON for a New Scenario:
  - Edit the INPUT-JSON file to test a different loan scenario:

    ```shell
    nano INPUT-JSON
    ```

  - Replace the content with:

    ```
    age,ClientID,income,loan
    30.00,998,50000.00,20000.00
    ```

  - Save and exit.
- Make Another Prediction Request:
  - Execute the prediction command again:

    ```shell
    ./smlproxy tabular \
      -a $AUTH_TOKEN \
      -e $ENDPOINT \
      -d $INPUT_DATA_FILE
    ```

  - Response:

    ```
    SML Tabular HTTP Response: 2022/01/10 15:04:45 {"model_class":"0","model_score":1.0322887E-5}
    ```

  - Interpretation:
    - The low model_score indicates high confidence that this client will repay the loan (model_class: 0).
Steps:
- Create Custom Scenarios:
  - Modify the INPUT-JSON file with different client profiles to see how the model responds.
- Automate Predictions:
  - Script multiple prediction requests by iterating over different input data files.
- Analyze Predictions:
  - Use the prediction results to understand which client profiles are likely to repay loans and which are at risk of defaulting.
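One way to script the automation step (a sketch: the client profiles are invented, and the smlproxy commands are only constructed and printed here, not executed):

```python
import csv
import pathlib
import tempfile

# Hypothetical client profiles to test against the model.
profiles = [
    {"age": 30.0, "ClientID": 998, "income": 50000.0, "loan": 20000.0},
    {"age": 55.0, "ClientID": 999, "income": 24000.0, "loan": 15000.0},
]

outdir = pathlib.Path(tempfile.mkdtemp())
commands = []
for i, profile in enumerate(profiles):
    # Write one CSV input file per scenario.
    path = outdir / f"INPUT-JSON-{i}"
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["age", "ClientID", "income", "loan"])
        writer.writeheader()
        writer.writerow(profile)
    # Build the corresponding prediction command for that file.
    commands.append(f"./smlproxy tabular -a $AUTH_TOKEN -e $ENDPOINT -d {path}")

for cmd in commands:
    print(cmd)
```

Piping each command through a shell (or collecting the responses and parsing their JSON payloads) would turn this into a simple batch-prediction harness.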
- Vertex AI: Google's unified machine learning platform that enables building, deploying, and scaling ML models.
- Classification: A type of supervised learning where the model predicts categorical labels.
- Regression: A type of supervised learning where the model predicts continuous numerical values.
- Precision: The ratio of true positive predictions to the total predicted positives.
- Recall: The ratio of true positive predictions to the actual positives.
- Confusion Matrix: A table used to evaluate the performance of a classification model by comparing actual vs. predicted labels.
- Feature Importance: A metric that indicates how useful each feature was in the construction of the model.
- Endpoint: A deployed model's API interface that allows serving predictions.
- Bearer Token: An authentication token that grants access to protected resources.
- cURL: A command-line tool used to send HTTP requests to APIs.
- Environment Variable: Variables that are set in the operating system to pass configuration information to applications.
- Model Score: Confidence score indicating the probability associated with a prediction.
- Explainable AI: A set of tools and frameworks to help understand and interpret predictions made by machine learning models.
I successfully completed the "Vertex AI: Predicting Loan Risk" project. Here's a summary of what I accomplished:
- Data Preparation: Uploaded and prepared a tabular dataset for machine learning.
- Model Training: Utilized Vertex AI's AutoML to train a classification model predicting loan repayment risk.
- Model Evaluation: Understood how to evaluate model performance using metrics like precision, recall, and the confusion matrix.
- Model Deployment: Learned the steps required to deploy a trained model to an endpoint for serving predictions.
- Authentication: Retrieved a Bearer Token to securely interact with the deployed model.
- Predictions: Made predictions using the deployed model via the SML service, interpreting the results to assess loan repayment risk.
To further enhance this project and my machine learning skills, I plan to:
- Enhance Model Complexity:
  - Experiment with different budget allocations and training times to optimize model performance.
  - Explore feature engineering techniques to improve model accuracy.
- Real-time Predictions:
  - Integrate the deployed model into a web application or service to provide real-time loan risk assessments.
- Model Monitoring:
  - Implement monitoring to track model performance over time and detect any degradation or biases.
- Explore Other Vertex AI Features:
  - Utilize Custom Training and Hyperparameter Tuning for more control over the model training process.
  - Explore Explainable AI to gain deeper insights into model decisions.
- Automate Workflows:
  - Develop automated pipelines using Vertex AI Pipelines to streamline model training and deployment.
- Expand to Other ML Problems:
  - Apply similar methodologies to different classification or regression problems, such as customer churn prediction or sales forecasting.
- Google Cloud Platform: For providing robust and scalable machine learning tools.
- Qwiklabs: For offering hands-on labs that facilitate practical learning experiences.
This project is licensed under the MIT License.