- BigQuery: Data is queried from BigQuery.
- Secret Manager: Service account credentials for data access are managed by Secret Manager.
- Python & Pandas: Pandas is used to pull data from BigQuery; Python handles all data processing and cleaning (see the sketch after this list).
- Data Processing, Data Cleaning, Model Training: Data is processed, cleaned, and a model is trained.
- Performance Evaluation, Model Selection: The model's performance is evaluated, and the best model is selected.
- FastAPI: The selected model is wrapped in a FastAPI application.
- Docker: The FastAPI application is containerized using Docker.
- Artifact Registry: The Docker image is stored in the Artifact Registry.
- GitHub: The code is pushed to a GitHub repository.
- GitHub Actions: CI/CD processes are managed by GitHub Actions.
- Test Phase: Model performance tests are run. If tests fail, an email is sent. If tests pass, the model is deployed to Cloud Run, and a URL is generated.
- Cloud Run: The successfully tested model is deployed to Cloud Run and made accessible via a URL.
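For the data-access pieces at the top of this list, here is a minimal sketch of how Secret Manager, BigQuery, and pandas fit together, assuming the secret stores a service-account key as JSON. The project ID, secret name, and table below are placeholders, not values from this repository:

```python
import json

from google.cloud import bigquery, secretmanager
from google.oauth2 import service_account

PROJECT_ID = "your-project-id"    # placeholder
SECRET_ID = "bq-service-account"  # hypothetical secret name

# Fetch the service-account key JSON from Secret Manager.
sm_client = secretmanager.SecretManagerServiceClient()
secret_name = f"projects/{PROJECT_ID}/secrets/{SECRET_ID}/versions/latest"
key_json = sm_client.access_secret_version(
    request={"name": secret_name}
).payload.data.decode("utf-8")

# Build BigQuery credentials from the key, then pull data into a pandas DataFrame.
credentials = service_account.Credentials.from_service_account_info(json.loads(key_json))
bq_client = bigquery.Client(credentials=credentials, project=PROJECT_ID)
df = bq_client.query(
    "SELECT * FROM `your-project-id.dataset.orders`"  # placeholder table
).to_dataframe()
```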
In this project, power users were identified through a detailed process of feature engineering, model selection, and evaluation. The methodology is outlined below:
- Data Augmentation: Additional features were created, such as the average number of products in a basket, to enrich the dataset and improve model performance.
- Target Column: Initially, power users were identified based on a z-score method. However, this method resulted in very few power users, making it insufficient. Therefore, based on the distributions, users who spent more than $110 were identified as power users, forming the basis for the classification target.
- Training Dataset Preparation: The training dataset was adjusted using various oversampling methods to handle class imbalances.
- Oversampling Techniques: Methods like RandomOverSample, SMOTE, and ADASYN were employed to balance the dataset, ensuring that the models could effectively learn from both the majority and minority classes.
- Hyperparameter Optimization: GridSearchCV was used for hyperparameter tuning to find the best model configurations.
- Comparison of Models: Several models were compared on their performance metrics, including KNN, XGBoost, Logistic Regression, and Random Forest. The XGBoost model with ADASYN oversampling showed the best performance (a condensed training sketch follows this section).
| Model | Recall | Precision | Log Loss |
| --- | --- | --- | --- |
| KNN | 51.65% | 88.70% | 6.88% |
| KNN_RandomOverSample | 47.25% | 23.50% | 15.45% |
| KNN_SMOTE | 62.64% | 19.40% | 19.13% |
| KNN_ADASYN | 28.57% | 35.60% | 24.31% |
| XGBoost | 71.43% | 100.00% | 5.23% |
| XGBoost_RandomOverSample | 71.43% | 55.60% | 2.70% |
| XGBoost_SMOTE | 71.43% | 97.00% | 1.41% |
| XGBoost_ADASYN | 71.43% | 100.00% | 1.33% |
| Logistic Regression | 67.03% | 89.70% | 1.21% |
| Logistic Regression_RandomOverSample | 78.02% | 10.50% | 14.46% |
| Logistic Regression_SMOTE | 76.92% | 11.60% | 12.64% |
| Logistic Regression_ADASYN | 84.62% | 6.40% | 24.08% |
| RandomForest | 70.33% | 100.00% | 4.49% |
| RandomForest_RandomOverSample | 63.74% | 87.90% | 5.12% |
| RandomForest_SMOTE | 72.53% | 42.00% | 3.16% |
| RandomForest_ADASYN | 74.73% | 32.70% | 4.43% |
- Threshold Tuning: The model's decision threshold was adjusted to optimize performance. After testing various thresholds, the default value of 0.5 was retained, as it provided the best balance between recall and precision.
- ROC AUC Curve Analysis: The performance of the XGBoost ADASYN model was further validated using the ROC AUC curve, demonstrating strong predictive capabilities.
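The labeling, oversampling, tuning, and evaluation steps above can be condensed into a short sketch. The data source, feature columns, and parameter grid here are illustrative placeholders, not the project's actual values:

```python
import pandas as pd
from imblearn.over_sampling import ADASYN
from sklearn.metrics import log_loss, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBClassifier

# df is assumed to hold the engineered features (e.g. average basket size) plus total spend.
df = pd.read_csv("users.csv")  # placeholder data source
df["is_power_user"] = (df["total_spend"] > 110).astype(int)  # $110 threshold from the distributions

X = df.drop(columns=["is_power_user", "total_spend"])
y = df["is_power_user"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# Oversample only the training split so the test set stays untouched.
X_res, y_res = ADASYN(random_state=42).fit_resample(X_train, y_train)

# Hyperparameter tuning with GridSearchCV (illustrative grid).
grid = GridSearchCV(
    XGBClassifier(eval_metric="logloss"),
    param_grid={"max_depth": [3, 5], "n_estimators": [100, 300]},
    scoring="recall",
    cv=5,
)
grid.fit(X_res, y_res)

# Evaluate with the metrics from the comparison table, at the default 0.5 threshold.
proba = grid.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)
print("Recall:   ", recall_score(y_test, pred))
print("Precision:", precision_score(y_test, pred))
print("Log loss: ", log_loss(y_test, proba))
print("ROC AUC:  ", roc_auc_score(y_test, proba))
```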
- Build the Docker image: Use the Dockerfile provided (a sketch of a possible Dockerfile follows this list) to create an image tagged as `xgboost_adasyn_poweruser_image:v1`.

  ```bash
  docker build -t xgboost_adasyn_poweruser_image:v1 .
  ```
- Run the image locally: Verify that the image was built correctly by running it locally.

  ```bash
  docker run -d -p 8000:8000 --name xgboost_container xgboost_adasyn_poweruser_image:v1
  ```
- Tag the image: Before pushing the image to Docker Hub, tag it with your Docker Hub username and repository name.

  ```bash
  docker tag xgboost_adasyn_poweruser_image:v1 yaseminbellioglu/xgboost_adasyn_poweruser_image:v1
  ```
- Push the image: Upload the image to your Docker Hub repository, making it available for deployment on other machines.

  ```bash
  docker push yaseminbellioglu/xgboost_adasyn_poweruser_image:v1
  ```
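The Dockerfile itself lives in the repository and is not reproduced here; as a rough sketch, a Dockerfile for a FastAPI service like this one might look as follows, assuming a hypothetical `main.py` exposing the app object as `app`:

```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000
# main:app is a hypothetical module path; adjust to the actual FastAPI entry point.
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Once the container is running, FastAPI's interactive docs at `http://localhost:8000/docs` are a quick way to confirm the service is up.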
Cloud Run is used because it provides automatic scaling, simple deployment, and cost efficiency by only charging for actual usage. Unlike VMs, it eliminates the need for manual management and maintenance, allowing for easier integration with other Google Cloud services.
First, create an Artifact Registry repository to store the Docker image. Follow these steps to push your Docker image to Artifact Registry:
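The repository can be created with gcloud; the repository name `xgboost` and region `us-central1` below match the image paths used later in this document:

```bash
gcloud artifacts repositories create xgboost \
  --repository-format=docker \
  --location=us-central1 \
  --description="Docker images for the power-user model"
```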
- Authenticate with Google Cloud: Ensure you are logged in to your Google Cloud account.

  ```bash
  gcloud auth login
  ```
- Set Your Project: Configure your project settings.

  ```bash
  gcloud config set project PROJECT_ID
  ```
- Configure Docker: Use the gcloud command-line tool to authenticate Docker requests to Artifact Registry.

  ```bash
  gcloud auth configure-docker us-central1-docker.pkg.dev
  ```
- Build, Tag, and Push Your Docker Image: Build the image, tag it for Artifact Registry, and push it.

  ```bash
  docker build -t xgboost_adasyn_poweruser_image:v1 .
  docker tag xgboost_adasyn_poweruser_image:v1 us-central1-docker.pkg.dev/psychic-root-424207-s9/xgboost/xgboost_adasyn_poweruser_image:cloudingv1
  docker push us-central1-docker.pkg.dev/psychic-root-424207-s9/xgboost/xgboost_adasyn_poweruser_image:cloudingv1
  ```
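After the push completes, the image can be confirmed in the registry with a quick listing (same repository path as above):

```bash
gcloud artifacts docker images list us-central1-docker.pkg.dev/psychic-root-424207-s9/xgboost
```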
Before setting up the CI/CD pipeline, ensure that the service account used for GitHub Actions has the following IAM roles assigned:
- Artifact Registry Administrator
- Artifact Registry Writer
- BigQuery Admin
- Cloud Run Admin
- Editor
- Secret Manager Secret Accessor
- Service Account User
You can assign these roles in the Google Cloud Console under IAM & Admin > IAM by editing the permissions for your service account.
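The same bindings can also be applied from the command line. The role IDs below correspond to the console names listed above; the service-account email and PROJECT_ID are placeholders:

```bash
SA="github-actions@PROJECT_ID.iam.gserviceaccount.com"  # placeholder service account

for ROLE in \
  roles/artifactregistry.admin \
  roles/artifactregistry.writer \
  roles/bigquery.admin \
  roles/run.admin \
  roles/editor \
  roles/secretmanager.secretAccessor \
  roles/iam.serviceAccountUser
do
  gcloud projects add-iam-policy-binding PROJECT_ID \
    --member="serviceAccount:${SA}" --role="${ROLE}"
done
```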
The CI/CD process for PowerUser CR is managed using GitHub Actions. The workflow is defined in the `.github/workflows/docker-image.yml` file and includes steps for building, testing, and deploying the application.
Here is the complete workflow file:
```yaml
name: CI/CD

on:
  push:
    branches: [ main ]

jobs:
  build_and_test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Repo ## GitHub Actions checks out the source code from the GitHub repository
        uses: actions/checkout@v2

      - name: Set up Python ## Python 3.11 is set up
        uses: actions/setup-python@v2
        with:
          python-version: '3.11'

      - name: Install dependencies ## Required Python packages are installed
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run tests ## The application's tests are executed
        run: |
          python -m pytest

  deploy:
    runs-on: ubuntu-latest
    needs: build_and_test
    steps:
      - name: Checkout Repo ## The source code is checked out again for the deployment job
        uses: actions/checkout@v2

      - name: Set up gcloud CLI ## gcloud CLI is set up for Google Cloud authentication
        uses: google-github-actions/auth@v1
        with:
          project_id: ${{ secrets.PROJECT_ID }}
          credentials_json: ${{ secrets.CREDENTIALS_JSON }}

      - name: Build and push container image ## The Docker image is built and pushed to Artifact Registry
        env:
          PROJECT_ID: ${{ secrets.PROJECT_ID }}
        run: |
          gcloud auth configure-docker us-central1-docker.pkg.dev
          docker build -t us-central1-docker.pkg.dev/${PROJECT_ID}/xgboost/xgboost_adasyn_poweruser_image:cloudingv1 .
          docker push us-central1-docker.pkg.dev/${PROJECT_ID}/xgboost/xgboost_adasyn_poweruser_image:cloudingv1

      - name: Deploy to Cloud Run ## The application is deployed to Cloud Run with the necessary settings
        run: |
          gcloud run deploy xgboost \
            --image=us-central1-docker.pkg.dev/${{ secrets.PROJECT_ID }}/xgboost/xgboost_adasyn_poweruser_image:cloudingv1 \
            --allow-unauthenticated \
            --port=8000 \
            --service-account=${{ secrets.SERVICE_ACCOUNT }} \
            --max-instances=5 \
            --region=us-central1 \
            --project=${{ secrets.PROJECT_ID }}
```
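The `Run tests` step executes `python -m pytest`. The repository's actual tests are not reproduced in this document, but a model performance gate of the kind described in the Test Phase might look like this sketch, where the file name, artifact path, holdout set, and threshold are all hypothetical:

```python
# test_model_performance.py -- hypothetical performance gate; paths and
# thresholds are illustrative, not taken from the actual repository.
import pickle

import pandas as pd
from sklearn.metrics import recall_score


def test_recall_meets_threshold():
    with open("model/xgboost_adasyn.pkl", "rb") as f:  # hypothetical artifact path
        model = pickle.load(f)
    holdout = pd.read_csv("data/holdout.csv")          # hypothetical holdout set
    X, y = holdout.drop(columns=["is_power_user"]), holdout["is_power_user"]
    # A failing assertion fails the job, blocking deployment (and triggering the failure email).
    assert recall_score(y, model.predict(X)) >= 0.70
```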
Use the provided URL to interact with your API. Cloud Run service URLs follow the pattern `https://<service>-<hash>-<region-code>.a.run.app`, so it will look something like `https://xgboost-<hash>-uc.a.run.app`.
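The exact request shape depends on the FastAPI endpoint. Assuming a hypothetical `/predict` route that accepts the model's feature values as JSON, a call might look like:

```bash
# Hypothetical endpoint and feature payload; replace with the service's actual route and fields.
curl -X POST "https://xgboost-<hash>-uc.a.run.app/predict" \
  -H "Content-Type: application/json" \
  -d '{"avg_basket_size": 4.2, "total_orders": 17}'
```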