Credit scores are an important metric banks use to rate the credit performance of their applicants. Banks use the personal information and financial records of credit card applicants to predict whether an applicant will default in the future, and from these predictions decide whether to issue a credit card. The bank has asked us to create an end-to-end pipeline to help them handle this problem. The original datasets and data dictionary can be found here.
The financial institution is experiencing challenges in managing and analyzing its large volume of credit card applicant data, which makes it difficult to mitigate fraud among applicants.
To reduce the possibility of fraud from credit card applicants, a data pipeline is created to facilitate analysis and reporting of the application records.
The objectives of this project are described below:
- Design and build an end-to-end data pipeline with a lambda architecture
- Create a data warehouse that can integrate all the credit card applicant data from different sources and provide a single source of truth for the institution's analytics needs
- Create a visualization dashboard that surfaces insights from the data, which can be used for business decisions and to reach the project's goals.
- Cloud: Google Cloud Platform
- Infrastructure as Code: Terraform
- Containerization: Docker, Docker Compose
- Compute: Virtual Machine (VM) instance
- Stream Processing: Kafka
- Orchestration: Airflow
- Transformation: Spark, dbt
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Data Visualization: Looker
- Language: Python
The data infrastructure for this project is built entirely on Google Cloud Platform, over a project duration of roughly three weeks, using the following services:
- Google Cloud Storage (pay for what you use)
- Virtual Machine (VM) instance (cost is based on vCPU, memory, and disk storage)
- Google BigQuery (the first terabyte processed each month is free of charge)
- Google Looker Studio (cost is based on the number of Looker Blocks (data models and visualizations), users, and queries processed per month)
The total cost was around $6 of the $300 free credit that GCP provides.
git clone https://github.com/archie-cm/final-project-credit-card-fraud-pipeline.git && cd final-project-credit-card-fraud-pipeline
Create a file named `service-account.json` containing your Google service account credentials, and copy it to the `dbt` folder:
{
"type": "service_account",
"project_id": "[PROJECT_ID]",
"private_key_id": "[KEY_ID]",
"private_key": "-----BEGIN PRIVATE KEY-----\n[PRIVATE_KEY]\n-----END PRIVATE KEY-----\n",
"client_email": "[SERVICE_ACCOUNT_EMAIL]",
"client_id": "[CLIENT_ID]",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/[SERVICE_ACCOUNT_EMAIL]"
}
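As a quick sanity check before wiring the file into Terraform and dbt, a short Python snippet can confirm the credential file parses and contains the expected fields. This is only a sketch using the standard library; the default path `service-account.json` assumes you run it from the directory holding the file.

```python
import json

# Fields a service-account key file is expected to contain.
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "auth_uri", "token_uri",
}

def check_service_account(path: str = "service-account.json") -> list:
    """Return a sorted list of required fields missing from the key file."""
    with open(path) as f:
        creds = json.load(f)
    return sorted(REQUIRED_FIELDS - creds.keys())
```

An empty list means the file has all the expected fields; any field it returns should be filled in before continuing.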
- Install the `gcloud` SDK and the `terraform` CLI, and create a GCP project. Then, create a service account with the Storage Admin, Storage Object Admin, and BigQuery Admin roles. Download the JSON credential and store it in `service-account.json`. Open `terraform/main.tf` in a text editor and fill in your GCP project ID.
- Enable the IAM API and the IAM Service Account Credentials API in GCP.
- Change directory to `terraform` by executing `cd terraform`
- Initialize Terraform (set up environment and install Google provider)
terraform init
- Plan Terraform infrastructure creation
terraform plan
- Create new infrastructure by applying Terraform plan
terraform apply
- Check GCP console to see newly-created resources.
- Set up dbt in `profiles.yml`
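The dbt project targets BigQuery through the service account created earlier. A minimal sketch of what `profiles.yml` might contain is shown below; the profile name, dataset, thread count, and location are assumptions for illustration, not values taken from the repository.

```yaml
# profiles.yml — minimal sketch; adjust names to match dbt_project.yml
credit_card_pipeline:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /path/to/service-account.json
      project: "[PROJECT_ID]"
      dataset: credit_card_applicants
      threads: 4
      location: US
```

The profile name at the top must match the `profile` key in the repository's `dbt_project.yml`.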
- Create the batch pipeline with Docker Compose
sudo docker-compose up
- Open Airflow at `localhost:8090` (username and password: "airflow") to run the DAG
- Open the Spark UI at `localhost:8080` to monitor the Spark master and workers
- Enter the `kafka` directory
cd kafka
- Create the streaming pipeline with Docker Compose
sudo docker-compose up
- Install required Python packages
pip install -r requirements.txt
- Run the producer to stream the data into the Kafka topic
python3 producer.py
- Run the consumer to consume the data from the Kafka topic and load it into BigQuery
python3 consumer.py
- Open Confluent Control Center at `localhost:9021` to view the topic
- Open the Schema Registry at `localhost:8081/schemas` to view the active schemas
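The actual `producer.py` and `consumer.py` live in the repository; as a rough illustration of the pattern they follow, the sketch below encodes applicant records as JSON for a Kafka topic using the `kafka-python` package. The topic name, record fields, and broker address are assumptions for illustration, not values from the repo.

```python
import json

TOPIC = "credit-card-applications"  # assumed topic name

def serialize_record(record: dict) -> bytes:
    """Encode one applicant record as UTF-8 JSON for the Kafka topic."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def deserialize_record(payload: bytes) -> dict:
    """Decode a message consumed from the topic back into a dict."""
    return json.loads(payload.decode("utf-8"))

def send_application(record: dict, bootstrap: str = "localhost:9092") -> None:
    """Send one record to the topic.

    Requires a running broker and `pip install kafka-python`; the
    bootstrap address is an assumption about the Compose setup.
    """
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap)
    producer.send(TOPIC, serialize_record(record))
    producer.flush()
```

Keeping serialization in a pure function, separate from the network call, makes the encoding easy to unit-test without a broker.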