Credit scores are an important metric banks use to rate the credit performance of their applicants. Banks use the personal information and financial records of credit card applicants to predict whether an applicant will default in the future, and from these predictions decide whether to issue a credit card. The bank has asked us to create an end-to-end pipeline to help them handle this problem. The original datasets and data dictionary can be found here.
The financial institution is experiencing challenges in managing and analyzing its large volume of credit card applicant data, which makes it difficult to mitigate fraud among applicants.
To reduce the possibility of fraud from credit card applicants, a data pipeline is created to facilitate analysis and reporting of the application records.
The objectives of this project are described below:
- Design and build an end-to-end data pipeline with a lambda architecture
- Create a data warehouse that can integrate all the credit card applicant data from different sources and provide a single source of truth for the institution's analytics needs
- Create a visualization dashboard that surfaces insights from the data, which can be used for business decisions and to reach the project's goals.
- Cloud: Google Cloud Platform
- Infrastructure as Code: Terraform
- Containerization: Docker, Docker Compose
- Compute: Virtual Machine (VM) instance
- Stream Processing: Kafka
- Orchestration: Airflow
- Transformation: Spark, dbt
- Data Lake: Google Cloud Storage
- Data Warehouse: BigQuery
- Data Visualization: Looker
- Language: Python
The data infrastructure for this project is built entirely on Google Cloud Platform, over a project duration of roughly three weeks, using the following services:
- Google Cloud Storage (pay for what you use)
- Virtual Machine (VM) instance (cost is based on vCPU, memory, and disk storage)
- Google BigQuery (the first terabyte processed each month is free of charge)
- Google Looker Studio (cost is based on the number of Looker Blocks (data models and visualizations), users, and queries processed per month)
The total cost was around $6 of the $300 free credit that GCP provides.
git clone https://github.com/archie-cm/final-project-credit-card-fraud-pipeline.git && cd final-project-credit-card-fraud-pipeline
Create a file named `service-account.json` containing your Google service account credentials, and copy it to the `dbt` folder:
{
"type": "service_account",
"project_id": "[PROJECT_ID]",
"private_key_id": "[KEY_ID]",
"private_key": "-----BEGIN PRIVATE KEY-----\n[PRIVATE_KEY]\n-----END PRIVATE KEY-----\n",
"client_email": "[SERVICE_ACCOUNT_EMAIL]",
"client_id": "[CLIENT_ID]",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/[SERVICE_ACCOUNT_EMAIL]"
}
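As a quick sanity check before wiring the file into Terraform and dbt, a short Python snippet can confirm the credential file parses and contains the expected fields. This is only a sketch using the standard library; the default path `service-account.json` assumes you run it from the directory holding the file.

```python
import json

# Fields a service-account key file is expected to contain.
REQUIRED_FIELDS = {
    "type", "project_id", "private_key_id", "private_key",
    "client_email", "client_id", "auth_uri", "token_uri",
}

def check_service_account(path: str = "service-account.json") -> list:
    """Return a sorted list of required fields missing from the key file."""
    with open(path) as f:
        creds = json.load(f)
    return sorted(REQUIRED_FIELDS - creds.keys())
```

An empty list means the file has all the expected fields; any field it returns should be filled in before continuing.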
- Install the `gcloud` SDK and the `terraform` CLI, and create a GCP project. Then, create a service account with the Storage Admin, Storage Object Admin, and BigQuery Admin roles. Download the JSON credential and store it in `service-account.json`. Open `terraform/main.tf` in a text editor and fill in your GCP project ID.
- Enable the IAM API and the IAM Service Account Credentials API in GCP.
- Change directory to `terraform` by executing `cd terraform`
- Initialize Terraform (set up environment and install Google provider)
terraform init
- Plan Terraform infrastructure creation
terraform plan
- Create new infrastructure by applying Terraform plan
terraform apply
- Check GCP console to see newly-created resources.
- Set up dbt in `profiles.yml`
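The dbt project targets BigQuery through the service account created earlier. A minimal sketch of what `profiles.yml` might contain is shown below; the profile name, dataset, thread count, and location are assumptions for illustration, not values taken from the repository.

```yaml
# profiles.yml — minimal sketch; adjust names to match dbt_project.yml
credit_card_pipeline:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account
      keyfile: /path/to/service-account.json
      project: "[PROJECT_ID]"
      dataset: credit_card_applicants
      threads: 4
      location: US
```

The profile name at the top must match the `profile` key in the repository's `dbt_project.yml`.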
- Create the batch pipeline with Docker Compose
sudo docker-compose up
- Open Airflow at `localhost:8090` (username and password: "airflow") to run the DAG
- Open the Spark UI at `localhost:8080` to monitor the Spark master and workers
- Enter the `kafka` directory
cd kafka
- Create the streaming pipeline with Docker Compose
sudo docker-compose up
- Install required Python packages
pip install -r requirements.txt
- Run the producer to stream the data into the Kafka topic
python3 producer.py
- Run the consumer to consume the data from the Kafka topic and load it into BigQuery
python3 consumer.py
- Open Confluent Control Center at `localhost:9021` to view the topic
- Open the Schema Registry at `localhost:8081/schemas` to view the active schemas
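The actual `producer.py` and `consumer.py` live in the repository; as a rough illustration of the pattern they follow, the sketch below encodes applicant records as JSON for a Kafka topic using the `kafka-python` package. The topic name, record fields, and broker address are assumptions for illustration, not values from the repo.

```python
import json

TOPIC = "credit-card-applications"  # assumed topic name

def serialize_record(record: dict) -> bytes:
    """Encode one applicant record as UTF-8 JSON for the Kafka topic."""
    return json.dumps(record, sort_keys=True).encode("utf-8")

def deserialize_record(payload: bytes) -> dict:
    """Decode a message consumed from the topic back into a dict."""
    return json.loads(payload.decode("utf-8"))

def send_application(record: dict, bootstrap: str = "localhost:9092") -> None:
    """Send one record to the topic.

    Requires a running broker and `pip install kafka-python`; the
    bootstrap address is an assumption about the Compose setup.
    """
    from kafka import KafkaProducer

    producer = KafkaProducer(bootstrap_servers=bootstrap)
    producer.send(TOPIC, serialize_record(record))
    producer.flush()
```

Keeping serialization in a pure function, separate from the network call, makes the encoding easy to unit-test without a broker.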