Credit Card Fraud Pipeline: End-to-End Data Pipeline

Final project: an end-to-end credit card fraud pipeline built on the lambda architecture, providing access to both batch and stream processing.

Business Understanding

Credit score is an important metric for banks to rate the credit performance of their applicants. Banks use the personal information and financial records of credit card applicants to predict whether these applicants will default in the future. From these predictions, the banks then decide whether to issue credit cards to the applicants. The banks are asking us to create an end-to-end pipeline to help them handle this problem. The original datasets and data dictionary can be found here.

Problem Statements

The financial institution is experiencing challenges in managing and analyzing its large volume of credit card applicant data, which makes it difficult to mitigate fraud in credit card applications.

Goal

To mitigate the possibility of fraud among credit card applicants, a data pipeline is created to facilitate data analysis and reporting on application records.

Objective

The objectives of this project are described below:

  • Design and create an end-to-end data pipeline with the lambda architecture
  • Create a data warehouse that integrates all the credit card applicant data from different sources and provides a single source of truth for the institution's analytics needs
  • Create a visualization dashboard to derive insights from the data, which can be used for business decisions and to reach the goal of this project.

Data Pipeline

(Diagram: end-to-end data pipeline with lambda architecture)

Tools

  • Cloud : Google Cloud Platform
  • Infrastructure as Code : Terraform
  • Containerization : Docker, Docker Compose
  • Compute : Virtual Machine (VM) instance
  • Stream Processing : Kafka
  • Orchestration: Airflow
  • Transformation : Spark, dbt
  • Data Lake: Google Cloud Storage
  • Data Warehouse: BigQuery
  • Data Visualization: Google Looker Studio
  • Language : Python

Reproducibility


Data Visualization Dashboard

(Dashboard screenshots)

Google Cloud Usage Billing Report

The data infrastructure used in this project is built entirely on Google Cloud Platform, over a project duration of roughly three weeks, using the following services:

  • Google Cloud Storage (pay for what you use)
  • Virtual Machine (VM) instance (cost is based on vCPU, memory, and disk storage)
  • Google BigQuery (the first terabyte of queries processed each month is free of charge)
  • Google Looker Studio (cost is based on the number of Looker Blocks (data models and visualizations), users, and queries processed per month)

The total cost was around $6 out of the $300 in free credits that GCP provides.

Project Instructions

Clone this repository and enter the directory

git clone https://github.com/archie-cm/final-project-credit-card-fraud-pipeline.git && cd final-project-credit-card-fraud-pipeline

Create a file named "service-account.json" containing your Google service account credentials, and copy the file to the dbt folder:

{
  "type": "service_account",
  "project_id": "[PROJECT_ID]",
  "private_key_id": "[KEY_ID]",
  "private_key": "-----BEGIN PRIVATE KEY-----\n[PRIVATE_KEY]\n-----END PRIVATE KEY-----\n",
  "client_email": "[SERVICE_ACCOUNT_EMAIL]",
  "client_id": "[CLIENT_ID]",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/[SERVICE_ACCOUNT_EMAIL]"
}

Cloud Resource Provisioning with Terraform

  1. Install the gcloud SDK and the Terraform CLI, and create a GCP project. Then create a service account with the Storage Admin, Storage Object Admin, and BigQuery Admin roles. Download the JSON credential and store it as service-account.json. Open terraform/main.tf in a text editor and fill in your GCP project ID.

  2. Enable the IAM API and the IAM Service Account Credentials API in GCP.

  3. Change directory to terraform by executing

cd terraform
  4. Initialize Terraform (set up the environment and install the Google provider)
terraform init
  5. Plan the Terraform infrastructure creation
terraform plan
  6. Create the new infrastructure by applying the Terraform plan
terraform apply
  7. Check the GCP console to see the newly created resources (an optional Python check is sketched below).
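As an alternative to clicking through the console, you can list the newly created resources from Python. This is only a sketch; it assumes the google-cloud-storage and google-cloud-bigquery packages are installed and that GOOGLE_APPLICATION_CREDENTIALS points at the service-account.json created earlier.

# Optional sanity check from Python instead of the GCP console.
# Assumes GOOGLE_APPLICATION_CREDENTIALS points at service-account.json.
from google.cloud import bigquery, storage

storage_client = storage.Client()
print("Buckets:", [b.name for b in storage_client.list_buckets()])

bq_client = bigquery.Client()
print("Datasets:", [d.dataset_id for d in bq_client.list_datasets()])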

Batch Pipeline

  1. Configure dbt in profiles.yml (point it to your BigQuery project and the service-account.json copied into the dbt folder)

  2. Start the batch pipeline with Docker Compose

sudo docker-compose up
  3. Open Airflow with the username and password "airflow" to run the DAG
localhost:8090
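For orientation, the sketch below shows roughly what a batch DAG of this kind can look like. The DAG id, task names, paths, and commands are illustrative assumptions, not the repository's actual DAG.

# Minimal sketch of a batch DAG: a Spark transformation followed by dbt models.
# DAG id, paths, and commands are assumptions for illustration only.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="credit_card_batch_pipeline",   # assumed DAG id
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Transform raw applicant data with Spark (job path is hypothetical)
    spark_transform = BashOperator(
        task_id="spark_transform",
        bash_command="spark-submit /opt/airflow/jobs/transform_applications.py",
    )

    # Build the warehouse models with dbt (project directory is hypothetical)
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/airflow/dbt && dbt run --profiles-dir .",
    )

    spark_transform >> dbt_run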


  4. Open Spark to monitor the Spark master and Spark workers
localhost:8080
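As a rough illustration of the kind of job those workers would run, here is a minimal PySpark sketch. The bucket, file, and column names are hypothetical and not taken from the repository.

# Minimal PySpark sketch: read raw applicant data from the data lake,
# clean it, and write it back as Parquet. Paths and columns are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("credit-card-applicant-transform")
    .getOrCreate()
)

# Read raw applicant records from GCS (hypothetical path)
raw = spark.read.option("header", "true").csv("gs://<your-bucket>/raw/application_record.csv")

# Example cleanup: deduplicate by applicant ID and add an ingestion timestamp
clean = (
    raw.dropDuplicates(["ID"])
       .withColumn("ingested_at", F.current_timestamp())
)

# Write the cleaned data back to the lake as Parquet (hypothetical path)
clean.write.mode("overwrite").parquet("gs://<your-bucket>/clean/application_record/")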


Streaming Pipeline

  1. Enter the kafka directory
cd kafka
  2. Start the streaming pipeline with Docker Compose
sudo docker-compose up
  3. Install the required Python packages
pip install -r requirements.txt
  4. Run the producer to stream the data into the Kafka topic
python3 producer.py
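A minimal sketch of what such a producer can look like is shown below. It assumes the confluent-kafka client, JSON payloads, and hypothetical topic and file names; the repository's producer.py may instead use Avro with the Schema Registry.

# Sketch of a producer that streams applicant records into a Kafka topic.
# Client, topic name, and CSV path are assumptions, not the project's code.
import csv
import json
import time

from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
topic = "credit_card_applications"  # hypothetical topic name

with open("data/application_record.csv", newline="") as f:  # hypothetical path
    for row in csv.DictReader(f):
        # Send one record at a time to simulate a stream of incoming applications
        producer.produce(topic, key=row.get("ID", ""), value=json.dumps(row))
        producer.poll(0)   # serve delivery callbacks
        time.sleep(0.1)    # throttle to mimic real-time arrival

producer.flush()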
  5. Run the consumer to read the data from the Kafka topic and load it into BigQuery
python3 consumer.py
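Similarly, a minimal consumer sketch that reads from the topic and streams rows into BigQuery might look like the following. The topic, consumer group, and table names are hypothetical, and the real consumer.py may batch rows or deserialize Avro instead.

# Sketch of a consumer that loads Kafka messages into BigQuery.
# Topic, group id, and table id are assumptions for illustration only.
import json

from confluent_kafka import Consumer
from google.cloud import bigquery

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "fraud-pipeline-consumer",   # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["credit_card_applications"])

bq = bigquery.Client()  # authenticates via GOOGLE_APPLICATION_CREDENTIALS
table_id = "your-project.credit_card.applications_stream"  # hypothetical table

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        record = json.loads(msg.value())
        # Stream the record into BigQuery; returns a list of per-row errors
        errors = bq.insert_rows_json(table_id, [record])
        if errors:
            print("BigQuery insert errors:", errors)
finally:
    consumer.close()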


  6. Open Confluent Control Center to view the topic
localhost:9021


  7. Open the Schema Registry to view the active schemas
localhost:8081/schemas
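The same information can also be fetched from Python if preferred (requires the requests package):

# Query the Schema Registry REST API instead of opening it in a browser
import requests

print(requests.get("http://localhost:8081/subjects").json())  # registered subjects
print(requests.get("http://localhost:8081/schemas").json())   # schemas, as in the step above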
