Data Engineering 2024 project for DataEngineer-Zoomcamp. Developed an end-to-end data pipeline for Fashion Campus, an Indonesian e-commerce website.
Data is extracted from Kaggle - https://www.kaggle.com/datasets/latifahhukma/fashion-campus/data?select=transactions.csv
Information about the data - Fashion Campus is an e-commerce fashion company, established in Indonesia in early 2016, that targets "Indonesian Young Urbans" (young people aged 15-35). The company offers a catalog of local and international brands popular among young people. Because the data is static, the data pipeline runs as a one-time process. The dataset contains 4 CSV files:
- Clickstream
- Transactions
- Product
- Customer
The goal is to develop a data architecture from the raw Fashion Campus data using Google Cloud Platform. The data is extracted from Kaggle; initial data ingestion and workflow orchestration are done through Mage. The final ETL pipeline is developed in DBT. Once the data is stored in the warehouse (BigQuery), visualization for the business is done through Looker.
Cloud:
- Google Cloud Platform
Data Ingestion (batch):
- Mage
- Batch data ingestion is done through Mage, which makes it easy to handle large volumes of data; the data is stored in the data lake in batches.
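As a rough illustration of what batch ingestion means here, the sketch below reads a CSV in fixed-size batches using only the Python standard library. It is not the project's Mage block: the function name `iter_batches`, the batch size, and the sample columns are assumptions made for this example.

```python
import csv
import io
from typing import Dict, Iterator, List, TextIO


def iter_batches(handle: TextIO, batch_size: int) -> Iterator[List[Dict[str, str]]]:
    """Yield lists of at most `batch_size` rows from an open CSV file handle."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Dict[str, str]] = []
    for row in csv.DictReader(handle):
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


# Hypothetical sample standing in for transactions.csv; the columns are assumed.
SAMPLE_CSV = (
    "booking_id,customer_id,total_amount\n"
    "1,10,5000\n"
    "2,11,7500\n"
    "3,10,1200\n"
)

if __name__ == "__main__":
    for i, rows in enumerate(iter_batches(io.StringIO(SAMPLE_CSV), batch_size=2)):
        print(f"batch {i}: {len(rows)} rows")
```

In a real pipeline, each yielded batch would be written to the data lake before the next one is read, so memory use stays bounded by the batch size.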
Data Lake:
- Google Cloud Storage
- After the data is ingested and processed in Mage, it is stored in Google Cloud Storage. Because it is a cloud platform, the data is easy to access for further processing.
Data Transformations and Processing:
- DBT is used to develop the ETL for the data. Staging tables are built for each file and then joined into a fact table.
- Further dimensions are created as required, and the data is then pushed into the data warehouse in batches.
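The staging-then-fact pattern can be sketched with Python's built-in `sqlite3` module. This is only an in-memory illustration, not the project's DBT models. Every table and column name here (`raw_transactions`, `stg_customers`, `fct_orders`, `home_location`, and so on) is an assumption made for the example.

```python
import sqlite3


def load_sample(conn: sqlite3.Connection) -> None:
    """Load a few hypothetical raw rows, all stored as text like a CSV import."""
    conn.executescript(
        """
        CREATE TABLE raw_transactions (booking_id TEXT, customer_id TEXT, total_amount TEXT);
        INSERT INTO raw_transactions VALUES ('1', '10', '5000'), ('2', '11', '7500'), ('3', '10', '1200');

        CREATE TABLE raw_customers (customer_id TEXT, home_location TEXT);
        INSERT INTO raw_customers VALUES ('10', 'Jakarta'), ('11', 'Bandung');
        """
    )


def build_fact_table(conn: sqlite3.Connection) -> None:
    """Create staging views over the raw tables, then join them into a fact table."""
    conn.executescript(
        """
        -- Staging layer: rename and cast raw columns (names are assumed).
        CREATE VIEW stg_transactions AS
        SELECT CAST(booking_id AS INTEGER)  AS order_id,
               CAST(customer_id AS INTEGER) AS customer_id,
               CAST(total_amount AS REAL)   AS order_amount
        FROM raw_transactions;

        CREATE VIEW stg_customers AS
        SELECT CAST(customer_id AS INTEGER) AS customer_id,
               home_location                AS customer_city
        FROM raw_customers;

        -- Fact layer: one row per order, enriched with customer attributes.
        CREATE TABLE fct_orders AS
        SELECT t.order_id, t.customer_id, c.customer_city, t.order_amount
        FROM stg_transactions AS t
        LEFT JOIN stg_customers AS c ON t.customer_id = c.customer_id;
        """
    )


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    load_sample(connection)
    build_fact_table(connection)
    for row in connection.execute("SELECT * FROM fct_orders ORDER BY order_id"):
        print(row)
```

The left join keeps every order even when no matching customer row exists, which is a common choice for fact tables; whether the project's models do the same is not stated in this README.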
Data Warehousing:
- Google BigQuery
- Data from both the dev and prod environments is stored in BigQuery. This makes it easy to write ad hoc SQL queries and also provides the data for visualization in Looker.
Dashboarding:
Check out the dashboards below:
- Fashion Campus Order Analysis - https://lookerstudio.google.com/u/0/reporting/bd6e5b38-1d02-4395-9b30-395046c28f68/page/OoIxD?s=kgU0Du65M1k
- Fashion Campus Product Details - https://lookerstudio.google.com/s/kQqG5WWwpHo
Next steps:
- Creating a CI/CD pipeline for DBT, so that changes can be merged easily through git.
- Developing further clickstream visualizations to support customer retention.
- Developing further dimensions in the ETL architecture to generate niche data.