This project performs clickstream analysis for an e-commerce website. The analysis covers the following tasks:
- Getting data from the API and storing it in an S3 bucket
- Ingesting, cleaning, and storing the data using the medallion architecture (a possible S3 layout is sketched below)
- Performing transformations to produce the desired reports
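As a minimal sketch, the three medallion layers might map onto S3 like this. The bucket name and the silver prefix are placeholders; only `raw_data` and `results` come from this project's naming:

```python
# Illustrative S3 layout for the three medallion layers.
# Bucket name and silver prefix are placeholders, not from this repo.
BUCKET = "s3://my-clickstream-bucket"

BRONZE_PATH = f"{BUCKET}/raw_data"  # raw API payloads, as ingested
SILVER_PATH = f"{BUCKET}/silver"    # cleaned data, partitioned by day
GOLD_PATH   = f"{BUCKET}/results"   # finished sales reports
```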
The project is designed to run on Databricks.
- You can install PySpark on your local machine for development by running the following in your terminal (a quick local sanity check is sketched below):

```bash
pip install pyspark
```
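After installing, one quick way to confirm that PySpark works locally is to start a session and run a trivial query. This is just a sanity check; on Databricks, a `spark` session is provided for you automatically:

```python
from pyspark.sql import SparkSession

# Start a local session; on Databricks the `spark` object already exists.
spark = SparkSession.builder.appName("sanity-check").getOrCreate()

# Build a tiny DataFrame and print it to verify the installation.
df = spark.createDataFrame([(1, "click"), (2, "purchase")], ["user_id", "event"])
df.show()

spark.stop()
```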
Once you have set up a Databricks workspace, you can follow these steps:
- Clone this repository into the workspace using the "Repos" tab.
- Navigate to the `src` folder.
- Open `main.py`.
- Assuming you have a cluster set up on Databricks, you can attach this Python script to the cluster and run the code.
If you run into issues creating a cluster, you can find more information here.
The code flow is as follows:
- Update the bronze layer (named `raw_data`) stored in the S3 bucket by calling the API.
- Update the silver layer by performing basic transformations on the transactions table, and store it in S3 partitioned by day.
- Perform transformations to calculate the month-over-month sales for the specified items.
- Store the sales report in the gold layer (named `results`) in the S3 bucket.
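As a rough illustration of the bronze and silver steps, a condensed PySpark sketch might look like the following. The API endpoint, bucket name, and column names (`transaction_id`, `timestamp`) are assumptions for illustration, not the repo's actual code:

```python
import requests
from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
BUCKET = "s3://my-clickstream-bucket"  # placeholder bucket name

# --- Bronze: fetch raw events from the API and append them to raw_data ---
# Endpoint and response shape are placeholders.
events = requests.get("https://api.example.com/clickstream/events").json()
bronze_df = spark.createDataFrame([Row(**e) for e in events])
bronze_df.write.mode("append").json(f"{BUCKET}/raw_data")

# --- Silver: basic cleaning of the transactions table, stored by day ---
raw = spark.read.json(f"{BUCKET}/raw_data")
silver_df = (
    raw.dropDuplicates()
       .filter(F.col("transaction_id").isNotNull())  # assumed column
       .withColumn("day", F.to_date("timestamp"))    # assumed column
)
silver_df.write.mode("overwrite").partitionBy("day").parquet(
    f"{BUCKET}/silver/transactions"
)
```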
The end product is a month-over-month sales report for a specific product, stored in the gold layer (named `results`) in the S3 bucket.
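The month-over-month calculation itself could be sketched with a window function along these lines; the product ID and the `amount` column are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
BUCKET = "s3://my-clickstream-bucket"  # placeholder bucket name

silver = spark.read.parquet(f"{BUCKET}/silver/transactions")

# Aggregate sales per month for one product; the ID and columns are assumed.
monthly = (
    silver.filter(F.col("product_id") == "P-123")
          .withColumn("month", F.trunc("day", "month"))
          .groupBy("month")
          .agg(F.sum("amount").alias("sales"))
)

# Compare each month to the previous one. An unpartitioned window is fine
# here because the monthly aggregate is tiny.
w = Window.orderBy("month")
report = monthly.withColumn(
    "mom_change_pct",
    (F.col("sales") - F.lag("sales").over(w)) / F.lag("sales").over(w) * 100,
)

report.write.mode("overwrite").parquet(f"{BUCKET}/results/mom_sales")
```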