This project performs clickstream analysis for an e-commerce website. The analysis covers the following tasks:
- Getting data from the API and storing it in an S3 bucket
- Ingesting, cleaning, and storing the data using the medallion architecture (a possible S3 layout is sketched below)
- Performing transformations to produce the desired reports
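As a minimal sketch, the three medallion layers might map onto S3 like this. The bucket name and the silver prefix are placeholders; only `raw_data` and `results` come from this project's naming:

```python
# Illustrative S3 layout for the three medallion layers.
# Bucket name and silver prefix are placeholders, not from this repo.
BUCKET = "s3://my-clickstream-bucket"

BRONZE_PATH = f"{BUCKET}/raw_data"  # raw API payloads, as ingested
SILVER_PATH = f"{BUCKET}/silver"    # cleaned data, partitioned by day
GOLD_PATH   = f"{BUCKET}/results"   # finished sales reports
```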
The project is designed to run on Databricks.
- You can install PySpark on your local machine for development by running the following in your terminal (a quick local sanity check is sketched below):

```bash
pip install pyspark
```
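After installing, one quick way to confirm that PySpark works locally is to start a session and run a trivial query. This is just a sanity check; on Databricks, a `spark` session is provided for you automatically:

```python
from pyspark.sql import SparkSession

# Start a local session; on Databricks the `spark` object already exists.
spark = SparkSession.builder.appName("sanity-check").getOrCreate()

# Build a tiny DataFrame and print it to verify the installation.
df = spark.createDataFrame([(1, "click"), (2, "purchase")], ["user_id", "event"])
df.show()

spark.stop()
```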
Once you have set up a Databricks workspace, you can follow these steps:
- Clone this repository into the workspace using the "Repos" tab.
- Navigate to the `src` folder.
- Open `main.py`.
- Assuming you have a cluster set up on Databricks, you can attach this Python script to the cluster and run the code.
If you run into issues creating a cluster, you can find more information here.
The code flow is as follows:
- Update the bronze layer (named `raw_data`) stored in the S3 bucket by calling the API.
- Update the silver layer by performing basic transformations on the transactions table, and store it in S3 partitioned by day.
- Perform transformations to calculate the month-over-month sales for the specified items.
- Store the sales report in the gold layer (named `results`) in the S3 bucket.
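As a rough illustration of the bronze and silver steps, a condensed PySpark sketch might look like the following. The API endpoint, bucket name, and column names (`transaction_id`, `timestamp`) are assumptions for illustration, not the repo's actual code:

```python
import requests
from pyspark.sql import Row, SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
BUCKET = "s3://my-clickstream-bucket"  # placeholder bucket name

# --- Bronze: fetch raw events from the API and append them to raw_data ---
# Endpoint and response shape are placeholders.
events = requests.get("https://api.example.com/clickstream/events").json()
bronze_df = spark.createDataFrame([Row(**e) for e in events])
bronze_df.write.mode("append").json(f"{BUCKET}/raw_data")

# --- Silver: basic cleaning of the transactions table, stored by day ---
raw = spark.read.json(f"{BUCKET}/raw_data")
silver_df = (
    raw.dropDuplicates()
       .filter(F.col("transaction_id").isNotNull())  # assumed column
       .withColumn("day", F.to_date("timestamp"))    # assumed column
)
silver_df.write.mode("overwrite").partitionBy("day").parquet(
    f"{BUCKET}/silver/transactions"
)
```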
The end product is a month-over-month sales report for a specific product, stored in the gold layer (named `results`) in the S3 bucket.
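The month-over-month calculation itself could be sketched with a window function along these lines; the product ID and the `amount` column are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()
BUCKET = "s3://my-clickstream-bucket"  # placeholder bucket name

silver = spark.read.parquet(f"{BUCKET}/silver/transactions")

# Aggregate sales per month for one product; the ID and columns are assumed.
monthly = (
    silver.filter(F.col("product_id") == "P-123")
          .withColumn("month", F.trunc("day", "month"))
          .groupBy("month")
          .agg(F.sum("amount").alias("sales"))
)

# Compare each month to the previous one. An unpartitioned window is fine
# here because the monthly aggregate is tiny.
w = Window.orderBy("month")
report = monthly.withColumn(
    "mom_change_pct",
    (F.col("sales") - F.lag("sales").over(w)) / F.lag("sales").over(w) * 100,
)

report.write.mode("overwrite").parquet(f"{BUCKET}/results/mom_sales")
```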