- Task content
- Project description
- Technologies and dependencies
- Requirements
- Build instructions
- Status
- Contact
Task content:
User events schema:
- userId: String
- eventId: String
- eventTime: Timestamp
- eventType: String
- attributes: Map[String, String]
Events of the following types form a user engagement session:
- app_open
- search_product
- view_product_details
- purchase
- app_close
Events of the app_open type may contain attributes relevant to marketing analysis:
- campaign_id
- channel_id
Events of the purchase type contain a purchase_id attribute.
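For reference, this event schema could be declared in PySpark roughly as follows (a minimal sketch; the variable name is illustrative and not part of the project):

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, MapType
)

# User event schema as listed above; `attributes` is a free-form
# string-to-string map carrying campaign_id, channel_id, purchase_id, etc.
clickstream_schema = StructType([
    StructField("userId", StringType()),
    StructField("eventId", StringType()),
    StructField("eventTime", TimestampType()),
    StructField("eventType", StringType()),
    StructField("attributes", MapType(StringType(), StringType())),
])
```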
Purchases schema:
- purchaseId: String
- purchaseTime: Timestamp
- billingCost: Double
- isConfirmed: Boolean
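Likewise, a sketch of this schema in PySpark:

```python
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, DoubleType, BooleanType
)

# Purchases schema as listed above
purchases_schema = StructType([
    StructField("purchaseId", StringType()),
    StructField("purchaseTime", TimestampType()),
    StructField("billingCost", DoubleType()),
    StructField("isConfirmed", BooleanType()),
])
```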
Target schema:
- purchaseId: String
- purchaseTime: Timestamp
- billingCost: Double
- isConfirmed: Boolean
- sessionId: String // a session starts with app_open event and finishes with app_close
- campaignId: String // derived from app_open#attributes#campaign_id
- channelId: String // derived from app_open#attributes#channel_id
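One possible way to build this projection is to sessionize the event stream with window functions and then join the purchases in. The sketch below is an assumption about the approach, not the project's actual implementation; `events` and `purchases` are assumed DataFrames loaded with the schemas above:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-user timeline; with an orderBy, the default window frame makes
# sum/last behave as running aggregates up to the current row.
w = Window.partitionBy("userId").orderBy("eventTime")

sessions = (
    events
    # a session starts at every app_open event
    .withColumn("is_open", (F.col("eventType") == "app_open").cast("int"))
    # running count of app_open events = per-user session index
    .withColumn("session_idx", F.sum("is_open").over(w))
    .withColumn("sessionId",
                F.concat_ws("-", F.col("userId"), F.col("session_idx").cast("string")))
    # campaign_id/channel_id are present only on app_open; carry the
    # last non-null value forward through the rest of the session
    .withColumn("campaignId",
                F.last(F.col("attributes")["campaign_id"], ignorenulls=True).over(w))
    .withColumn("channelId",
                F.last(F.col("attributes")["channel_id"], ignorenulls=True).over(w))
)

# Attach the session attribution to purchases via the purchase_id attribute
purchases_attribution = (
    sessions
    .where(F.col("eventType") == "purchase")
    .select(F.col("attributes")["purchase_id"].alias("purchaseId"),
            "sessionId", "campaignId", "channelId")
    .join(purchases, "purchaseId")
    .select("purchaseId", "purchaseTime", "billingCost", "isConfirmed",
            "sessionId", "campaignId", "channelId")
)
```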
Task #2.1. Top Campaigns:
- Top 10 marketing campaigns that bring the biggest revenue (based on billingCost of confirmed purchases)
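Assuming the hypothetical `purchases_attribution` DataFrame from the sketch above, this could be expressed as:

```python
from pyspark.sql import functions as F

# Task #2.1: total revenue of confirmed purchases per campaign, top 10
top_campaigns = (
    purchases_attribution
    .where(F.col("isConfirmed"))
    .groupBy("campaignId")
    .agg(F.sum("billingCost").alias("revenue"))
    .orderBy(F.col("revenue").desc())
    .limit(10)
)
```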
Task #2.2. Channels engagement performance:
- The most popular (i.e. top) channel that drives the highest number of unique sessions (engagements) with the app in each campaign
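Again assuming the sessionized `sessions` DataFrame sketched earlier, one way to express this is a per-campaign ranking:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Task #2.2: unique sessions per (campaign, channel) pair ...
session_counts = (
    sessions
    .groupBy("campaignId", "channelId")
    .agg(F.countDistinct("sessionId").alias("unique_sessions"))
)

# ... then keep the top-ranked channel within each campaign
rank_w = Window.partitionBy("campaignId").orderBy(F.col("unique_sessions").desc())
top_channel_per_campaign = (
    session_counts
    .withColumn("rn", F.row_number().over(rank_w))
    .where(F.col("rn") == 1)
    .drop("rn")
)
```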
Project description:
The project consists of the following directories:
- bigdata-input-generator
  Contains everything needed to generate the sample input data for the marketing analysis.
- configs
  Stores any external configuration parameters required by the Spark jobs, in JSON format.
- dependencies
  General project dependencies (global Spark configuration, logger setup, etc.); a sketch of such a setup follows this list.
- jobs
  The Spark jobs created for the app.
- tests
  The Spark app tests.
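As an illustration of the kind of global Spark setup the dependencies directory might hold (a hypothetical sketch; the function name and settings below are illustrative, not taken from the project):

```python
from pyspark.sql import SparkSession

def get_spark(app_name: str = "marketing-analysis") -> SparkSession:
    """Create (or reuse) a SparkSession with illustrative local defaults."""
    return (
        SparkSession.builder
        .appName(app_name)
        .master("local[*]")  # assumption: jobs run locally during development
        .getOrCreate()
    )
```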
Technologies and dependencies:
- Python 3.8
- PySpark 3.1.2
Requirements:
- Git
- Python 3.8 (or higher) and pip3 (package-management system)
To run the project, follow these steps:
- Open terminal and clone the project from github repository:
$ git clone https://github.com/mkrolczyk12/python-spark.git
$ cd <project_cloned_folder>
where <project_cloned_folder> is the path to the project root directory
- Create and activate virtualenv:
- If the virtualenv package is not installed, run:
$ python3 -m pip install --upgrade pip
$ pip3 install virtualenv
- Then create the environment:
$ python3 -m venv ENV_NAME
where 'ENV_NAME' is the name of the environment
- Activate the virtualenv:
$ source ./ENV_NAME/bin/activate
- Generate sample input data for your project:
- Install the required python packages and run the main script from the bigdata-input-generator directory to generate the datasets
(ENV_NAME)$ pip3 install -r ./bigdata-input-generator/requirements.txt
(ENV_NAME)$ python3 ./bigdata-input-generator/main.py
- Install the project's required dependencies:
(ENV_NAME)$ pip3 install -r ./requirements.txt
- Check that everything works by running the tests:
(ENV_NAME)$ python3 -m tests.test_task1
(ENV_NAME)$ python3 -m tests.test_task2
- Run the main script:
(ENV_NAME)$ python3 -m jobs.main
The app should be ready to use.
Status: completed
Created by @mkrolczyk12 - feel free to contact me!
- E-mail: m.krolczyk66@gmail.com