This is a sample PySpark application developed following the OOP programming model and design patterns such as ABC (abstract base class), Factory, and Singleton. The goal behind developing this application is to provide practical, hands-on exposure to PySpark and Python coding. The aim is to grow it into a full PySpark application that uses the following tools to run:
- Docker
- Airflow
- Kubernetes
This repo contains more than one way to do the same thing. For example, to fetch configs and constants, I am using the following approaches (see the sketch after this list):
- a config file with a .cfg/.ini extension, parsed by configparser in config_utils.py
- Constants.py containing a dictionary -- this has been replaced by an Enum
- Constants.py containing an Enum class
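For illustration, a minimal sketch of both approaches; the keys, values, and paths here are hypothetical, not copied from this repo:

```python
import configparser
from enum import Enum

# 1) config_utils.py style: parse a .cfg/.ini file with configparser
def read_config(path: str = "config/pipeline.cfg") -> configparser.ConfigParser:
    parser = configparser.ConfigParser()
    parser.read(path)
    return parser

# 2) Constants.py style: an Enum class (replacing the earlier dictionary)
class Constants(Enum):
    APP_NAME = "pyspark_framework"
    CSV_FORMAT = "csv"
```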
Setting up the current project on the PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:/home/archana/Desktop/git-repo/pyspark_framework
Creating the requirements.txt file:
pip3 freeze > requirements.txt #Python3
To avoid relative-path issues, I have used the pkg_resources API from setuptools. This change was added as part of PR 37.
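A minimal sketch of what that looks like; the package and file names are assumptions, not necessarily the ones used in PR 37:

```python
from pkg_resources import resource_filename

# Resolve the config file relative to the installed package instead of
# relying on the caller's current working directory.
config_path: str = resource_filename("config", "pipeline.cfg")
```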
How to run this app from the command line:
export PYTHONPATH=$PYTHONPATH:/Users/my-user-name/path-to-code-repo
python3 src/app/app.py --job-name air_asia_data_job
To run a job from individual files for testing purposes, use this path for configUtil:
config_path: str = "../config/pipeline.cfg"
All three pipelines run successfully:
python3 src/app/app.py --job-name happiness_index_job
python3 src/app/app.py --job-name bmi_data_job
python3 src/app/app.py --job-name air_asia_data_job
Makefile vs. shell script: https://unix.stackexchange.com/questions/496793/script-or-makefile-to-automate-new-user-creation/497601#497601
Running the packaged job via spark-submit:
spark-submit \
--jars jars/any-jar_0.1-0.1.1.jar \
--py-files datajob.zip \
src/app/app.py \
--job-name air_asia_data_job
While running spark-submit in the Spark Jupyter Docker container, I was getting this error:
Jupyter command file/path not found
The Stack Overflow answer to this problem says it has to do with PYSPARK_DRIVER_PYTHON=jupyter, which should instead be set to PYSPARK_DRIVER_PYTHON=python:
export PYSPARK_DRIVER_PYTHON=python
export PYTHONPATH=$PYTHONPATH:spark_etl/
How to zip Python files for a Spark job? https://janetvn.medium.com/how-to-add-multiple-python-custom-modules-to-spark-job-6a8b943cdbbc
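A minimal Python sketch of building the datajob.zip passed to --py-files above (the src/ layout and output name are assumptions about this repo):

```python
import zipfile
from pathlib import Path

def zip_py_files(src_dir: str = "src", out: str = "datajob.zip") -> None:
    """Zip every .py file under src_dir, keeping paths relative to the repo root."""
    root = Path.cwd()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in (root / src_dir).rglob("*.py"):
            # Store e.g. src/app/app.py so imports resolve the same way as locally.
            zf.write(path, path.relative_to(root).as_posix())

if __name__ == "__main__":
    zip_py_files()
```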
What else to do:
- CI/CD
- Secrets Manager/ Hashicorp Vault
- Spark Streaming
- Spark ML
- Graph Processing
- Python Shiny Graph
- Airflow DAG
- Packaging and Distribution
List of websites that provide real-time data:
- https://data.nasa.gov
- Flight status APIs
- https://developer.accuweather.com
- Twitter APIs
- News APIs
Several of these can also serve as sources of streaming data for Spark Streaming.
How to enforce a schema in a PySpark job? https://github.com/MrPowers/quinn/blob/main/quinn/dataframe_validator.py https://stackoverflow.com/questions/63040466/enforcing-schema-for-pyspark-job
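A minimal sketch of the idea (quinn's validator linked above is more thorough, and the column names here are hypothetical):

```python
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

EXPECTED_SCHEMA = StructType([
    StructField("country", StringType(), True),
    StructField("happiness_score", DoubleType(), True),
])

def validate_schema(df: DataFrame, expected: StructType = EXPECTED_SCHEMA) -> DataFrame:
    # Fail fast instead of letting a bad schema propagate downstream.
    if df.schema != expected:
        raise ValueError(f"Schema mismatch:\nexpected: {expected}\ngot:      {df.schema}")
    return df
```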
How to write a single CSV file without a folder? https://stackoverflow.com/questions/43661660/spark-how-to-write-a-single-csv-file-without-folder/60442604#60442604
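The gist of the linked answer, as a minimal sketch that works on the local filesystem (HDFS/S3 would need the Hadoop FileSystem API instead of shutil):

```python
import glob
import shutil

def write_single_csv(df, final_path: str, tmp_dir: str = "_tmp_csv_out") -> None:
    # Spark always writes a directory; coalesce to one partition, write to a
    # temp directory, then move the single part file to the desired path.
    df.coalesce(1).write.mode("overwrite").option("header", True).csv(tmp_dir)
    part_file = glob.glob(f"{tmp_dir}/part-*.csv")[0]
    shutil.move(part_file, final_path)
    shutil.rmtree(tmp_dir)
```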
http://ocelma.net/MusicRecommendationDataset/lastfm-1K.html
Why should the config directory be outside src?
https://github.com/orgs/community/discussions/25389
What is the Decorator design pattern? Provide at least one example of the Decorator design pattern in this code.
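In short, the Decorator pattern wraps an object (or callable) to add behavior without changing its interface; Python's function decorators are the idiomatic form of this. A generic sketch, not necessarily how this repo applies it:

```python
import functools
import time

def timed(func):
    """Decorator: add timing around a callable without changing its interface."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} finished in {time.perf_counter() - start:.2f}s")
        return result
    return wrapper

@timed
def run_job(job_name: str) -> None:
    ...  # dispatch to the matching pipeline
```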
https://github.com/mehd-io/pyspark-boilerplate-mehdio/blob/master/docker/script/package_zip.sh
The pytest command runs the pytest testing framework on the specified modules.
The -vvv flag sets the verbosity level to "very very verbose", so pytest prints detailed information about each test.
The -rf flag adds a short summary of failed tests at the end of the run, without extra output for passing tests.
The -q flag suppresses extra information when running tests.
The --cov flag enables code coverage measurement with pytest-cov.
The --cov-report term flag specifies that coverage statistics should be printed to the terminal in a simple format.
So when you run make test-unit, it executes the pytest command with these options, running the unit tests and measuring code coverage for the specified modules.
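Putting those flags together, the test-unit target presumably runs something close to the following (the tests/ and src paths are assumptions, and -q would normally not be combined with -vvv):
pytest -vvv -rf --cov=src --cov-report term tests/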
Troubleshooting .gitignore not working: https://stackoverflow.com/questions/25436312/gitignore-not-working
Decorators and OOP features used in this code (a combined sketch follows):
- Using a function as a decorator
- Using the @property decorator
- Implementing @abstractmethod
- Implementing @staticmethod
- Implementing ABC
- Using @dataclass(frozen=True)
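A combined minimal sketch of these features, with hypothetical class and field names:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen=True makes the config immutable (and hashable)
class JobConfig:
    job_name: str
    input_path: str

class Pipeline(ABC):  # ABC: cannot be instantiated directly
    def __init__(self, config: JobConfig):
        self._config = config

    @property
    def job_name(self) -> str:  # read-only attribute access via @property
        return self._config.job_name

    @staticmethod
    def app_name(job_name: str) -> str:  # utility that needs no instance state
        return f"pyspark_framework::{job_name}"

    @abstractmethod
    def run(self) -> None:  # every concrete pipeline must implement run()
        ...
```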