- pyenv (recommended)
root/
|-- dags/
| |-- project.py
|-- src/
| |-- project/
| |-- |-- common/
| |-- |-- |-- spark.py
| |-- |-- jobs/
| |-- |-- transformations/
| |-- app.py
|-- tests/
| |-- common/
| |-- | -- spark.py
| Dockerfile
| setup.py
The main Python module contains the ETL job app.py
. By default app.py
accepts a number of arguments:
--date
the execution date--env
the environment we are executing in--jobs
one or more jobs that needs to be executed
In building your Python application and its dependencies for production, you want to make sure that your builds are predictable and deterministic. Therefore, always pin your dependencies. You can read more in the article: Better package management
When using pip-tools to manage dependencies, you define your dependencies in the requirements.in
file.
This file can then be compiled into the requirements.txt
file by running the command pip-compile requirements.in
from your shell.
This compilation step makes sure every dependency gets pinned in the requirements.txt
file,
ensuring that project won't break because of transitive dependencies being silently updated.
When a dependency does need to be updated, you can update the requirements.in
file and re-compile it.
With this method, package updates always happen as a conscious decision by the developer.
The pip-compile
command should be run from the same virtual environment as your project so conditional dependencies that require a specific Python version,
or other environment markers, resolve relative to your project's environment.
If you want to run another job in your spark application create a file like app.py. You should:
- Use argparse (or something similar) to parse argument to pass to your job
- Have a main function that can be called
- Make sure you have
if __name__ == "__main__"
construct in your file like below - Use your job file in the dag
The following python snippet makes sure that if you call this module from the command lind that the main() function will be executed:
if __name__ == "__main__":
main()
Setup virtual environment:
pyenv local
to use a correct python versionpython -m venv venv
to create a virtual environmentsource ./venv/bin/activate
to activate the virtual environmentpip install pip-tools
to install pip tools
Tasks:
pip install -r requirements.txt
to install dependenciespip install -r dev-requirements.txt
to install development dependenciespip install -e .
to install the project in editable modepython -m pytest --cov=src tests
runs all the tests and check coveragepython -m black dags src tests --check
checks PEP8 compliance issuespython -m black dags src tests
fixes PEP8 compliance issuespip-compile requirements.in
if you add new requirements this regenerates a new requirements.txtpip-compile dev-requirements.in
if you add new requirements this regenerates a new dev-requirements.txt, you should also do this when have updated your requirements.in