My portfolio of test tasks from different companies for the Data Engineer role

All the code has been formatted by Black: The Uncompromising Code Formatter

Configured GitHub Actions

  1. Dependabot checks on a weekly basis (see the config sketch below)

  2. After each commit, GitHub workflows run automated checks on the code
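A minimal sketch of the weekly Dependabot configuration; the pip ecosystem and the root directory are assumptions about this repo's layout, not confirmed settings:

```yaml
# .github/dependabot.yml — a sketch, assuming Python dependencies in the repo root
version: 2
updates:
  - package-ecosystem: "pip"   # assumed ecosystem for this portfolio
    directory: "/"             # assumed location of the requirements files
    schedule:
      interval: "weekly"       # matches the weekly cadence mentioned above
```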

Description: calculate PySpark aggregations from the given CSV (see the sketch after the tech list).

Tech:

  • Python
  • Spark
  • CSV
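A minimal sketch of what such a task looks like; the input path and the category/amount column names are illustrative assumptions, not the actual task schema:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("csv-aggregations").getOrCreate()

# Read the input CSV with a header row, letting Spark infer the column types
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)

# Example aggregations: row count and average amount per category
result = df.groupBy("category").agg(
    F.count("*").alias("rows"),
    F.avg("amount").alias("avg_amount"),
)
result.show()
```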

Description: calculate PySpark aggregations from the given Parquet and CSV (see the join sketch after the tech list).

Tech:

  • Python
  • Spark
  • CSV
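A sketch of combining the two formats; the file paths, the id join key, and the region/amount columns are hypothetical:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("parquet-csv-aggregations").getOrCreate()

# Hypothetical inputs: a Parquet fact table and a CSV dimension table
facts = spark.read.parquet("data/facts.parquet")
dims = spark.read.csv("data/dims.csv", header=True, inferSchema=True)

# Join on an assumed shared key, then aggregate per dimension attribute
result = (
    facts.join(dims, on="id", how="inner")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
result.show()
```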

Description: calculate PySpark aggregations from the given CSV.

Tech:

  • Python
  • Spark
  • CSV

Description:

  • calculate PySpark aggregations from the given Parquet
  • ingest the data into Postgres
  • read the data back from Postgres
  • calculate PySpark aggregations and save the result as CSV (see the end-to-end sketch after the tech list)

Tech:

  • Python
  • Spark
  • Parquet
  • Postgres in Docker with persistent storage
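An end-to-end sketch of the four steps; the JDBC URL, credentials, table and column names are placeholders, and the Postgres JDBC driver is assumed to be on the Spark classpath (e.g. via spark.jars.packages):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("parquet-postgres").getOrCreate()

# Connection details for the dockerized Postgres are placeholders
jdbc_url = "jdbc:postgresql://localhost:5432/demo"
props = {"user": "demo", "password": "demo", "driver": "org.postgresql.Driver"}

# 1. Calculate an aggregation over the Parquet input
agg = (
    spark.read.parquet("data/input.parquet")
    .groupBy("category")
    .agg(F.sum("amount").alias("total"))
)

# 2. Ingest the aggregate into Postgres
agg.write.jdbc(url=jdbc_url, table="aggregates", mode="overwrite", properties=props)

# 3. Read the data back from Postgres
read_back = spark.read.jdbc(url=jdbc_url, table="aggregates", properties=props)

# 4. Aggregate again and save the result as CSV
(
    read_back.agg(F.sum("total").alias("grand_total"))
    .write.csv("out/grand_total", header=True, mode="overwrite")
)
```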

Description:

  • calculate PySpark metrics and dimensions aggregations from the given JSON
  • test the app (see the testable-function sketch after the tech list)

Tech:

  • Python
  • Spark
  • pytest: 91% test coverage according to coverage.py
  • JSON/Parquet
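A sketch of the testable shape such an app can take; the function, the column names, and the tiny fixture are illustrative, not the real task code:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession


def metrics_by_dimension(df: DataFrame, dimension: str, metric: str) -> DataFrame:
    """Sum a metric column per dimension value."""
    return df.groupBy(dimension).agg(F.sum(metric).alias(f"total_{metric}"))


def test_metrics_by_dimension():
    # pytest picks this up; a local session and in-memory rows keep it fast
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [("web", 2), ("web", 3), ("app", 5)], ["channel", "clicks"]
    )
    rows = {
        r["channel"]: r["total_clicks"]
        for r in metrics_by_dimension(df, "channel", "clicks").collect()
    }
    assert rows == {"web": 5, "app": 5}
```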

A very small PySpark task, so there is no point in splitting it into separate functions and testing them (both steps are sketched below):

  • remove non-ASCII characters
  • drop duplicates based on the dt column
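Both steps fit into one small chain; the input path and the text column name are assumptions:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleanup").getOrCreate()
df = spark.read.csv("data/input.csv", header=True)

cleaned = (
    # Strip every non-ASCII character from an assumed text column
    df.withColumn("text", F.regexp_replace("text", r"[^\x00-\x7F]", ""))
    # Keep a single row per dt value
    .dropDuplicates(["dt"])
)
cleaned.show()
```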

Kafka pet project

The project itself lives in a separate GitHub repo. Its purpose is to demonstrate knowledge of Java, Kafka, Prometheus, and Grafana.