My portfolio of test tasks from different companies for the Data Engineer role

All the code has been formatted by Black: The Uncompromising Code Formatter

Configured GitHub Actions

  1. Dependabot checks on a weekly basis (see the config sketch below)

  2. After each commit, GitHub workflows run automated checks on the code
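A minimal sketch of the weekly Dependabot configuration; the pip ecosystem and the root directory are assumptions about this repo's layout, not confirmed settings:

```yaml
# .github/dependabot.yml — a sketch, assuming Python dependencies in the repo root
version: 2
updates:
  - package-ecosystem: "pip"   # assumed ecosystem for this portfolio
    directory: "/"             # assumed location of the requirements files
    schedule:
      interval: "weekly"       # matches the weekly cadence mentioned above
```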

Description: calculate PySpark aggregations from the given CSV (see the sketch after the tech list).

Tech:

  • Python
  • Spark
  • CSV
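A minimal sketch of what such a task looks like; the input path and the category/amount column names are illustrative assumptions, not the actual task schema:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("csv-aggregations").getOrCreate()

# Read the input CSV with a header row, letting Spark infer the column types
df = spark.read.csv("data/input.csv", header=True, inferSchema=True)

# Example aggregations: row count and average amount per category
result = df.groupBy("category").agg(
    F.count("*").alias("rows"),
    F.avg("amount").alias("avg_amount"),
)
result.show()
```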

Description: calculate PySpark aggregations from the given Parquet and CSV (see the join sketch after the tech list).

Tech:

  • Python
  • Spark
  • CSV
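A sketch of combining the two formats; the file paths, the id join key, and the region/amount columns are hypothetical:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("parquet-csv-aggregations").getOrCreate()

# Hypothetical inputs: a Parquet fact table and a CSV dimension table
facts = spark.read.parquet("data/facts.parquet")
dims = spark.read.csv("data/dims.csv", header=True, inferSchema=True)

# Join on an assumed shared key, then aggregate per dimension attribute
result = (
    facts.join(dims, on="id", how="inner")
    .groupBy("region")
    .agg(F.sum("amount").alias("total_amount"))
)
result.show()
```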

Description: calculate PySpark aggregations from the given CSV.

Tech:

  • Python
  • Spark
  • CSV

Description:

  • calculate PySpark aggregations from the given Parquet
  • ingest the data into Postgres
  • read the data back from Postgres
  • calculate PySpark aggregations and save the result as CSV (see the end-to-end sketch after the tech list)

Tech:

  • Python
  • Spark
  • Parquet
  • Postgres in Docker with persistent storage
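An end-to-end sketch of the four steps; the JDBC URL, credentials, table and column names are placeholders, and the Postgres JDBC driver is assumed to be on the Spark classpath (e.g. via spark.jars.packages):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("parquet-postgres").getOrCreate()

# Connection details for the dockerized Postgres are placeholders
jdbc_url = "jdbc:postgresql://localhost:5432/demo"
props = {"user": "demo", "password": "demo", "driver": "org.postgresql.Driver"}

# 1. Calculate an aggregation over the Parquet input
agg = (
    spark.read.parquet("data/input.parquet")
    .groupBy("category")
    .agg(F.sum("amount").alias("total"))
)

# 2. Ingest the aggregate into Postgres
agg.write.jdbc(url=jdbc_url, table="aggregates", mode="overwrite", properties=props)

# 3. Read the data back from Postgres
read_back = spark.read.jdbc(url=jdbc_url, table="aggregates", properties=props)

# 4. Aggregate again and save the result as CSV
(
    read_back.agg(F.sum("total").alias("grand_total"))
    .write.csv("out/grand_total", header=True, mode="overwrite")
)
```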

Description:

  • calculate PySpark metrics and dimensions aggregations from the given JSON
  • test the app (see the testable-function sketch after the tech list)

Tech:

  • Python
  • Spark
  • pytest: 91% test coverage according to coverage.py
  • JSON/Parquet
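A sketch of the testable shape such an app can take; the function, the column names, and the tiny fixture are illustrative, not the real task code:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame, SparkSession


def metrics_by_dimension(df: DataFrame, dimension: str, metric: str) -> DataFrame:
    """Sum a metric column per dimension value."""
    return df.groupBy(dimension).agg(F.sum(metric).alias(f"total_{metric}"))


def test_metrics_by_dimension():
    # pytest picks this up; a local session and in-memory rows keep it fast
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame(
        [("web", 2), ("web", 3), ("app", 5)], ["channel", "clicks"]
    )
    rows = {
        r["channel"]: r["total_clicks"]
        for r in metrics_by_dimension(df, "channel", "clicks").collect()
    }
    assert rows == {"web": 5, "app": 5}
```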

A very small PySpark task, so there is no point in splitting it into separate functions and testing them (both steps are sketched below):

  • remove non-ASCII characters
  • drop duplicates based on the dt column
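Both steps fit into one small chain; the input path and the text column name are assumptions:

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cleanup").getOrCreate()
df = spark.read.csv("data/input.csv", header=True)

cleaned = (
    # Strip every non-ASCII character from an assumed text column
    df.withColumn("text", F.regexp_replace("text", r"[^\x00-\x7F]", ""))
    # Keep a single row per dt value
    .dropDuplicates(["dt"])
)
cleaned.show()
```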

Kafka pet project

The project itself lives in a separate GitHub repo. Its purpose is to demonstrate knowledge of Java, Kafka, Prometheus, and Grafana.