In this challenge, I solved a data engineering problem: performing complex feature engineering on a large data set. I used the following stack:
- PySpark 2.4 (for all feature engineering)
- Docker (for running a Spark cluster in local mode)
The notebook above is a good reference point if you want to see how pandas and PySpark can be used together to build complex features.