In this challenge, I solved a data engineering problem: performing complex feature engineering on a large data set. I used the following stack:
- PySpark 2.4 (for all feature engineering)
- Docker (for running a Spark cluster in local mode)
The notebook above is a good reference point if you want to see how pandas and PySpark can be used together to build complex features.