Gabriele Degola, June 2022
This project simulates a concrete data engineering scenario, covering the generation of business intelligence reports, the delivery of data insights, and applied machine learning.
Solutions are developed in Python on top of Apache Spark, leveraging the RDD API, the DataFrame API, and the MLlib library. To download and install Spark, refer to the official documentation.
Task 2.5 is solved using Apache Airflow.
This git repo is organized as follows:

```
.
├── data/
├── src/
├── out/
└── README.md
```

- `data/` contains the datasets used in the different exercises.
- `src/` contains the source code files, named `task_x_y.py` (solution of part `x`, task `y`). Solutions are described in the associated `README` file.
- `out/` contains output files, named following the same convention.
Three datasets are used in total, one for each part of the challenge:

- `groceries.csv`: shopping transactions, in `csv` format
- `sf-airbnb-clean.parquet`: small version of the AirBnB dataset, in `parquet` format
- `iris.csv`: the classic Iris dataset, in `csv` format
All solutions are designed to be run through the `spark-submit` command on a local Spark cluster with a single worker thread:

```
spark-submit task_x_y.py path/to/input/file.txt path/to/output/file.txt
```
Specific usage instructions are provided in each Python script.
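For instance, a single-threaded local run can be made explicit with the `--master` option (a hypothetical invocation; the task number, input, and output paths shown here are illustrative):

```shell
# Run one task script on a local cluster with a single worker thread.
spark-submit --master "local[1]" src/task_x_y.py data/groceries.csv out/out_x_y.txt
```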