The objective of this assignment is to implement a distributed system that ingests CSV data, applies transformations and feature engineering, and persists the results in Parquet format. The system consists of the following two modules:
- A distributed file system (Hadoop cluster) where the CSV dataset will be stored and the resulting Parquet files will be persisted.
- A Spark cluster that runs on top of Hadoop and processes the CSV data to generate new features, which are stored in Parquet files.
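The feature-engineering step itself is independent of the cluster plumbing. As an illustrative sketch (plain Python rather than the repository's actual PySpark job; the column names and the derived feature are assumptions), the transformation from CSV rows to a new feature column might look like:

```python
import csv
import io

def add_features(csv_text):
    """Read CSV rows and derive a new feature column.

    Hypothetical example: given 'price' and 'quantity' columns,
    add a 'total' feature. The real job would do this with a
    PySpark DataFrame and write the result to Parquet on HDFS.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row["total"] = float(row["price"]) * int(row["quantity"])
        rows.append(row)
    return rows

sample = "price,quantity\n2.5,4\n1.0,3\n"
print(add_features(sample))
```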
- Docker must be installed.
- This is a private repository for personal use.
- Download this repository with the command:
git clone https://github.com/ChristinaManara/Feature-Engineering-with-PySpark.git
- Navigate to the downloaded repository with the command:
cd pathTo/Feature-Engineering-with-PySpark
- Build containers with the command:
bash build_all.sh
- Deploy an HDFS-Spark cluster with the command:
docker-compose up -d
Web UI of Hadoop: http://localhost:9870
Web UI of Spark: http://localhost:8080
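Deployment can take a few seconds, so a helper that polls a URL until it answers can confirm the UIs are up. A minimal sketch (the ports are the ones listed above; the helper itself is not part of the repository):

```python
import time
import urllib.error
import urllib.request

def wait_for_ui(url, timeout=30.0, interval=2.0):
    """Poll `url` until it returns any HTTP response, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered (even with an error status), so it is up.
            return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)
    return False

# Example: wait_for_ui("http://localhost:9870") for the Hadoop UI,
#          wait_for_ui("http://localhost:8080") for the Spark UI.
```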
To submit a Word Count or Feature Engineering job:
- Navigate to apps folder with the command:
cd apps
- Run the following command and a menu will be shown:
bash run.sh
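The menu dispatches to one of the two jobs. As a hedged sketch of the idea only (the option numbers and job names here are assumptions, not the actual contents of `run.sh`):

```python
def choose_job(option):
    """Map a menu selection to a job name (hypothetical mapping)."""
    jobs = {"1": "wordcount", "2": "feature_engineering"}
    if option not in jobs:
        raise ValueError(f"unknown menu option: {option!r}")
    return jobs[option]

# choose_job("1") -> "wordcount"; the shell script would then
# spark-submit the corresponding PySpark application.
```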
Advice for common problems and issues:
Try restarting the containers. If the Web UI of Hadoop or Spark does not load at first, wait a few seconds for the deployment to finish.
Creator:
Christina Manara (christinamanara2@gmail.com)
- 0.1
- Initial Release