The objective of this assignment is to implement a distributed system that ingests CSV data, applies transformations and feature engineering, and persists the results in Parquet format. The system consists of the following two modules:
- A distributed file system (Hadoop cluster) where the CSV dataset will be stored and the resulting Parquet files will be persisted.
- A Spark cluster that runs on top of Hadoop and processes the CSV data to generate new features, which are stored in Parquet files.
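The feature-engineering step itself is independent of the cluster plumbing. As an illustrative sketch (plain Python rather than the repository's actual PySpark job; the column names and the derived feature are assumptions), the transformation from CSV rows to a new feature column might look like:

```python
import csv
import io

def add_features(csv_text):
    """Read CSV rows and derive a new feature column.

    Hypothetical example: given 'price' and 'quantity' columns,
    add a 'total' feature. The real job would do this with a
    PySpark DataFrame and write the result to Parquet on HDFS.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row["total"] = float(row["price"]) * int(row["quantity"])
        rows.append(row)
    return rows

sample = "price,quantity\n2.5,4\n1.0,3\n"
print(add_features(sample))
```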
- Docker must be installed.
- This is a private repository for personal use.
- Download this repository with the command:
git clone https://github.com/ChristinaManara/Feature-Engineering-with-PySpark.git
- Navigate to the downloaded repository with the command:
cd pathTo/Feature-Engineering-with-PySpark
- Build containers with the command:
bash build_all.sh
- Deploy an HDFS-Spark cluster with the command:
docker-compose up -d
Web UI of Hadoop: http://localhost:9870
Web UI of Spark: http://localhost:8080
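Deployment can take a few seconds, so a helper that polls a URL until it answers can confirm the UIs are up. A minimal sketch (the ports are the ones listed above; the helper itself is not part of the repository):

```python
import time
import urllib.error
import urllib.request

def wait_for_ui(url, timeout=30.0, interval=2.0):
    """Poll `url` until it returns any HTTP response, or give up."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval):
                return True
        except urllib.error.HTTPError:
            # The server answered (even with an error status), so it is up.
            return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)
    return False

# Example: wait_for_ui("http://localhost:9870") for the Hadoop UI,
#          wait_for_ui("http://localhost:8080") for the Spark UI.
```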
To submit a Word Count or Feature Engineering job:
- Navigate to apps folder with the command:
cd apps
- Run the following command and a menu will be shown:
bash run.sh
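The menu dispatches to one of the two jobs. As a hedged sketch of the idea only (the option numbers and job names here are assumptions, not the actual contents of `run.sh`):

```python
def choose_job(option):
    """Map a menu selection to a job name (hypothetical mapping)."""
    jobs = {"1": "wordcount", "2": "feature_engineering"}
    if option not in jobs:
        raise ValueError(f"unknown menu option: {option!r}")
    return jobs[option]

# choose_job("1") -> "wordcount"; the shell script would then
# spark-submit the corresponding PySpark application.
```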
Advice for common problems and issues:
Try restarting the containers. If the Web UI of Hadoop or Spark does not load at first, wait a few seconds for the deployment to finish.
Creator:
Christina Manara (christinamanara2@gmail.com)
- 0.1
- Initial Release