Feature-Engineering-with-PySpark

Hadoop Single Node Cluster on Docker.

Description

The objective of this assignment is to implement a distributed system that ingests CSV data, applies transformations and feature engineering, and persists the results in Parquet format. The system consists of the following two modules:

  1. A distributed file system (Hadoop/HDFS cluster) where the CSV dataset file is stored and the resulting Parquet files are persisted.
  2. A Spark cluster that runs on top of Hadoop and processes the CSV data to generate new features, which are then stored as Parquet files (a minimal sketch of such a job is shown after this list).
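The actual feature-engineering code lives in the apps folder of this repository; the sketch below only illustrates the general read-CSV / derive-features / write-Parquet pattern. The column names (amount, category) and the HDFS paths are assumptions made for illustration, not values taken from this repository.

from pyspark.sql import SparkSession, functions as F

# Build a Spark session; on the cluster this is configured by spark-submit.
spark = (SparkSession.builder
         .appName("feature-engineering-sketch")
         .getOrCreate())

# Read the raw CSV dataset from HDFS (path is a placeholder).
df = spark.read.csv("hdfs://namenode:9000/data/input.csv",
                    header=True, inferSchema=True)

# Example engineered features: a log-scaled numeric column and a
# per-category average joined back onto every row.
df = df.withColumn("amount_log", F.log1p(F.col("amount")))
cat_avg = df.groupBy("category").agg(F.avg("amount").alias("category_avg_amount"))
features = df.join(cat_avg, on="category", how="left")

# Persist the engineered features back into HDFS as Parquet.
features.write.mode("overwrite").parquet("hdfs://namenode:9000/data/features.parquet")

spark.stop()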

Getting Started

Dependencies

  • Docker and Docker Compose must be installed.

Installing

  • This is a private repository for personal use.

Executing program

  • Clone this repository with the command:
 git clone https://github.com/ChristinaManara/Feature-Engineering-with-PySpark.git
  • Navigate to the cloned repository with the command:
cd pathTo/Feature-Engineering-with-PySpark
  • Build containers with the command:
bash build_all.sh
  • Deploy an HDFS-Spark cluster with the command (a sketch of how a submitted application connects to this cluster follows these steps):
docker-compose up -d
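Once the stack is up, an application submitted to the cluster only needs its SparkSession pointed at the Spark master and the HDFS NameNode. The snippet below sketches that configuration; the hostnames spark-master and namenode and the ports 7077/9000 are assumptions about the compose service names, so adjust them to whatever docker-compose.yml actually defines.

from pyspark.sql import SparkSession

# Assumed service names and ports from the docker-compose stack.
spark = (SparkSession.builder
         .appName("cluster-smoke-test")
         .master("spark://spark-master:7077")
         .config("spark.hadoop.fs.defaultFS", "hdfs://namenode:9000")
         .getOrCreate())

# Write and read back a tiny DataFrame to confirm HDFS is reachable.
df = spark.createDataFrame([(1, "ok")], ["id", "status"])
df.write.mode("overwrite").parquet("hdfs://namenode:9000/tmp/smoke_test.parquet")
print(spark.read.parquet("hdfs://namenode:9000/tmp/smoke_test.parquet").count())

spark.stop()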

Web UI of Hadoop: http://localhost:9870

Web UI of Spark: http://localhost:8080

You can submit a Word Count or a Feature Engineering job as follows (a generic Word Count sketch is shown after these steps):

  • Navigate to the apps folder with the command:
cd apps
  • Run the following command and a menu will be shown:
bash run.sh

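The menu presumably wraps spark-submit around the selected application; the actual scripts live in the apps folder. As an illustration of the simpler of the two jobs, a generic PySpark Word Count looks roughly like the sketch below (the input path is a placeholder, not a file shipped with this repository).

from pyspark.sql import SparkSession, functions as F

# Generic PySpark Word Count sketch -- the input path is a placeholder.
spark = SparkSession.builder.appName("word-count-sketch").getOrCreate()

lines = spark.read.text("hdfs://namenode:9000/data/input.txt")

counts = (lines
          .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
          .where(F.col("word") != "")
          .groupBy("word")
          .count()
          .orderBy(F.desc("count")))

counts.show(20, truncate=False)
spark.stop()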

Help

Advice for common problems and issues:

Try restarting the containers. If the Hadoop or Spark Web UI does not load right away, wait a few seconds for the deployment to finish.

Authors

Creator:

Christina Manara (christinamanara2@gmail.com)

Version History

  • 0.1
    • Initial Release

Acknowledgments

Inspiration, code snippets, etc.