ACS PUMS 2013-2017 Knowledge Discovery on Spark Architecture

This is part of the AMOD 5410 course project. The problem is to carry out predictive analytics on census data of over 10M records. The emphasis is not on getting the best accuracy but on leveraging the Spark architecture and core API from Python (PySpark) to understand the different constructs in Spark and, most importantly, how they execute via DAGs.
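To make the DAG idea concrete, here is a minimal sketch rather than the project's actual code. It chains two lazy transformations and then counts the `Exchange` operators in the physical plan, each of which roughly corresponds to a shuffle and therefore a stage boundary. The column names `AGEP`, `ST`, and `WAGP` are assumed from the public PUMS data dictionary, and the helper names are hypothetical.

```python
import contextlib
import io
import re


def build_pipeline(df):
    """Chain lazy transformations; Spark only records them in the DAG."""
    # Imported lazily so the pure-Python helpers below work without PySpark.
    from pyspark.sql import functions as F

    return (
        df.filter(F.col("AGEP") >= 18)         # narrow: no data movement
          .groupBy("ST")                        # wide: requires a shuffle
          .agg(F.avg("WAGP").alias("avg_wage"))
    )


def physical_plan_text(df):
    """Capture the text that DataFrame.explain() prints to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain()
    return buffer.getvalue()


def count_shuffles(plan_text):
    """Count standalone Exchange operators (shuffles) in a plan string.

    BroadcastExchange and ReusedExchange are deliberately not counted.
    """
    return len(re.findall(r"\bExchange\b", plan_text))
```

Given a DataFrame `df` loaded from the PUMS files, `count_shuffles(physical_plan_text(build_pipeline(df)))` reports how many shuffles the plan contains. Nothing is computed until an action such as `.show()` or `.count()` is called on the result.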
A Spark standalone cluster was run locally on a system with 4 cores and 16 GB RAM, running CentOS.
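A session for a machine of this size could be configured roughly as follows. This is a sketch under stated assumptions: the master runs on `localhost` at the default standalone port 7077, and 4 GB is held back for the operating system and driver (an arbitrary choice, not a figure from the project). The function names are hypothetical.

```python
def standalone_conf(master_host="localhost", cores=4, total_ram_gb=16, reserved_gb=4):
    """Return Spark settings sized for a single-machine standalone cluster."""
    if reserved_gb >= total_ram_gb:
        raise ValueError("reserved_gb must be smaller than total_ram_gb")
    return {
        # 7077 is the default port of a standalone master.
        "spark.master": f"spark://{master_host}:7077",
        # Upper bound on the cores the application may claim from the cluster.
        "spark.cores.max": str(cores),
        # Assumption: leave reserved_gb for the OS and the driver process.
        "spark.executor.memory": f"{total_ram_gb - reserved_gb}g",
    }


def build_session(app_name, conf):
    """Create (or reuse) a SparkSession with the given settings."""
    # Imported lazily so standalone_conf() can be used without PySpark.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Calling `build_session("acs-pums", standalone_conf())` requires the standalone master and at least one worker to be running already; the application name is only an illustration.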
What you need to install to run the software, and how to install it:
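- Download Spark standalone (the `closer.lua` link returns a mirror-selection page rather than the archive, so fetch the tarball from the Apache archive directly)
wget 'https://archive.apache.org/dist/spark/spark-2.4.5/spark-2.4.5-bin-hadoop2.7.tgz'
- Untar the precompiled Spark distribution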
tar xzvf spark-2.4.5-bin-hadoop2.7.tgz
Install the packages and modules that do not ship with PySpark:
- For python modules - pip
sudo yum install python-pip
- For visualizations
pip install pyspark_dist_explore
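Once the steps above are done, a quick check along these lines can confirm that the pieces are in place. It is only a sketch: `check_environment` is a hypothetical helper, and the location of the unpacked `spark-2.4.5-bin-hadoop2.7` directory is whatever was chosen when untarring.

```python
import importlib.util
import os


def check_environment(spark_home, modules=("pyspark",)):
    """Report whether spark-submit exists and the given modules are importable."""
    report = {
        "spark-submit": os.path.isfile(
            os.path.join(spark_home, "bin", "spark-submit")
        )
    }
    for name in modules:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report
```

For example, `check_environment(os.path.expanduser("~/spark-2.4.5-bin-hadoop2.7"))` returns a dictionary of booleans. Note that `pyspark` from the tarball is only importable if the distribution's `python` directory is on `PYTHONPATH` or PySpark was installed separately. Pass the import name of the visualization package in `modules` to check it as well.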
python <fileName.py>
- Apache Spark - Large scale data processing
- Microsoft Azure - Virtual Machine
- Mohammed Khursheed Ali Khan
This project is licensed under the MIT License - see the LICENSE.md file for details
- Spark Documentation
- Medium
- TowardsDataScience