Spark exercises: Spark RDD, SparkSQL, Spark ML pipelines, and Spark in the cloud (AWS)

BigData-HW-Spark

This repository contains solutions for five Spark exercises.

  1. SparkSQL
  2. Spark RDD
  3. Spark DataFrame and Machine Learning Pipelines -- Gradient Boosted Tree
  4. Spark Application -- Crime Analysis
  5. Spark Application -- Profit Prediction

Directory structure

├── README.md                               <- You are here
├── SparkSQL
│   ├── exercise1.py                        <- python source code file
│   ├── exercise1.png                       <- Output of the Spark Job
│   ├── exercise1-findings.txt              <- Findings
│   ├── Problem_Statement.md                <- Problem Statement
├── SparkRDD
│   ├── exercise2.py                        <- python source code file
│   ├── exercise2.txt                       <- Output of the Spark Job
│   ├── exercise2-findings.txt              <- Findings
│   ├── Problem_Statement.md                <- Problem Statement
├── Spark_Machine_Learning_Pipeline
│   ├── exercise3.py                        <- python source code file
│   ├── exercise3.txt                       <- Output of the Spark Job: Out of sample R Square of the Model
│   ├── Problem_Statement.md                <- Problem Statement
├── Spark_Application_Crime_Analysis
│   ├── exercise4.py                        <- python source code file
│   ├── exercise4.txt                       <- Output of the Spark Job
│   ├── exercise4.png                       <- Output of the Spark Job
│   ├── exercise3-findings.txt              <- Findings
│   ├── Problem_Statement.md                <- Problem Statement
├── Spark_Application_Profit_Prediction
│   ├── exercise5.py                        <- python source code file
│   ├── mape_all.txt                        <- Output of the Spark Job
│   ├── Problem_Statement.md                <- Problem Statement
<!-- tocstop -->
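
Two of the output files above report model-quality metrics: `exercise3.txt` holds an out-of-sample R Square and `mape_all.txt` presumably holds mean absolute percentage errors. As a rough reference for reading those numbers, the sketch below shows how the two metrics are usually defined, in plain Python. It is not the code from `exercise3.py` or `exercise5.py`; the function names `r_squared` and `mape` are hypothetical, and the sample values are made up for illustration.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes the actual values are not all identical (SS_tot > 0).
    """
    mean_actual = sum(actual) / len(actual)
    # Residual sum of squares: error left after the model's predictions.
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Total sum of squares: error of always predicting the mean.
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero.
    """
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    # Illustrative held-out values only, not results from this repository.
    held_out_actual = [100.0, 200.0]
    held_out_predicted = [110.0, 180.0]
    print("R^2 :", r_squared(held_out_actual, held_out_predicted))
    print("MAPE:", mape(held_out_actual, held_out_predicted))
```

"Out of sample" means both lists come from data the model did not see during training. An R Square near 1 and a MAPE near 0 both indicate a close fit.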