mugnaidaniele/pyspark-training

Some exercises with PySpark

Python

DATA ANALYTICS WITH APACHE SPARK

Final exercise of the course "Data Analytics with Apache Spark" by Gaetano Fabiano

Getting Started

Download the IMDB Spoiler Dataset from Kaggle: https://www.kaggle.com/rmisra/imdb-spoiler-dataset
Download the IMDb Dataset from Kaggle: https://www.kaggle.com/ashirwadsangwan/imdb-dataset
Move the downloaded files into the "in" folder.

Prerequisites

Python
PySpark

Running

The script main.py writes to a txt file the most frequently used words in films whose rating is above the mean rating and which have more than 100 ratings.
The script main_part_two.py applies k-means clustering to the films based on their rating.