/pyspark-training

Some exercises with PySpark

DATA ANALYTICS WITH APACHE SPARK

Final exercise of the course "Data Analytics with Apache Spark" by Gaetano Fabiano

Getting Started

  1. Download the IMDB Spoiler Dataset from Kaggle: https://www.kaggle.com/rmisra/imdb-spoiler-dataset
  2. Download the IMDb Dataset from Kaggle: https://www.kaggle.com/ashirwadsangwan/imdb-dataset
  3. Move the downloaded files into the folder "in"

Prerequisites

  • Python
  • PySpark

Running

  • The script main.py extracts the most frequently used words in films whose rating is above the mean and that have more than 100 ratings, and writes them to a txt file.

  • The script main_part_two.py applies k-means clustering to the films based on their rating.
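
The two steps above could be sketched roughly as follows. This is not the code of main.py or main_part_two.py, only a minimal illustration under stated assumptions: that the ratings come from a tab-separated file with the columns tconst, averageRating and numVotes; that the words are taken from a review_text field of a JSON-lines reviews file keyed by movie_id; and that movie_id matches tconst. The function names (tokenize, top_words, cluster_by_rating), the file paths, and the parameters min_votes, limit, k and seed are all hypothetical.

```python
import re

WORD_RE = re.compile(r"[a-z]+")


def tokenize(text):
    """Lowercase a text and return its alphabetic words (plain Python)."""
    return WORD_RE.findall((text or "").lower())


def top_words(spark, ratings_path, reviews_path, min_votes=100, limit=50):
    """Sketch: most frequent review words for films rated above the mean."""
    # Imported here so that tokenize() stays usable without PySpark installed.
    from pyspark.sql import functions as F

    # Assumed layout: TSV with header columns tconst, averageRating, numVotes.
    ratings = spark.read.csv(ratings_path, sep="\t", header=True, inferSchema=True)
    mean_rating = ratings.agg(F.avg("averageRating")).first()[0]

    # Keep films above the mean rating with more than min_votes ratings.
    good = ratings.filter(
        (F.col("averageRating") > mean_rating) & (F.col("numVotes") > min_votes)
    ).select(F.col("tconst").alias("movie_id"))

    # Assumed layout: JSON lines with movie_id and review_text fields.
    reviews = spark.read.json(reviews_path).select("movie_id", "review_text")

    counts = (
        reviews.join(good, "movie_id")
        .rdd.flatMap(lambda row: tokenize(row["review_text"]))
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    # Return the `limit` (word, count) pairs with the highest counts.
    return counts.takeOrdered(limit, key=lambda pair: -pair[1])


def cluster_by_rating(spark, ratings_path, k=3, seed=42):
    """Sketch: k-means on the single feature averageRating."""
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    ratings = spark.read.csv(ratings_path, sep="\t", header=True, inferSchema=True)

    # Spark ML expects the features packed into one vector column.
    assembler = VectorAssembler(inputCols=["averageRating"], outputCol="features")
    features = assembler.transform(ratings)

    kmeans = KMeans(k=k, seed=seed, featuresCol="features", predictionCol="cluster")
    model = kmeans.fit(features)
    return model.transform(features).select("tconst", "averageRating", "cluster")
```

With a session created through SparkSession.builder.getOrCreate(), top_words(spark, ratings_path, reviews_path) would return a list of (word, count) pairs that can then be written to a txt file, and cluster_by_rating(spark, ratings_path) would return a DataFrame with one cluster label per film. The value k=3 is an arbitrary choice for illustration; the README does not state how many clusters the script uses.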