Movie-Dataset-Analysis-using-PySpark

Project Overview

This project performs exploratory data analysis (EDA) and visualization of a movie dataset using PySpark. It answers the questions listed below, which cover the dataset's structure, missing data, rating statistics, movie categories, and yearly movie counts.

Questions Addressed:

Column Composition:
- What columns are present in the loaded datasets?
Number of Movies:
- How many movies are included in the provided dataset?
Number of Users:
- How many users have provided ratings in the dataset?
Missing Data:
- Are there any missing values in the dataset?
Movies without Ratings:
- How many movies lack ratings, and which ones are they?
Best-Rated Movie:
- Which movie has the highest average rating? In case of ties, consider the one with the most votes.
Percentage of Top-Rated Movies:
- What percentage of movies have only maximum ratings?
Movie with Highest Minimum Rating:
- Which movie has the highest minimum rating? In case of ties, consider the one with the most votes.
Distribution of Ratings:
- What is the distribution of ratings?
Documentary Films:
- How many movies are classified as 'documentary'?
Best-Rated Documentary with 10+ Votes:
- Which documentary movie with at least 10 votes has the highest average rating?
Yearly Movie Count Differences:
- What are the differences in the number of movies each year? Assume the timestamp represents seconds since 1960.
Average Categories per Movie:
- What is the average number of categories assigned to a movie? Which movie has the most categories, and what are they?
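
Two of the questions above carry rules that are easy to get wrong: the tie-break on vote count and the 1960 timestamp origin. The sketch below illustrates both in plain Python, so the logic can be checked without a Spark session. It is not the project's actual code. The function names and the input shapes (a list of `(movie_id, rating)` pairs and a list of integer timestamps) are assumptions made for illustration, and the same rules would normally be expressed in PySpark with grouped aggregations.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Assumption taken from the question list: timestamps count seconds from 1960-01-01.
EPOCH_1960 = datetime(1960, 1, 1)


def timestamp_to_year(seconds):
    """Convert seconds since 1960-01-01 into a calendar year."""
    return (EPOCH_1960 + timedelta(seconds=seconds)).year


def yearly_count_differences(timestamps):
    """Map each year to its count minus the count of the previous year present in the data."""
    counts = Counter(timestamp_to_year(ts) for ts in timestamps)
    years = sorted(counts)
    # The earliest year has no predecessor, so it is left out of the result.
    return {year: counts[year] - counts[prev] for prev, year in zip(years, years[1:])}


def best_rated(ratings):
    """Return the movie with the highest mean rating; ties go to the movie with more votes."""
    by_movie = defaultdict(list)
    for movie_id, rating in ratings:
        by_movie[movie_id].append(rating)
    # Tuples compare element by element: mean rating first, then vote count.
    return max(
        by_movie,
        key=lambda m: (sum(by_movie[m]) / len(by_movie[m]), len(by_movie[m])),
    )
```

Note that `yearly_count_differences` compares each year with the previous year that appears in the data, so a gap in the years is skipped rather than treated as a year with zero movies. Whether that is the desired behaviour depends on the dataset.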

Feel free to explore the code, adapt it to other datasets, and enhance the analysis as needed. Happy exploring!

zuzannapiekarczyk/Movie-Dataset-Analysis-using-PySpark
