This project performs exploratory data analysis (EDA) and visualization of a movie dataset using PySpark. It answers the following questions about the dataset:
- Column Composition:
  - What columns are present in the loaded datasets?
- Number of Movies:
  - How many movies are included in the provided dataset?
- Number of Users:
  - How many users have provided ratings in the dataset?
- Missing Data:
  - Are there any missing values in the dataset?
- Movies without Ratings:
  - How many movies lack ratings, and which ones are they?
- Best-Rated Movie:
  - Which movie has the highest average rating? In case of a tie, pick the movie with the most votes.
- Percentage of Top-Rated Movies:
  - What percentage of movies have received only the maximum rating?
- Movie with Highest Minimum Rating:
  - Which movie has the highest minimum rating? In case of a tie, pick the movie with the most votes.
- Distribution of Ratings:
  - What is the distribution of ratings?
- Documentary Films:
  - How many movies are classified as 'documentary'?
- Best-Rated Documentary with 10+ Votes:
  - Which documentary movie with at least 10 votes has the highest average rating?
- Yearly Movie Count Differences:
  - How does the number of movies change from one year to the next? Assume the timestamp represents seconds since 1960.
- Average Categories per Movie:
  - What is the average number of categories assigned to a movie? Which movie has the most categories, and what are they?

Feel free to explore the code, adapt it to other datasets, and enhance the analysis as needed. Happy exploring!