- Set up a Spark cluster on AWS Elastic MapReduce (EMR) for advanced analysis of IMDB datasets from Kaggle.
- Configured AWS infrastructure including IAM user creation and policy attachment for secure and efficient data processing.
- Employed AWS EMR Studio integrated with the Spark cluster for a streamlined analysis environment.
- Utilized PySpark in a Jupyter Notebook for data exploration, cleaning, and loading into Spark DataFrames.
- Executed advanced PySpark queries and leveraged SQL queries within Spark to analyze various aspects of the movie industry and extract detailed insights.
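
The notebook steps above (loading data into a Spark DataFrame, cleaning it, then querying it with Spark SQL) might look roughly like the sketch below. It is not taken from the repository: the S3 path, the `genre` and `rating` column names, the `movies` view name, and the 50-movie threshold are all hypothetical placeholders chosen for illustration.

```python
# Hypothetical S3 location and column names; the project's real ones may differ.
MOVIES_CSV = "s3://my-bucket/imdb/movies.csv"

# Spark SQL query against a temp view named "movies" (an assumed name).
TOP_GENRES_SQL = """
    SELECT genre,
           COUNT(*) AS n_movies,
           ROUND(AVG(rating), 2) AS avg_rating
    FROM movies
    GROUP BY genre
    HAVING COUNT(*) >= 50
    ORDER BY avg_rating DESC
"""


def load_and_clean(spark, path=MOVIES_CSV):
    """Read the CSV into a DataFrame and drop rows unusable for the query."""
    df = spark.read.csv(path, header=True, inferSchema=True)
    # Cast first so non-numeric ratings become null, then drop those rows.
    df = df.withColumn("rating", df["rating"].cast("double"))
    return df.dropna(subset=["genre", "rating"])


def top_genres(spark, df):
    """Register the DataFrame as a temp view and query it with Spark SQL."""
    df.createOrReplaceTempView("movies")
    return spark.sql(TOP_GENRES_SQL)
```

In a notebook attached to an EMR cluster, a `spark` session is typically already defined, so a call such as `top_genres(spark, load_and_clean(spark)).show(10)` would print the first ten rows. Registering a temp view is what lets the same cleaned DataFrame be queried with plain SQL instead of the DataFrame API.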
AlessandroSciorilli/IMDB_analysis_AWS_PySpark
Analysis of IMDB Datasets with AWS and PySpark
Jupyter Notebook