/IMDB_analysis_AWS_PySpark

Analysis of IMDB Datasets with AWS and PySpark

Primary Language: Jupyter Notebook

Analyzing IMDB Datasets with AWS Spark Cluster and PySpark, 2023

  • Set up a Spark cluster on AWS Elastic MapReduce (EMR) for advanced analysis of IMDB datasets from Kaggle.
  • Configured AWS infrastructure including IAM user creation and policy attachment for secure and efficient data processing.
  • Employed AWS EMR Studio integrated with the Spark cluster for a streamlined analysis environment.
  • Utilized PySpark in a Jupyter Notebook for data exploration, cleaning, and loading into Spark DataFrames.
  • Executed advanced PySpark queries and SQL queries within Spark to analyze various aspects of the movie industry and extract detailed insights.
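
The load, clean, and query steps listed above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`title`, `release_year`, `rating`), the view name `movies`, and the helper functions are all hypothetical, and the real Kaggle IMDB files may use a different schema.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # used for type hints only; pyspark is imported lazily below
    from pyspark.sql import DataFrame, SparkSession

# Hypothetical column names; adjust to the schema of the actual dataset.
TITLE_COL = "title"
YEAR_COL = "release_year"
RATING_COL = "rating"


def load_and_clean(spark: SparkSession, path: str) -> DataFrame:
    """Read a CSV into a Spark DataFrame and drop rows unusable for the analysis."""
    from pyspark.sql import functions as F

    df = spark.read.csv(path, header=True, inferSchema=True)
    return (
        # Cast first so that unparseable ratings become null and are dropped.
        df.withColumn(RATING_COL, F.col(RATING_COL).cast("double"))
        .dropna(subset=[TITLE_COL, YEAR_COL, RATING_COL])
        .dropDuplicates([TITLE_COL, YEAR_COL])
    )


def top_years_query(view: str, min_titles: int = 10) -> str:
    """Build a Spark SQL query that ranks years by average rating."""
    return (
        f"SELECT {YEAR_COL}, COUNT(*) AS n_titles, "
        f"ROUND(AVG({RATING_COL}), 2) AS avg_rating "
        f"FROM {view} GROUP BY {YEAR_COL} "
        f"HAVING COUNT(*) >= {min_titles} "
        f"ORDER BY avg_rating DESC"
    )


def analyze(spark: SparkSession, path: str) -> DataFrame:
    """Clean the data, register it as a temp view, and run the SQL query."""
    movies = load_and_clean(spark, path)
    movies.createOrReplaceTempView("movies")
    return spark.sql(top_years_query("movies"))
```

In a notebook attached to a Spark cluster, a `spark` session is typically already available, so a call such as `analyze(spark, path).show(10)` with `path` pointing at the uploaded CSV would display the top-rated years. Registering a temporary view is what lets the same DataFrame be queried with plain SQL alongside the DataFrame API.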