/Movie-DB-ETL_Python_SQL

Extract scraped Wikipedia data stored as JSON and Kaggle data stored in CSVs, then transform it and load it into a structured database using SQL

Cleaning Movie Metadata with Python and SQL

Goals  •  Dataset  •  Tools Used

Goals

Our goal is to gather movie data from both Wikipedia and Kaggle and build a single structured movie dataset. We use Python and Pandas to explore and document both data sources, combine them, and transform the data. Once the data is in a consistent structure, we load it into the data target, a PostgreSQL table.
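The transform and load steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names (`imdb_id`, `box_office`, `budget`, `release_date`), the sample rows, and the `movies` table name are all assumptions, and an in-memory SQLite database stands in for PostgreSQL so the example runs without a database server.

```python
import sqlite3

import pandas as pd

# Illustrative stand-ins for the two sources; every column name is an assumption.
wiki_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000002"],
    "title": ["Example Film", "Another Film", "Another Film"],
    "box_office": ["$1,500,000", "$2,000,000", "$2,000,000"],
})
kaggle_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002"],
    "budget": ["1000000", "750000"],
    "release_date": ["1999-05-01", "2005-11-15"],
})


def transform(wiki, kaggle):
    """Clean each source, then join them on a shared IMDb ID."""
    wiki = wiki.drop_duplicates(subset="imdb_id").copy()
    # Strip currency formatting so box office figures become numeric.
    wiki["box_office"] = pd.to_numeric(
        wiki["box_office"].str.replace(r"[$,]", "", regex=True)
    )
    kaggle = kaggle.copy()
    # Unparseable budgets become NaN instead of raising an error.
    kaggle["budget"] = pd.to_numeric(kaggle["budget"], errors="coerce")
    kaggle["release_year"] = pd.to_datetime(kaggle["release_date"]).dt.year
    return wiki.merge(kaggle, on="imdb_id", how="inner")


movies = transform(wiki_df, kaggle_df)

# SQLite stands in for PostgreSQL here; with PostgreSQL, a SQLAlchemy engine
# would be passed to to_sql instead of this connection.
conn = sqlite3.connect(":memory:")
movies.to_sql("movies", conn, if_exists="replace", index=False)
row_count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
print(row_count)
```

Joining on an ID shared by both sources, rather than on title, avoids mismatches between films with identical or slightly different titles; whether the real data exposes such an ID in both files is an assumption of this sketch.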

Dataset

The project draws on two data sources:

  • Wikipedia Movies: JSON file containing metadata for 6,075 movies, extracted from Wikipedia
  • Kaggle Movie Ratings: Compressed CSV containing metadata for 45,466 movies, downloaded from Kaggle.com
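A sketch of the extract step for these two sources might look like the following. The file names, the helper functions `load_wiki_movies` and `load_kaggle_metadata`, and the sample records are hypothetical; the example writes tiny stand-in files to a temporary directory so it runs without the real downloads.

```python
import gzip
import json
import tempfile
from pathlib import Path

import pandas as pd


def load_wiki_movies(json_path):
    """Read a JSON file holding a list of movie records into a DataFrame."""
    with open(json_path, encoding="utf-8") as f:
        records = json.load(f)
    return pd.DataFrame(records)


def load_kaggle_metadata(csv_path):
    """Read a CSV into a DataFrame; pandas infers gzip compression from the .gz extension."""
    return pd.read_csv(csv_path, low_memory=False)


# Tiny stand-in files (hypothetical names and contents).
tmp_dir = Path(tempfile.mkdtemp())
wiki_path = tmp_dir / "wikipedia_movies.json"
kaggle_path = tmp_dir / "movies_metadata.csv.gz"

wiki_path.write_text(json.dumps([
    {"title": "Example Film", "year": 1999, "imdb_id": "tt0000001"},
    {"title": "Another Film", "year": 2005, "imdb_id": "tt0000002"},
]), encoding="utf-8")
with gzip.open(kaggle_path, "wt", encoding="utf-8", newline="") as f:
    f.write("imdb_id,title,budget\ntt0000001,Example Film,1000000\n")

wiki_df = load_wiki_movies(wiki_path)
kaggle_df = load_kaggle_metadata(kaggle_path)
print(wiki_df.shape, kaggle_df.shape)
```

Passing `low_memory=False` makes pandas read the whole file before inferring column types, which helps with wide CSVs whose columns contain mixed values.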

Tools Used

  • Python: Programming language used to build the extract, transform, and load (ETL) pipeline
    • Pandas: Open source Python library providing high-performance data analysis tools
    • NumPy: Open source Python library used for numerical and scientific computing
  • PostgreSQL: Software used to build databases and analyze data with SQL
  • SQL: Structured Query Language, used to query databases and quickly analyze structured data