Cleaning Movie Metadata with Python and SQL

Extract scraped Wikipedia data stored as JSON and Kaggle data stored in CSVs, then transform it and load it into structured tables using SQL

Goals

Our goal is to gather movie data from both Wikipedia and Kaggle and build a structured movie dataset. We use Python and Pandas to explore and document both data sources, combine them, and transform the data. Once the data is in a consistent structure, we load it into the data target, a PostgreSQL table.
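The transform and load steps described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the column names (`imdb_id`, `running_time`, `budget`), the sample rows, the `transform` and `load` helpers, and the `movies` table name are all assumptions. An in-memory SQLite database stands in for PostgreSQL so the sketch runs without a database server.

```python
import sqlite3

import pandas as pd

# Hypothetical stand-ins for the two sources; column names are assumptions.
wiki_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000003"],
    "title": ["Movie A", "Movie B", "Movie C"],
    "running_time": ["90 minutes", "120 minutes", None],
})
kaggle_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000004"],
    "budget": [1000000, 2500000, 500000],
})


def transform(wiki, kaggle):
    """Clean the Wikipedia rows and combine them with the Kaggle rows."""
    wiki = wiki.drop_duplicates(subset="imdb_id").copy()
    # Pull the first run of digits out of strings like "90 minutes".
    wiki["running_time"] = pd.to_numeric(
        wiki["running_time"].str.extract(r"(\d+)", expand=False)
    )
    # Keep only movies that appear in both sources.
    return wiki.merge(kaggle, on="imdb_id", how="inner")


def load(df, conn, table="movies"):
    """Write the combined data to a SQL table, replacing any old copy."""
    df.to_sql(table, conn, if_exists="replace", index=False)


movies_df = transform(wiki_df, kaggle_df)

# SQLite is used here only so the example is self-contained.
conn = sqlite3.connect(":memory:")
load(movies_df, conn)
row_count = conn.execute("SELECT COUNT(*) FROM movies").fetchone()[0]
print(row_count)
```

To target PostgreSQL instead, one would typically pass a SQLAlchemy engine to `to_sql` in place of the SQLite connection; the connection details depend on the local setup.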

Dataset

High-level description of the data sources:

Wikipedia Movies: JSON file containing metadata for 6,075 movies, extracted from Wikipedia
Kaggle Movie Ratings: Compressed CSV containing metadata for 45,466 movies, downloaded from Kaggle.com
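Reading the two sources into Pandas might look like the sketch below. The file names, the sample records, the `extract` helper, and the assumption that the compressed CSV is gzip-compressed are all hypothetical; the sketch writes two tiny sample files to a temporary directory so it can run on its own.

```python
import json
import tempfile
from pathlib import Path

import pandas as pd


def extract(json_path, csv_path):
    """Load the Wikipedia JSON and the Kaggle CSV into DataFrames."""
    with open(json_path, encoding="utf-8") as f:
        wiki_records = json.load(f)  # expected: a list of dicts, one per movie
    wiki_df = pd.DataFrame(wiki_records)
    # Compression is inferred from the file extension (.gz assumed here).
    kaggle_df = pd.read_csv(csv_path, low_memory=False)
    return wiki_df, kaggle_df


# Tiny sample files standing in for the real downloads (names are made up).
tmp_dir = Path(tempfile.mkdtemp())
json_path = tmp_dir / "wikipedia_movies.json"
csv_path = tmp_dir / "movies_metadata.csv.gz"

sample_wiki = [
    {"title": "Movie A", "year": 1990},
    {"title": "Movie B", "year": 1995},
]
json_path.write_text(json.dumps(sample_wiki), encoding="utf-8")

sample_kaggle = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
sample_kaggle.to_csv(csv_path, index=False)  # gzip inferred from ".gz"

wiki_df, kaggle_df = extract(json_path, csv_path)
print(wiki_df.shape, kaggle_df.shape)
```

Checking `shape` right after loading is a quick way to confirm that the row counts match what each source is expected to contain.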

Tools Used

Python: Programming language used to build the automated ETL pipeline
- Pandas: Open source Python library providing high-performance data analysis tools
- NumPy: Open source Python library used for numerical and scientific computing
PostgreSQL: Software used to build databases and analyze data with SQL
SQL: Structured Query Language, used to query databases and quickly analyze structured data

rivas-j/Movie-DB-ETL_Python_SQL
