Extract Scraped Wikipedia Data Stored as JSON and Kaggle Data Stored in CSVs, Then Transform and Load It Into Structured Data Using SQL
Goals • Dataset • Tools Used
Our goal is to gather movie data from both Wikipedia and Kaggle and build a single structured movie dataset. We use Python and Pandas to explore and document both data sources, combine them, and perform the data transformation. Once the data has been transformed into a consistent structure, we load it into the data target, a PostgreSQL table.
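The extract and transform steps described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the sample records, the column names (`imdb_id`, `Box office`, `vote_average`), and the choice of `imdb_id` as the join key are all assumptions made for the example, and the in-memory strings stand in for the real JSON and CSV files.

```python
import io
import json

import pandas as pd

# Hypothetical stand-ins for the two sources. The real files are much larger
# and their column names may differ from the ones assumed here.
wiki_json = json.dumps([
    {"title": "Movie A", "imdb_id": "tt0000001", "Box office": "$1.2 million"},
    {"title": "Movie B", "imdb_id": "tt0000002", "Box office": None},
])
kaggle_csv = (
    "imdb_id,title,vote_average\n"
    "tt0000001,Movie A,7.1\n"
    "tt0000003,Movie C,5.4\n"
)

# Extract: parse the JSON records and the CSV text into DataFrames.
wiki_df = pd.DataFrame(json.loads(wiki_json))
kaggle_df = pd.read_csv(io.StringIO(kaggle_csv))

# Transform: give columns consistent names, then keep only the movies that
# appear in both sources (pd.merge performs an inner join by default).
wiki_df = wiki_df.rename(columns={"Box office": "box_office"})
movies_df = pd.merge(
    wiki_df, kaggle_df, on="imdb_id", suffixes=("_wiki", "_kaggle")
)

print(movies_df)
```

Only the movie present in both sample sources survives the inner join. Columns that exist on both sides, such as `title`, are kept twice with the `_wiki` and `_kaggle` suffixes so they can be compared before one of them is dropped.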
High-level explanation of the data sources
- Wikipedia Movies: JSON file containing metadata for 6,075 movies, extracted from Wikipedia
- Kaggle Movie Ratings: Compressed CSV containing metadata for 45,466 movies, downloaded from Kaggle.com
Tools used
- Python: Programming language used to build the automated ETL pipeline
- Pandas: Open source Python library providing high-performance data structures and data analysis tools
- NumPy: Open source Python library used for numerical and scientific computing
- PostgreSQL: Open source relational database system used to store data and analyze it with SQL
- SQL: Structured Query Language, used to query databases and quickly analyze structured data
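To illustrate the load step and a follow-up SQL query, here is a small sketch. It uses Python's built-in `sqlite3` module as a stand-in so the example runs without a database server. The project itself targets PostgreSQL, where a common approach is pandas `DataFrame.to_sql` with a SQLAlchemy engine. The table name `movies`, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical transformed rows: (imdb_id, title, vote_average).
rows = [
    ("tt0000001", "Movie A", 7.1),
    ("tt0000002", "Movie B", 5.4),
]

# Load: create the target table and insert the transformed rows.
# An in-memory SQLite database stands in for the PostgreSQL target.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movies ("
    "imdb_id TEXT PRIMARY KEY, "
    "title TEXT NOT NULL, "
    "vote_average REAL)"
)
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", rows)
conn.commit()

# Query: once the data is structured, plain SQL can analyze it.
top_rated = conn.execute(
    "SELECT title, vote_average FROM movies "
    "WHERE vote_average >= ? ORDER BY vote_average DESC",
    (6.0,),
).fetchall()
conn.close()

print(top_rated)
```

The parameterized `?` placeholders keep values out of the SQL string. PostgreSQL drivers follow the same idea but typically use a different placeholder style.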