Open Library data cleanup case study

Open Library is an initiative of the Internet Archive, a 501(c)(3) non-profit that is building a digital library of Internet sites and other cultural artifacts in digital form. In its Bulk Data Dumps section, Open Library provides public feeds of the library data:
https://openlibrary.org/developers/dumps

A shorter version of the file is also available for development or exploratory purposes. It is around 140MB instead of the ~20GB of the original/full file (the “complete dump”):
https://s3-eu-west-1.amazonaws.com/csparkdata/ol_cdump.json
Start with the short version of this file. Please download it to your local laptop:
wget --continue https://s3-eu-west-1.amazonaws.com/csparkdata/ol_cdump.json -O /tmp/ol_cdump.json
Please use the JSON file to provide the following information.
- Load the data
- Make sure your data set is clean enough. For example, exclude records with empty/null "titles", and keep only records where "number of pages" is greater than 20 and "publishing year" is after 1950. State your filters clearly.
- Run the following queries on the preprocessed/cleaned dataset:
  - Select all "Harry Potter" books
  - Get the book with the most pages
  - Find the top 5 authors with the most books (assume the author is the first element of the array, identified by its "key" field, and that each row is a different book)
  - Find the top 5 genres with the most books
  - Get the average number of pages
  - Per publish year, get the number of authors that published at least one book
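
The steps above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not a reference solution. It assumes the file holds one JSON object per line, and that the relevant fields are named "title", "number_of_pages", "publish_date", "authors" and "genres". None of these names are given in the task, so verify them against the actual data. It also assumes "publish_date" is free-form text from which a four-digit year can be extracted, and it counts only the first author per book in the per-year query. The helper names and the sample records are made up for illustration.

```python
import json
import re
from collections import Counter, defaultdict

# Assumed field names (not stated in the task): "title", "number_of_pages",
# "publish_date", "authors" (list of {"key": ...}) and "genres" (list of str).
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def load_records(lines):
    """Parse one JSON object per line, skipping blank or malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def publish_year(record):
    """Return the first 4-digit year found in "publish_date", or None."""
    match = YEAR_RE.search(str(record.get("publish_date") or ""))
    return int(match.group(1)) if match else None


def first_author(record):
    """Return the "key" of the first entry in "authors", or None."""
    authors = record.get("authors")
    if isinstance(authors, list) and authors and isinstance(authors[0], dict):
        key = authors[0].get("key")
        return key if isinstance(key, str) else None
    return None


def is_clean(record):
    """Filters: non-empty title, more than 20 pages, published after 1950."""
    title = record.get("title")
    pages = record.get("number_of_pages")
    year = publish_year(record)
    return (
        isinstance(title, str) and title.strip() != ""
        and type(pages) is int and pages > 20
        and year is not None and year > 1950
    )


def run_queries(books):
    """Run the six queries on an already cleaned list of records."""
    genres = Counter()
    authors_by_year = defaultdict(set)
    for book in books:
        book_genres = book.get("genres")
        if isinstance(book_genres, list):
            genres.update(g for g in book_genres if isinstance(g, str))
        author = first_author(book)
        if author:
            authors_by_year[publish_year(book)].add(author)
    author_counts = Counter(a for a in map(first_author, books) if a)
    total_pages = sum(b["number_of_pages"] for b in books)
    return {
        "harry_potter": [b for b in books if "harry potter" in b["title"].lower()],
        "most_pages": max(books, key=lambda b: b["number_of_pages"], default=None),
        "top_authors": author_counts.most_common(5),
        "top_genres": genres.most_common(5),
        "avg_pages": total_pages / len(books) if books else None,
        "authors_per_year": {y: len(a) for y, a in sorted(authors_by_year.items())},
    }


# Made-up sample lines for illustration only; they are not taken from the dump.
SAMPLE_LINES = [
    '{"title": "Harry Potter and the Example", "number_of_pages": 300, '
    '"publish_date": "1999", "authors": [{"key": "/authors/A1"}], "genres": ["Fiction"]}',
    '{"title": "Another Example", "number_of_pages": 500, '
    '"publish_date": "March 2001", "authors": [{"key": "/authors/A1"}], "genres": ["Fiction"]}',
    '{"title": "Old Book", "number_of_pages": 200, "publish_date": "1920"}',
    '{"title": "", "number_of_pages": 100, "publish_date": "1990"}',
    '{"title": "Pamphlet", "number_of_pages": 12, "publish_date": "1985"}',
    'not json',
]

books = [r for r in load_records(SAMPLE_LINES) if is_clean(r)]
results = run_queries(books)
print(results["top_authors"], results["avg_pages"], results["authors_per_year"])
```

To try this on the downloaded file, open /tmp/ol_cdump.json in text mode and pass the file object to load_records in place of SAMPLE_LINES, since it only needs an iterable of lines. For the full ~20GB dump, a distributed or columnar tool would likely be a better fit than in-memory lists.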