/prado-downloader

Scrape data and images from the PRADO MUSEUM website to build a dataset for a Generative Adversarial Network

Primary language: Python. License: MIT.

Prado Museum's Project

Website: https://www.museodelprado.es/coleccion/obras-de-arte

Steps:

  1. Get the list of work URLs

     Download the listing pages in both ascending and descending order to work around the 10,000-result limit of the pagination (then normalize/canonicalize the URLs and remove duplicates).

         sh get_pages.sh

  2. Extract the work URLs

         python parse_pages.py

  3. Download the HTML of the URLs

         sh get_works.sh

  4. Parse the downloaded HTML files

         python parse_works.py
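
The canonicalize-and-deduplicate idea from step 1 can be sketched in Python as follows. This is only an illustration, not the code in `parse_pages.py`: the names `LinkCollector`, `canonicalize`, `extract_work_urls` and `merge_unique` are hypothetical, and both the `WORK_PATH_PREFIX` value and the decision to drop query strings and fragments are assumptions about how work URLs look on the site.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit, urlunsplit

BASE_URL = "https://www.museodelprado.es"
# Assumption: individual work pages live under this path prefix.
WORK_PATH_PREFIX = "/coleccion/obra-de-arte/"


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def canonicalize(href, base=BASE_URL):
    """Make the URL absolute; drop query, fragment and trailing slash."""
    parts = urlsplit(urljoin(base, href))
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def extract_work_urls(html):
    """Return the canonical work URLs linked from one listing page."""
    parser = LinkCollector()
    parser.feed(html)
    urls = []
    for href in parser.hrefs:
        url = canonicalize(href)
        if urlsplit(url).path.startswith(WORK_PATH_PREFIX):
            urls.append(url)
    return urls


def merge_unique(*url_lists):
    """Merge several URL lists, keeping the first occurrence of each URL."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            if url not in seen:
                seen.add(url)
                merged.append(url)
    return merged
```

Feeding the ascending and the descending listing pages through `extract_work_urls` and passing both results to `merge_unique` yields one list in which works that appear in both orderings are kept only once.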