Scrape data and images from the Prado Museum website to build a dataset for training a Generative Adversarial Network (GAN).
Website: https://www.museodelprado.es/coleccion/obras-de-arte
Steps:
- Get list of work URLs
Download the listing pages in both ascending and descending order to work around the pagination limit of 10,000 results, then normalize/canonicalize the URLs and remove duplicates.
sh get_pages.sh
- Extract work URLs
python parse_pages.py
- Download HTML of the URLs
sh get_works.sh
- Parse the downloaded HTML files
python parse_works.py
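
The first step fetches the listing twice (ascending and descending), so the two result sets overlap and the same work can show up under slightly different URL spellings. The sketch below shows one way to canonicalize and de-duplicate such URLs in Python. It is not the code in parse_pages.py: the function names `canonicalize` and `merge_unique` are hypothetical, and the normalization rules (lowercase host, drop query string and fragment, strip the trailing slash) are assumptions that may need adjusting to the site's actual URL scheme.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url: str) -> str:
    """Return a normalized form of a URL.

    Assumed rules (not taken from the original scripts): lowercase the
    scheme and host, drop the query string and fragment, and strip any
    trailing slash from the path.
    """
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def merge_unique(*url_lists):
    """Merge several URL lists, keeping the first occurrence of each
    canonical URL and preserving the order in which URLs were seen."""
    seen = set()
    merged = []
    for urls in url_lists:
        for url in urls:
            key = canonicalize(url)
            if key not in seen:
                seen.add(key)
                merged.append(key)
    return merged


if __name__ == "__main__":
    # Made-up example URLs; real work URLs come from the downloaded pages.
    ascending = [
        "https://www.museodelprado.es/coleccion/obra-de-arte/a/",
        "https://www.museodelprado.es/coleccion/obra-de-arte/b",
    ]
    descending = [
        "https://www.museodelprado.es/coleccion/obra-de-arte/b#top",
        "https://www.museodelprado.es/coleccion/obra-de-arte/c",
    ]
    for url in merge_unique(ascending, descending):
        print(url)
```

Keeping a set of canonical forms next to an ordered list means each URL is checked in constant time while the output order stays stable, which helps when re-running the later download step.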