AVA-Scraper


What is AVA?

AVA: A Large-Scale Database for Aesthetic Visual Analysis

The AVA dataset was released in 2012 for research on image aesthetics. It consists of over 250,000 images, each annotated with the number of votes per rating (1-10), semantic tags, and the challenge it is associated with.
PAPER: AVA: A Large-Scale Database for Aesthetic Visual Analysis

Later, in 2016, the comments for these images (labeled AVA-Comments) were released, totaling over 1.5 million comments.
PAPER: Joint Image and Text Representation for Aesthetics Analysis

Both comments and images were taken from dpchallenge.com.


What is AVA-Scraper?

Images posted after 2012 were not scraped for AVA and are still available on dpchallenge.com. This scraper downloads those images, their comments, and other metadata. It is divided into three parts:

  • Image scraper: Extracts images along with their ratings and number of votes. Each image is stored as IMAGE_ID.jpg.

  • Comment scraper: Extracts the comments for each image, applies some text cleaning (e.g. removal of URLs, carriage-return characters, etc.), and stores them as IMAGE_ID.txt, one comment per line (see the sketch after this list).

  • Others: Extracts new challenges and existing rules.
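
For illustration, the sketch below shows roughly what the comment cleaning and storage step could look like; clean_comment, save_comments, and the comments directory are hypothetical names, not the scraper's actual code.

    import os
    import re

    def clean_comment(text):
        """Drop URLs and carriage returns, then collapse whitespace."""
        text = re.sub(r"https?://\S+", "", text)
        text = text.replace("\r", " ").replace("\n", " ")
        return re.sub(r"\s+", " ", text).strip()

    def save_comments(image_id, comments, out_dir="comments"):
        """Write the cleaned comments to IMAGE_ID.txt, one comment per line."""
        os.makedirs(out_dir, exist_ok=True)
        cleaned = [clean_comment(c) for c in comments]
        with open(os.path.join(out_dir, f"{image_id}.txt"), "w", encoding="utf-8") as f:
            f.write("\n".join(c for c in cleaned if c))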

The new data is stored under the name AVA 2.0.

How does it work?

Scraping takes place in the following order:

  1. New challenges are extracted from dpchallenge.com, stopping when the last challenge included in the original AVA dataset is reached.

  2. Going one challenge at a time, images are extracted along with their ratings and votes.

  3. The ID of each extracted image is saved. Then, looping over the images one at a time, the comments and semantic tags are extracted (the control flow is sketched below).
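
A minimal sketch of that control flow, with the three scraping routines left as placeholder callables rather than the repository's actual functions:

    def run_scraper(iter_new_challenges, scrape_images, scrape_comments_and_tags):
        """Drive the stages in the order described above.

        The three arguments stand in for the real scraping routines;
        only the ordering is illustrated here.
        """
        image_ids = []
        # Steps 1-2: walk the new challenges and pull images, ratings and votes.
        for challenge in iter_new_challenges():
            image_ids.extend(scrape_images(challenge))
        # Step 3: revisit every saved image ID for comments and semantic tags.
        for image_id in image_ids:
            scrape_comments_and_tags(image_id)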

As of 11 August 2017, 81,986 new images have been extracted. This only includes images WITH ratings.

NOTE: There is always a delay of 60 s between requests, as required by the site's robots.txt. If you get blocked, it takes around a week to get unblocked.
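
A minimal sketch of how such a delay can be enforced around every request (polite_get is a hypothetical helper, not part of this repository's API):

    import time
    import requests

    CRAWL_DELAY = 60  # seconds, per dpchallenge.com's robots.txt
    session = requests.Session()

    def polite_get(url):
        """Fetch a page, then sleep for the crawl delay before returning."""
        response = session.get(url, timeout=30)
        response.raise_for_status()
        time.sleep(CRAWL_DELAY)
        return response.text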

An emergency function has been added to handle issues such as the site's server being down or a loss of internet connection. It carries on scraping from where it left off :)
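
One common way to support this kind of resumption is to checkpoint finished image IDs to disk and skip them on restart; the sketch below assumes a checkpoint.json file and is not the repository's actual emergency function.

    import json
    import os

    CHECKPOINT = "checkpoint.json"  # hypothetical progress file

    def load_done_ids():
        """Return the set of image IDs that have already been scraped."""
        if os.path.exists(CHECKPOINT):
            with open(CHECKPOINT) as f:
                return set(json.load(f))
        return set()

    def mark_done(done_ids, image_id):
        """Record a finished image so a restart can carry on from here."""
        done_ids.add(image_id)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done_ids), f)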