WikiArt Scraper

This repository contains scrapers for wikiart.org. They were developed as part of the Art Guide project, undertaken in the classes Practicing DS Skills in ML Competitions and Building ML-powered Applications.

For our project, we required comprehensive metadata about art pieces, such as genres, styles, and other descriptors, which was not present in the other datasets we found. These scrapers are therefore designed to extract all tabular information about art pieces, artists, art movements, schools, and styles present on the website.

Overview

The project consists of 5 crawlers:

wikiart spider: This crawler extracts comprehensive details and images of various art pieces from the WikiArt website.
wikiart artists spider: This crawler specializes in gathering information about artists.
wikiart styles spider: This crawler is focused on collecting extensive information about different art styles.
wikiart movements spider: This crawler collects information about art movements.
wikiart schools spider: This crawler concentrates on gathering comprehensive data about art schools.

In addition to the primary crawlers, the project includes DuckDuckGo spiders for updating descriptions in specific categories:

duck_duck_go.py: Updates descriptions for art pieces.
duck_duck_go_artist.py: Updates information about artists.
duck_duck_go_style.py: Updates information about art styles.
duck_duck_go_movement.py: Updates information about art movements.
duck_duck_go_school.py: Updates information about art schools.

Each DuckDuckGo spider takes an existing dataset as input and fetches updated descriptions for the paintings, artists, styles, movements, or schools it lists, which keeps the data complete and current.
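The README does not say how the update files relate to the original datasets, so the following is only a minimal sketch of one way to fold updated descriptions back into a base dataset. The column names `url` and `description`, the helper `merge_descriptions`, and the in-memory sample rows are all assumptions made for illustration; the real CSV headers may differ.

```python
import csv
import io


def merge_descriptions(base_rows, update_rows, key="url", field="description"):
    """Return copies of base_rows with `field` replaced by non-empty updates.

    `key` and `field` are hypothetical column names, not confirmed by the project.
    """
    # Index the update rows by key, skipping rows whose update value is empty.
    updates = {row[key]: row[field] for row in update_rows if row.get(field)}
    merged = []
    for row in base_rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        if new_row[key] in updates:
            new_row[field] = updates[new_row[key]]
        merged.append(new_row)
    return merged


# Tiny in-memory stand-ins for a base file and its update file.
base_csv = "url,title,description\n/a,First,\n/b,Second,old text\n"
update_csv = "url,description\n/a,new text\n/b,\n"

base = list(csv.DictReader(io.StringIO(base_csv)))
update = list(csv.DictReader(io.StringIO(update_csv)))
merged = merge_descriptions(base, update)

for row in merged:
    print(row["url"], "->", row["description"])
```

Empty update values are ignored here so that a failed lookup never overwrites an existing description; that is a design choice of this sketch, not documented behaviour of the spiders.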

Scraped Information for Artworks:

URL
Title
Original Title
Author
Author Link
Date
Styles
Series
Series Link
Genre
Genre Link
Media
Location
Dimensions
Description
Wiki Description
Wiki Link
Tags
Image URLs
Images

Scraped Information for Artists:

URL
Name
Original Name
Birth Date
Birthplace
Death Date
Death Place
Active Years
Nationality
Art Movements
Painting School
Genres
Fields
Influenced On
Influenced By
Teachers
Pupils
Art Institutions
Friends And Coworkers
Description
Wiki Description
Wikipedia Link

Scraped Information for Art Styles:

Name
Link
Description

Scraped Information for Art Movements:

Name
Link
Description

Scraped Information for Art Schools:

Name
Link
Description
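Styles, movements, and schools share the same three fields, so a downstream consumer could represent all of them with one record type. The sketch below assumes the CSV headers are literally `Name`, `Link`, and `Description`; the class `CategoryRecord`, its `from_row` helper, and the example.org link are hypothetical and not part of the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryRecord:
    """One style, movement, or school row (hypothetical helper type)."""

    name: str
    link: str
    description: str = ""

    @classmethod
    def from_row(cls, row):
        # Header names are assumed; adjust them to the actual CSV columns.
        return cls(
            name=row["Name"].strip(),
            link=row["Link"].strip(),
            description=(row.get("Description") or "").strip(),
        )


# A made-up row with a missing description and stray whitespace.
record = CategoryRecord.from_row(
    {"Name": " Cubism ", "Link": "https://example.org/cubism", "Description": None}
)
print(record)
```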

The main objective is to extract detailed data about art pieces and artists from the website, providing valuable datasets for data science and machine learning endeavors.

Scraping 191265 images took ~14 hours on a MacBook Pro (Retina, 15-inch, Mid 2015, 2.2 GHz Quad-Core Intel Core i7). Scraping 3521 artists took less than 10 minutes.
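To estimate how long a run of a different size might take, the reported figures can be converted into rough throughput. This is back-of-the-envelope arithmetic only; real rates depend on the network, the machine, and the site's response times.

```python
# Figures reported above.
images = 191_265
image_hours = 14       # approximate duration
artists = 3_521
artist_minutes = 10    # upper bound ("less than 10 minutes")

images_per_second = images / (image_hours * 3600)
artists_per_second = artists / (artist_minutes * 60)

print(f"images:  ~{images_per_second:.1f} per second")
print(f"artists: at least ~{artists_per_second:.1f} per second")
```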

Prerequisites

Python 3.x (tested with 3.10)
Scrapy

Installation

Clone this repository: git clone https://github.com/michaelvin1322/scrapWikiArt
Navigate to the repository and install the required packages:

cd scrapWikiArt
pip install -r requirements.txt

Usage

Run each crawler from the repository root with the command listed for it:

Crawler	Command
Art Pieces Crawler	`scrapy runspider -o data/data.csv -t csv ScrapWikiArt/spiders/wikiart.py`
Artists Crawler	`scrapy runspider -o data/artists.csv -t csv ScrapWikiArt/spiders/wikiart_artist.py`
Styles Crawler	`scrapy runspider -o data/styles.csv -t csv ScrapWikiArt/spiders/wikiart_style.py`
Movements Crawler	`scrapy runspider -o data/movements.csv -t csv ScrapWikiArt/spiders/wikiart_movement.py`
Schools Crawler	`scrapy runspider -o data/schools.csv -t csv ScrapWikiArt/spiders/wikiart_school.py`
DuckDuckGo Crawler	`scrapy runspider -o data/data_update.csv -t csv -a input_file=data/data.csv ScrapWikiArt/spiders/duck_duck_go.py`
DuckDuckGo Artist Spider	`scrapy runspider -o data/artist_update.csv -t csv -a input_file=data/artists.csv ScrapWikiArt/spiders/duck_duck_go_artist.py`
DuckDuckGo Styles Spider	`scrapy runspider -o data/styles_update.csv -t csv -a input_file=data/styles.csv ScrapWikiArt/spiders/duck_duck_go_style.py`
DuckDuckGo Movements Spider	`scrapy runspider -o data/movements_update.csv -t csv -a input_file=data/movements.csv ScrapWikiArt/spiders/duck_duck_go_movement.py`
DuckDuckGo Schools Spider	`scrapy runspider -o data/schools_update.csv -t csv -a input_file=data/schools.csv ScrapWikiArt/spiders/duck_duck_go_school.py`
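The commands in the table all follow one pattern, so they can be generated rather than typed by hand. The following is a minimal sketch: the `CRAWLERS` mapping and the `build_command` helper are illustrative names, not part of the repository, and the actual `subprocess` call is left commented out so the snippet only prints the commands.

```python
import shlex

# Output file -> spider module for the five primary crawlers, copied from the table.
CRAWLERS = {
    "data/data.csv": "ScrapWikiArt/spiders/wikiart.py",
    "data/artists.csv": "ScrapWikiArt/spiders/wikiart_artist.py",
    "data/styles.csv": "ScrapWikiArt/spiders/wikiart_style.py",
    "data/movements.csv": "ScrapWikiArt/spiders/wikiart_movement.py",
    "data/schools.csv": "ScrapWikiArt/spiders/wikiart_school.py",
}


def build_command(output, spider, input_file=None):
    """Build the argument list for one `scrapy runspider` invocation."""
    cmd = ["scrapy", "runspider", "-o", output, "-t", "csv"]
    if input_file is not None:
        # The DuckDuckGo spiders receive their source dataset via -a input_file=...
        cmd += ["-a", f"input_file={input_file}"]
    cmd.append(spider)
    return cmd


commands = [build_command(out, spider) for out, spider in CRAWLERS.items()]
for cmd in commands:
    print(shlex.join(cmd))
    # To actually run a crawler, import subprocess and call:
    # subprocess.run(cmd, check=True)
```

A DuckDuckGo command is built the same way, for example `build_command("data/data_update.csv", "ScrapWikiArt/spiders/duck_duck_go.py", input_file="data/data.csv")`.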

Output

Art Pieces Crawler

By default, images will be downloaded into the data/img directory and data will be saved in data/data.csv.

The image folder can be changed by editing the IMAGES_STORE path in settings.py.
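As an illustration, the relevant line in settings.py might look like the fragment below. The target directory `wikiart_images` under the user's home folder is a made-up example; the rest of the project's settings are not shown and are not assumed here.

```python
# settings.py (fragment)
from pathlib import Path

# Directory where downloaded images are written.
# The project's default is data/img; this path is a hypothetical alternative.
IMAGES_STORE = str(Path.home() / "wikiart_images")
```

Scrapy also accepts setting overrides on the command line through `-s NAME=VALUE`, so adding `-s IMAGES_STORE=some/other/dir` to a `scrapy runspider` command should have the same effect for a single run without editing the file.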

Artists Crawler

By default, data will be saved in data/artists.csv.

Styles Crawler

By default, data will be saved in data/styles.csv.

Movements Crawler

By default, data will be saved in data/movements.csv.

Schools Crawler

By default, data will be saved in data/schools.csv.
