This project aims to classify fine-art paintings by their corresponding artists and styles.
We obtained the data from a Kaggle competition (Painter by Numbers: https://www.kaggle.com/c/painter-by-numbers). The dataset consists of over 100,000 paintings, each labeled with its artist, genre, style and date of creation.
- The folder Data contains two links that redirect you to the AWS S3 bucket. We stored the features extracted from VGG16 and ResNet50 in this bucket because the data (over 60 GB) cannot be stored on GitHub. We have stored the data on Google Drive as well.
- The folder FeatureExtracting contains the code to extract features from the images via CNNs. The main code is in create_image_feature.py. At the bottom of the file you can find a series of functions named "gen_model_name()". These functions can be called directly to generate image features. Their parameters are:
  - Output: The folder that receives the generated features. The function automatically creates a timestamped subfolder under it.
  - image_size: Input size of the CNN model. The default size is 224 × 224. Images of any other size are resized automatically.
  - batch_size: The number of image features grouped into each small batch file. The right size depends on the model and the computer's memory. The recommended value is 50 to 200.
  - Job: The number of jobs that run in parallel. It should not exceed the number of CPU cores. If it equals the number of CPU cores, the task will occupy all computational resources.
- The folder Classification contains the main Python scripts used to classify paintings into styles and artists using XGBoost. The script artist_classification.py loads the ResNet50 and VGG16 feature data for artists, splits the data into training and testing sets, trains the model using XGBoost, and then runs predictions on the testing data. The script style_classification.py does the same as artist_classification.py but for the styles data. The script plotting_metrics.py plots the confusion matrix and the ROC curves.
- The folder Data Preprocessing walks through the data preprocessing steps we performed for the project.
- The folder notebook contains rough working of the model training and testing. It is not the main code in the repo and is kept only as a reference for the team.
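
The way the Output, batch_size and Job parameters of the feature-extraction functions fit together can be illustrated with a small, self-contained sketch. It is not the code in create_image_feature.py: the names `generate_features`, `fake_extract` and `chunked` are hypothetical, the CNN forward pass is replaced by a placeholder, image resizing is left out, and threads are used for simplicity where the real script may well use processes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime


def fake_extract(image_name):
    """Placeholder for a CNN forward pass; returns a dummy feature vector."""
    return [float(len(image_name)), 0.0]


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_features(image_names, output, batch_size=100, job=2):
    """Write one file per batch into a timestamped subfolder of output."""
    # Never request more workers than there are CPU cores.
    workers = max(1, min(job, os.cpu_count() or 1))
    run_dir = os.path.join(output, datetime.now().strftime("%Y%m%d_%H%M%S"))
    os.makedirs(run_dir, exist_ok=True)

    def process(indexed_batch):
        index, batch = indexed_batch
        path = os.path.join(run_dir, f"batch_{index:04d}.txt")
        with open(path, "w") as handle:
            for name in batch:
                values = ",".join(str(v) for v in fake_extract(name))
                handle.write(f"{name},{values}\n")
        return path

    # Each worker handles whole batches, so every batch ends up in its own file.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        batches = enumerate(chunked(image_names, batch_size))
        return list(pool.map(process, batches))


# Hypothetical usage: 250 image names grouped into batches of 100.
names = [f"img_{i}.jpg" for i in range(250)]
out_root = tempfile.mkdtemp()
batch_files = generate_features(names, out_root, batch_size=100, job=2)
print(len(batch_files), "batch files written under", out_root)
```

A larger batch_size means fewer, bigger files and more memory per worker, which is why the recommended range depends on the model and the machine.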
Download all data from S3 into the same repository (including the images for feature extraction).
- Feature Extraction
python create_image_feature.py
- Artist Classification
python artist_classification.py
- Style Classification
python style_classification.py
After reading research papers, we decided to implement a hybrid CNN-XGBoost model in which the CNN extracts the painting features and the XGBoost classifier assigns each painting to its artist or style. The papers indicated that this hybrid model 1) is computationally less expensive and 2) produces similar or even better results than the original CNN model, so we decided to use it.
We decided to use two different CNNs, namely VGG-16 and ResNet-50, to extract image features and XGBoost for classification. The architectures are shown below.
The styles we chose for this classification model ranged from the 1400s to 2000s. They were:
Painting Style | Time Period |
---|---|
Renaissance | 1400-1600 |
Baroque | 1600-1750 |
Romanticism | 1800-1850 |
Realism | 1850-1860 |
Impressionism | 1860-1870 |
Art Nouveau | 1880-1910 |
Expressionism | 1905-1920 |
Surrealism | 1910-1920 |
Cubism | 1900-1920 |
Abstract Art | 1940+ |
The artists we chose for this classification model were:
- Ivan Aivazovsky
- Marc Chagall
- Camille Pissarro
- Albrecht Durer
- Vincent Van Gogh
- Paul Cezanne
- Martiros Saryan
- Ivan Shishkin
- Gustave Dore
- Pierre-Auguste Renoir
- Rembrandt
- Pablo Picasso
We chose these artists as they had a minimum of 500 paintings and represented the 10 styles above.
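
The painting-count part of this selection rule can be sketched as follows. This is an assumption about how such a filter might look, not the project's preprocessing code: `select_artists` is a hypothetical helper, the records are made-up placeholders with an assumed `artist` field, and the check that the artists cover the 10 styles is not included.

```python
from collections import Counter


def select_artists(records, min_paintings=500):
    """Return, in sorted order, the artists with at least min_paintings records."""
    counts = Counter(record["artist"] for record in records)
    return sorted(name for name, n in counts.items() if n >= min_paintings)


# Placeholder metadata: one dict per painting.
records = (
    [{"artist": "Artist A"}] * 600
    + [{"artist": "Artist B"}] * 499
    + [{"artist": "Artist C"}] * 500
)
selected = select_artists(records)
print(selected)  # ['Artist A', 'Artist C']
```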