Trained a classifier to recognize 3000 images with 15 categories using Bag of Features model and Spatial Pyramid Matching algorithm. Improved accuracy from ~50% to ~70%.
All common python packages are needed (Numpy, Matplotlib,...). We also need OpenCV in this project.
pip install numpy matplotlib opencv-python
We will need to use SIFT built-int function of OpenCV to extract SIFT or SURF features from images, and since version OpenCV 3.x.x no longer includes these functions, one simple way is to downgrade to version 2.x.x. So first uninstall old version if already exists:
pip uninstall opencv-python
Then install this version:
pip install opencv-contrib-python
Now we should be good to use SIFT/SURF descriptor in OpenCV.
Note that we can also use other features descriptor like HOG instead of SIFT.
- Try classify on raw features (accuracy ~18% - 25%)
- Build a SIFT descriptor by constructing histogram of frequencies of "visual words". We find SIFT features for each image which is a 1D vector of size 128, then concatenate all these 1D vectors into a long 1D vector of the whole training set. Use K-means to cluster these data points from this vector into K groups. Now for each image, we have its SIFT features, assign these features into clusters that we've already clustered before, this will represent the histogram representation of frequencies of "visual words" for our image. Do this similarly for the rest of the images to form the features representation of our data before feeding into the classifier model. First, we will try K nearest neighbor. (accuracy ~50% - 60%)
- Build the SIFT histogram representation of "visual words" similarly as mentioned above, now we'll try to use multiclass Support Vector Machines as our classification model and compare the result. (accuracy ~60% - 70%)
Here is the intuition for constructing SIFT descriptor:
Finally, concatenate the 16 histograms together to get the final 128-element SIFT descriptor.
One drawback of Bag of Visual Words is, all local features are encoded into a single code vector ignoring the position of the feature descriptors, which means spatial information between words are discarded in the final code vector. Thus, to incorporate the spatial information into the final code vector, we can apply Spatial Pyramid Matching, a very simple but powerful idea proposed in Lazebnik et al. 2006.