
Implementation of Vector of Locally Aggregated Descriptors (VLAD), reimplemented for monocular visual SLAM

Vector of Locally Aggregated Descriptors (VLAD)


This repository is an implementation of VLAD, which was originally formulated by Hervé Jégou in [1]. The implementation is part of my bachelor's thesis, titled "Computer Vision and Machine Learning for marker-free product identification".

VLAD is an algorithm that allows a user to aggregate local descriptors into a compact, global representation. It's derived from the Bag of Features approach [2] and related to the Fisher vector [3].
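To make the aggregation step concrete, here is a minimal NumPy sketch of the basic idea: assign each local descriptor to its nearest visual word, sum the residuals per visual word, then apply signed square-root (SSR) and L2 normalization. It is an illustration under assumptions, not the code of this repository: the function name `vlad_encode` is hypothetical, and random centroids stand in for a trained vocabulary.

```python
import numpy as np


def vlad_encode(descriptors, centroids):
    """Aggregate local descriptors (m x d) into one VLAD vector of length k * d."""
    # Squared Euclidean distance from every descriptor to every centroid: (m, k)
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)

    k, d = centroids.shape
    v = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[nearest == i]
        if len(assigned):
            # Sum of residuals between the descriptors and their visual word
            v[i] = (assigned - centroids[i]).sum(axis=0)

    v = v.ravel()
    # Signed square-root normalization, then global L2 normalization
    v = np.sign(v) * np.sqrt(np.abs(v))
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


rng = np.random.default_rng(0)
centroids = rng.normal(size=(4, 8))      # assumed: k=4 visual words, d=8 dimensions
descriptors = rng.normal(size=(50, 8))   # assumed: m=50 local descriptors of one image
vlad_vector = vlad_encode(descriptors, centroids)
```

The result always has length k * d, regardless of how many local descriptors the image produced, which is what makes the representation compact and directly comparable between images.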

This repository is a work in progress and should not be considered production-ready. I took some inspiration from Jorjasso's implementation and tried to improve on it. Improved versions of the original formulation, taken from [1, 4, 5], are also implemented, with references inserted in the code.


  • NumPy
  • Scikit-Learn
  • Progressbar2
  • OpenCV (for the examples)




The API is based on the wonderful Scikit-Learn API, which is built around the basic notion of fit/predict/transform.

To include VLAD in the current file just write:

from vlad import VLAD

On initialization, the number of visual words (k) and the normalization scheme are given. The normalization scheme is a crucial difference between the formulations in [1, 4, 5], and the one described in [5] is preferable. To instantiate a VLAD object, write:

vlad = VLAD(k=16, norming="RN")  # Defaults are k=256 and norming="original"
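As a rough illustration of how two of these schemes differ, the sketch below contrasts intra-normalization (as described in [4]) with residual normalization (as described in [5]). The helper names are hypothetical and the code is a simplified reading of those papers, not the implementation behind the `norming` parameter. It assumes a `nearest` array that holds the index of the closest visual word for each descriptor.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale x to unit L2 norm along `axis`; all-zero inputs stay zero."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def intra_normalize(blocks):
    """Intra-normalization: L2-normalize each visual word's block, then the whole vector."""
    return l2_normalize(l2_normalize(blocks, axis=1).ravel())


def residual_normalized_blocks(descriptors, centroids, nearest):
    """Residual normalization: L2-normalize every residual before summing it."""
    blocks = np.zeros_like(centroids, dtype=float)
    for i in range(len(centroids)):
        residuals = descriptors[nearest == i] - centroids[i]
        if len(residuals):
            blocks[i] = l2_normalize(residuals, axis=1).sum(axis=0)
    return blocks


rng = np.random.default_rng(0)
centroids = rng.normal(size=(4, 8))      # assumed toy vocabulary: k=4, d=8
descriptors = rng.normal(size=(50, 8))   # assumed toy descriptors of one image
nearest = ((descriptors[:, None] - centroids[None]) ** 2).sum(axis=2).argmin(axis=1)

# Plain residual sums per visual word, then intra-normalization
raw_blocks = np.stack([
    (descriptors[nearest == i] - centroids[i]).sum(axis=0) for i in range(4)
])
intra_vector = intra_normalize(raw_blocks)

# Residual-normalized sums per visual word, then a global L2 normalization
rn_blocks = residual_normalized_blocks(descriptors, centroids, nearest)
rn_vector = l2_normalize(rn_blocks.ravel())
```

Intra-normalization equalizes the contribution of each visual word, while residual normalization equalizes the contribution of each individual descriptor, so that a few distant descriptors cannot dominate a block.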

After instantiating the object, you can fit the visual vocabulary with:

vlad.fit(X)

The fit function also returns the instance (again, sklearn-style), so the following two are equivalent:

vlad = VLAD(k=16, norming="RN")
vlad.fit(X)
# ...
vlad = VLAD(k=16, norming="RN").fit(X)

X is a tensor of image descriptors (m x d x n), where m is the number of descriptors per image, d is the number of dimensions per descriptor and n is the number of images, i.e. of per-image descriptor matrices. It's best to use descriptors that live in Euclidean space (such as SIFT or RootSIFT [6]) rather than in Hamming space: k-means computes cluster centers as means and minimizes squared Euclidean distances, neither of which is meaningful for binary descriptors.
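For reference, RootSIFT can be derived from ordinary SIFT descriptors in two steps: L1-normalize each descriptor, then take the element-wise square root. The sketch below shows this on synthetic data. The function name `root_sift` and the `eps` guard against division by zero are assumptions for illustration, and the random integers only stand in for real SIFT output.

```python
import numpy as np


def root_sift(descriptors, eps=1e-7):
    """Convert SIFT descriptors (m x d) to RootSIFT: L1-normalize, then element-wise sqrt."""
    descriptors = np.asarray(descriptors, dtype=np.float64)
    l1 = np.abs(descriptors).sum(axis=1, keepdims=True)
    return np.sqrt(descriptors / (l1 + eps))


rng = np.random.default_rng(0)
# Stand-in for real SIFT output: non-negative 128-dimensional descriptors
fake_sift = rng.integers(0, 256, size=(10, 128)).astype(np.float64)
rs = root_sift(fake_sift)
```

After this transformation every descriptor has (approximately) unit L2 norm, and Euclidean distances between RootSIFT vectors correspond to the Hellinger kernel on the original SIFT vectors, so the descriptors remain suitable for k-means.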

Whenever a visual dictionary is fitted, it is saved to disk and can later be loaded manually to bypass training.

To find the stored image that best matches a query image, write:

vlad.predict(imdesc)  # imdesc is a (m x d) descriptor-matrix

to get the image-index with maximum similarity. Alternatively


can be used to obtain a NumPy array with all similarity scores.
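Conceptually, such a lookup compares the VLAD vector of the query image with the VLAD vectors of all stored images. The sketch below assumes that the vectors are L2-normalized and that similarity is measured with the dot product (cosine similarity). The README does not state which measure the class uses, so treat this as an illustration with made-up data rather than a description of the internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical database of 5 L2-normalized VLAD vectors (one row per image)
database = rng.normal(size=(5, 32))
database /= np.linalg.norm(database, axis=1, keepdims=True)

# Query: a slightly perturbed copy of database image 3
query = database[3] + 0.01 * rng.normal(size=32)
query /= np.linalg.norm(query)

scores = database @ query          # one cosine similarity per database image
best_index = int(scores.argmax())  # index of the most similar image
```

Here `scores` plays the role of the array with all similarity scores, and `best_index` the role of the image index with maximum similarity.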

If you want to work with the VLAD descriptors outside of the class, use the transform and fit_transform functions:

vlads = vlad.transform(descriptor_tensor)  # Call on fitted model

vlads = vlad.fit_transform(descriptor_tensor)  # Can be called on a non-fitted model


Documentation can be found at [...] TODO


| Task | Status |
|---|---|
| Original formulation (SSR, L2) [1] | Done |
| Use RootSIFT-descriptors [6] | Done |
| Try with more descriptors | TODO |
| Try with dense descriptors | TODO |
| Intra-Normalization [4] | Done |
| Residual-Normalization (RN) [5] | Done |
| Local Coordinate System (LCS) [5] | Done |
| Dimensionality-Reduction [7, 8] | TODO |
| Quantization [9] | Done |
| Generalization using multiple vocabularies [7] | TODO |
| Make documentation | TODO |
| Include Tests | TODO |
| Include Install-Instructions | TODO |
| Include Usage-Examples | Done |
| Provide example notebooks | Done |


