Help with PCA/Spark docs
rcompton opened this issue · 5 comments
In the LOPQ Training section at https://github.com/yahoo/lopq/blob/master/spark/README.md I see:
> Here is an example of training a full model from scratch and saving the model parameters as both a pickle file and a protobuf file:

```
spark-submit train_model.py \
    --data /hdfs/path/to/data \
    --V 16 \
    --M 8 \
    --model_pkl /hdfs/output/path/model.pkl \
    --model_proto /hdfs/output/path/model.lopq
```
But above that, in the PCA Training section, I see:
> A necessary preprocessing step for training is to PCA and variance balance the raw data vectors to produce the LOPQ data vectors, i.e. the vectors that LOPQ will quantize.
It's not clear to me how the outputs of the `train_pca.py` script are supposed to feed into the `train_model.py` script. Am I supposed to use the results of `train_pca.py` to do the variance balancing myself and then feed that into `train_model.py`, or does "training a full model from scratch" take care of that step for me?
Hi Ryan,
I think the answer to your question is exactly the https://github.com/yahoo/lopq/blob/master/spark/pca_preparation.py script, which illustrates how to prepare the PCA parameters before using them in the LOPQ pipeline.
The `eigenvalue_allocation` function does the balancing across as many subspaces as you need; in the script it is set to 2 for the multi-index.
I agree that the documentation could be clearer about this.
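For reference, the preparation step boils down to something like this sketch (the variable names `mu`, `P`, `E` and the random stand-in values are just illustrative, and I'm assuming `eigenvalue_allocation` is importable from `lopq.model`; the actual script may differ in details):

```python
import numpy as np
from lopq.model import eigenvalue_allocation  # assumed import path

# Illustrative stand-ins for the outputs of train_pca.py: a mean vector,
# a PCA rotation matrix, and the eigenvalues of the retained dimensions.
D = 128
mu = np.zeros(D)
P = np.eye(D)
E = np.sort(np.random.rand(D))[::-1]

# Balance variance across the two coarse subspaces of the multi-index by
# permuting the PCA dimensions according to eigenvalue_allocation.
permuted_inds = eigenvalue_allocation(2, E)
P = P[:, permuted_inds]

# (mu, P) is now what gets applied to every raw vector, both before training
# with train_model.py and before each query at search time.
```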
To be clear, prior to training and prior to search it's recommended that I apply:
```python
import numpy as np

def apply_PCA(x, mu, P):
    """
    Example of applying PCA: center on mu, then rotate by P.
    """
    return np.dot(x - mu, P)
```
to each vector, where `P` has already been permuted via `permuted_inds = eigenvalue_allocation(2, E)`?
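On the Spark side I imagine that looks roughly like the following (the paths, pickle keys, and RDD layout are my guesses, not from the repo):

```python
import pickle
import numpy as np
from pyspark import SparkContext

def apply_PCA(x, mu, P):
    """Center on mu and rotate into the variance-balanced PCA basis."""
    return np.dot(x - mu, P)

sc = SparkContext(appName='prepare_lopq_data')

# Hypothetical: load the prepared PCA parameters, with P already permuted
# via eigenvalue_allocation; the file name and dict keys are guesses.
with open('pca_params.pkl', 'rb') as f:
    params = pickle.load(f)
mu, P = params['mu'], params['P']

# Hypothetical: an RDD of (id, raw numpy vector) pairs.
data = sc.pickleFile('/hdfs/path/to/raw_data')

# Produce the LOPQ data vectors and save them where train_model.py's
# --data argument can pick them up.
lopq_data = data.mapValues(lambda x: apply_PCA(x, mu, P))
lopq_data.saveAsPickleFile('/hdfs/path/to/data')
```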
Is this global PCA step in the paper?
Yes, you are right!
This is not explained (nor used) in the CVPR paper. Ge et al. mention it in the Optimized Product Quantization PAMI paper. The effect is small for SIFT features, since they are generally already variance-balanced in those datasets.
However, if you want to use LOPQ with dimensionality-reduced CNN features (which has been shown to be good practice, e.g. in the "Neural Codes for Image Retrieval" paper by Babenko et al.), permuting the dimensions with `eigenvalue_allocation` gives a big boost in performance.
Great, thanks!