alexklibisz/elastiknn

Try caching the query vector's FloatVector segments when computing distance

alexklibisz opened this issue · 2 comments

Background

Here is the l2Distance method from PanamaFloatVectorOps:

public double l2Distance(float[] v1, float[] v2) {
    double sumSqrDiff = 0f;
    int i = 0;
    int bound = species.loopBound(v1.length);
    FloatVector pv1, pv2, pv3;
    for (; i < bound; i += species.length()) {
        pv1 = FloatVector.fromArray(species, v1, i);
        pv2 = FloatVector.fromArray(species, v2, i);
        pv3 = pv1.sub(pv2);
        // For some unknown reason, pv3.mul(pv3) is significantly faster than pv3.pow(2).
        sumSqrDiff += pv3.mul(pv3).reduceLanes(VectorOperators.ADD);
    }
    for (; i < v1.length; i++) {
        float diff = v1[i] - v2[i];
        sumSqrDiff += diff * diff;
    }
    return Math.sqrt(sumSqrDiff);
}

Note how the two FloatVector.fromArray calls in the loop instantiate a FloatVector from an array. We do this over several iterations because each FloatVector must have the species length in order to leverage the SIMD operations.

v1 and pv1 correspond to the query vector. v2 and pv2 correspond to the vector stored in the index. In a typical query, we compute l2Distance for exactly one query vector against many stored vectors.

So, we should be able to memoize or cache an array of segments for v1 and avoid re-copying v1 into pv1 on every call, i.e., skip the FloatVector.fromArray(species, v1, i) call in the method.

It's unclear whether this will have any positive effect, but it's worth a quick try.
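A minimal sketch of the idea, assuming a standalone helper class (the class and method names here are hypothetical, not the actual implementation in this repo): precompute the query vector's segments once, then reuse them against every stored vector, loading only v2 from memory per call.

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class CachedL2 {

    static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Precompute the query vector's SIMD segments once per query.
    static FloatVector[] cacheSegments(float[] v1) {
        int bound = SPECIES.loopBound(v1.length);
        FloatVector[] segments = new FloatVector[bound / SPECIES.length()];
        for (int i = 0, s = 0; i < bound; i += SPECIES.length(), s++) {
            segments[s] = FloatVector.fromArray(SPECIES, v1, i);
        }
        return segments;
    }

    // l2Distance using cached segments for v1; v2 is still loaded per call.
    // v1 is passed along only for the scalar tail loop.
    static double l2Distance(FloatVector[] v1Segments, float[] v1, float[] v2) {
        double sumSqrDiff = 0d;
        int i = 0;
        for (FloatVector pv1 : v1Segments) {
            FloatVector pv2 = FloatVector.fromArray(SPECIES, v2, i);
            FloatVector pv3 = pv1.sub(pv2);
            sumSqrDiff += pv3.mul(pv3).reduceLanes(VectorOperators.ADD);
            i += SPECIES.length();
        }
        for (; i < v1.length; i++) {
            float diff = v1[i] - v2[i];
            sumSqrDiff += diff * diff;
        }
        return Math.sqrt(sumSqrDiff);
    }
}
```

Requires running with --add-modules jdk.incubator.vector. In a query, cacheSegments would be called once, and the cached l2Distance called for each stored vector.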

Deliverables

  • Try caching/memoizing the segments of v1 to avoid re-computing them.

Related Issues

No response

I took a first pass at implementing this in #605. The benchmark looks promising on my M1 Macbook, but not so much on my EC2 r6i.8xlarge instance.

On M1 Macbook:

[info] Benchmark                            Mode  Cnt        Score       Error  Units
[info] CachedL2DistanceBenchmark.baseline  thrpt    6  2584985.052 ±  9945.960  ops/s
[info] CachedL2DistanceBenchmark.cached    thrpt    6  4461915.029 ± 43007.018  ops/s

On EC2 r6i.8xlarge:

[info] Benchmark                            Mode  Cnt        Score        Error  Units
[info] CachedL2DistanceBenchmark.baseline  thrpt    6  2939742.574 ± 273069.456  ops/s
[info] CachedL2DistanceBenchmark.cached    thrpt    6  2698700.860 ±  13063.707  ops/s

It seems that on the M1 we actually benefit from pre-computing the segments and fetching them from memory, but not on the x86 architecture of the EC2 instance.

Closing, as I'm skeptical this would help on the hardware that matters.