Segmentation fault
godkillok opened this issue · 10 comments
Running on:
- [x] CPU

Interface:
- [x] Python
training_vectors.shape is (2357720, 100). Training completes, but when I call `index.search(training_vectors[0:10000], 100)`, it always reports "Segmentation fault".
The code is as follows:

```python
import logging
import time
import faiss
from faiss import normalize_L2

(num, d) = training_vectors.shape
t1 = time.time()
nlist = max(5, int(num / 500))
normalize_L2(training_vectors)
quantizer = faiss.IndexFlatIP(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_INNER_PRODUCT)
index.train(training_vectors)
index.nprobe = max(1, int(nlist * 0.6))  # default nprobe is 1, try a few more
index.add(training_vectors)
t2 = time.time()
logging.info('{} times is {}'.format('add and train', t2 - t1))
t1 = time.time()
score, sim_id = index.search(training_vectors[0:10000], 100)  # this line goes wrong
t2 = time.time()
```
I don't see an obvious reason why this could go wrong.
Does the segfault still occur when searching 1,000 vectors instead of 10,000?
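One way to narrow down a crash like this is to search in smaller batches, so the batch that triggers the segfault can be identified. A minimal sketch (the helper name and batch size are illustrative, not part of the Faiss API):

```python
def search_in_batches(index, queries, k, batch_size=1000):
    """Call index.search() on successive slices of `queries`.

    If the process segfaults, the last printed offset identifies
    the batch that triggered it.
    """
    results = []
    for start in range(0, len(queries), batch_size):
        print(f'searching batch starting at {start}')
        results.append(index.search(queries[start:start + batch_size], k))
    return results
```

With a Faiss index, `queries` would be the NumPy array of query vectors, and each entry of `results` is a `(distances, ids)` pair.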
I also get a segmentation fault (also using cpu).
This happens only if the faiss index is a member of an object, for instance here:
```python
import faiss
import numpy as np

class Products:
    def __init__(self, path):
        self.embeddings = np.load(f'{path}/embeddings.npy')  # loads a ~100000x512 float32 array
        quantizer = faiss.IndexFlatIP(512)
        self.index = faiss.IndexIVFFlat(quantizer, 512, 100, faiss.METRIC_L2)
        self.index.train(self.embeddings)
        self.index.add(self.embeddings)

    def find_nearest(self, index, n):
        return self.index.search(self.embeddings[index].reshape(1, -1), n)

p = Products('path/to/the/npy')
p.find_nearest(100, 10)  # segfault happens here
```
When the same code is implemented without a class, there is no segmentation fault:

```python
import faiss
import numpy as np

path = 'path/to/the/npy'
embeddings = np.load(f'{path}/embeddings.npy')  # loads a ~100000x512 float32 array
quantizer = faiss.IndexFlatIP(512)
index = faiss.IndexIVFFlat(quantizer, 512, 100, faiss.METRIC_L2)
index.train(embeddings)
index.add(embeddings)

def find_nearest(i, n):
    return index.search(embeddings[i].reshape(1, -1), n)

find_nearest(100, 10)  # no segfault, works as expected
```
@wuhu This is because the quantizer gets garbage-collected by Python. You could do `self.quantizer = ...` instead, so that it is not GC'd before your `Products` instance is destroyed.
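The lifetime problem can be reproduced without Faiss at all: the SWIG-generated index wrapper holds only a raw C++ pointer to the quantizer, not an owning Python reference. A rough analogue of the two versions above using `weakref` (class names here are invented for illustration):

```python
import weakref

class Quantizer:
    pass

class BrokenProducts:
    def __init__(self):
        quantizer = Quantizer()
        # Like the C++ side of IndexIVFFlat: only a non-owning reference.
        self._quantizer_ref = weakref.ref(quantizer)
        # `quantizer` goes out of scope when __init__ returns.

class FixedProducts:
    def __init__(self):
        self.quantizer = Quantizer()  # owning Python-side reference
        self._quantizer_ref = weakref.ref(self.quantizer)

broken = BrokenProducts()
fixed = FixedProducts()
print(broken._quantizer_ref())  # None: already garbage-collected
print(fixed._quantizer_ref())   # still alive
```

In the Faiss case the dead reference is not a clean `None` but a dangling C++ pointer, which is why the failure mode is a segfault rather than a Python exception.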
This seems to be another case of crashes in Python code. That makes me wonder: what would be the consequences of having the bindings automatically keep references to nested indexes on construction of new indexes (as in, applying the dont_dealloc_me trick)? I would imagine this to be safer and more predictable than requiring users to keep references themselves in Python-land (although I have probably overlooked something).
It is on our TODO list. It requires a SWIG trick that adds that reference whenever an object is passed in by reference to any of a few dozen functions and constructors.
The reasons why we have not done it yet:
- I don't know exactly where in SWIG this has to be done.
- This problem hits only mid-level users. Low-level users know the library and how the references work well enough, and high-level users only use the index_factory, which does not have this problem.
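Until the bindings handle this, a user-side workaround in the spirit of the dont_dealloc_me idea is to patch a constructor so every instance stashes Python references to its arguments. A hedged sketch (the decorator name is invented; for real Faiss classes the patch would go on the SWIG-generated wrapper):

```python
def keep_constructor_refs(cls):
    """Patch cls.__init__ so each instance keeps owning Python references
    to its constructor arguments, preventing premature garbage collection."""
    original_init = cls.__init__

    def patched_init(self, *args, **kwargs):
        self._constructor_refs = (args, kwargs)  # owning references
        original_init(self, *args, **kwargs)

    cls.__init__ = patched_init
    return cls
```

Applied to a class like `faiss.IndexIVFFlat`, this would keep the quantizer alive for as long as the index itself exists, without the caller having to remember `self.quantizer = ...`.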
Sorry for replying so late. I double-checked my code and found the same problem as @wuhu: the segfault was not caused by the line I mentioned.
@godkillok Great, can we close this issue then?
It should at least be documented somewhere while the automatic reference counting is not yet implemented.