Rii with billion-scale dataset: reconfigure() crashes the kernel
I have a Rii object holding a 3.3-billion-vector dataset, which was batch loaded using add(update_posting_lists=False). At the end I ran reconfigure(), but the reconfigure step crashes the Python kernel. I also tried adding a batch and then reconfiguring immediately, but it was taking forever and I couldn't tell whether it was working or had hung. I was looking at the code and saw some comments about large memory consumption. Is there an alternative way to do this without crashing?
Let me know the minimum code to reproduce the error.
import rii
import pickle
import numpy as np
import nanopq
N, Nt, D = 3_300_000_000, 660_000, 75
#MemoryError: Unable to allocate 1.80 TiB for an array with shape (3_300_000_000, 75) and data type float64
# X = np.random.random((N, D)).astype(np.float32) # 3_300_000_000 75-dim vectors to be searched
Xt = np.random.random((Nt, D)).astype(np.float32) # 660_000 75-dim vectors for training
q = np.random.random((D,)).astype(np.float32) # a 75-dim vector
# Prepare a PQ/OPQ codec with M=5 sub spaces
codec = nanopq.PQ(M=5).fit(vecs=Xt) # Trained using Xt
# Instantiate a Rii class with the codec
e = rii.Rii(fine_quantizer=codec)
# Batch Add vectors - 1_000_000 x 3_300 times = 3.3 Billion
# In reality data is loaded from Parquet files
for i in range(3_300):
    X = np.random.random((1_000_000, D)).astype(np.float32)
    e.add(vecs=X, update_posting_lists=False)
    # e.reconfigure() ## takes longer and longer as the loop advances
e.reconfigure() # Crashes
# Search
ids, dists = e.query(q=q, topk=3)
print(ids, dists) # e.g., [7484 8173 1556] [15.06257439 15.38533878 16.16935158]
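For scale, here is a rough back-of-envelope check of the sizes involved in the script above. It is only a sketch: the 1.80 TiB figure comes from the MemoryError message, while the PQ code size assumes the nanopq default of Ks=256 (one uint8 code per sub-space). It does not account for whatever posting lists or temporary buffers reconfigure() allocates, which are not measured here.

```python
# Back-of-envelope memory estimate for the dataset in the script above.
N, D, M = 3_300_000_000, 75, 5   # vectors, dimensions, PQ sub-spaces (from the script)

TIB = 2**40
GIB = 2**30


def array_bytes(n_rows, n_cols, bytes_per_item):
    """Size in bytes of a dense (n_rows, n_cols) array."""
    return n_rows * n_cols * bytes_per_item


raw_f64 = array_bytes(N, D, 8)   # np.random.random returns float64 before .astype
raw_f32 = array_bytes(N, D, 4)   # the same array after casting to float32
pq_codes = array_bytes(N, M, 1)  # assumption: Ks=256, so one uint8 code per sub-space

print(f"float64 raw vectors: {raw_f64 / TIB:.2f} TiB")  # matches the 1.80 TiB in the MemoryError
print(f"float32 raw vectors: {raw_f32 / TIB:.2f} TiB")
print(f"PQ codes only:       {pq_codes / GIB:.2f} GiB")
```

So the compressed codes by themselves should fit on a large-memory machine, which suggests the extra memory is being used during the reconfigure step rather than by the codes.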
What is the best way to load a large billion-scale dataset like this?
When I downsampled from 3.3 billion to 1 billion vectors, it didn't error right away. I was able to reconfigure using the minimal code above, but the 1-billion model crashes in the query() step, at self.impl_cpp.query_ivf(q_, topk, tids, L).
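One possible workaround, sketched here as an assumption and not as a documented Rii feature, would be to split the data across several smaller Rii instances that share the same trained codec, query each one, and merge the results. The helper below shows only the merge step in plain Python. The name merge_topk, the shard offsets, and the sample results are all hypothetical. It assumes each shard returns shard-local ids and that distances are comparable across shards because the codec is shared.

```python
import heapq


def merge_topk(shard_results, shard_offsets, topk):
    """Merge per-shard (ids, dists) results into one global top-k list.

    shard_results: list of (ids, dists) pairs, one per shard, with shard-local ids
    shard_offsets: global id of the first vector stored in each shard
    """
    candidates = []
    for (ids, dists), offset in zip(shard_results, shard_offsets):
        for local_id, dist in zip(ids, dists):
            # Convert the shard-local id to a global id
            candidates.append((float(dist), offset + int(local_id)))
    best = heapq.nsmallest(topk, candidates)  # smaller distance is better
    return [gid for _, gid in best], [d for d, _ in best]


# Hypothetical results from two shards of 1_000_000 vectors each
results = [([7, 3], [0.5, 0.9]), ([2, 8], [0.4, 1.2])]
ids, dists = merge_topk(results, shard_offsets=[0, 1_000_000], topk=3)
print(ids, dists)  # [1000002, 7, 3] [0.4, 0.5, 0.9]
```

In practice each (ids, dists) pair would come from a separate e.query(q=q, topk=topk) call on one shard. Whether a sharded setup like this avoids the crash is untested.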