matsui528/rii

Rii with a billion-scale dataset: configuration/reconfiguration crashes the kernel

I have a Rii object holding a 3.3-billion-vector dataset, batch-loaded with add(update_posting_lists=False). At the end I ran reconfigure(), but that step crashes the Python kernel. I also tried calling reconfigure() immediately after each add(), but it took so long that I couldn't tell whether it was working or had hung. Looking at the code, I saw some comments about large memory consumption. Is there an alternative way to do this without crashing?

Please share the minimal code needed to reproduce the error.

import rii
import pickle 
import numpy as np
import nanopq

N, Nt, D = 3_300_000_000, 660_000, 75

# Allocating all N vectors at once fails (np.random.random returns float64 before the cast):
# MemoryError: Unable to allocate 1.80 TiB for an array with shape (3_300_000_000, 75) and data type float64
# X = np.random.random((N, D)).astype(np.float32)  # 3_300_000_000 75-dim vectors to be searched

Xt = np.random.random((Nt, D)).astype(np.float32)  # 660_000 75-dim vectors for training
q = np.random.random((D,)).astype(np.float32)  # a 75-dim vector

# Prepare a PQ/OPQ codec with M=5 sub spaces
codec = nanopq.PQ(M=5).fit(vecs=Xt)  # Trained using Xt

# Instantiate a Rii class with the codec
e = rii.Rii(fine_quantizer=codec)

# Batch-add vectors: 1_000_000 x 3_300 batches = 3.3 billion
# In reality the data is loaded from Parquet files
for i in range(3_300):
    X = np.random.random((1_000_000, D)).astype(np.float32)
    e.add(vecs=X, update_posting_lists=False)
    # e.reconfigure()  # takes longer and longer as the loop advances

e.reconfigure()  # crashes the Python kernel

# Search
ids, dists = e.query(q=q, topk=3)
print(ids, dists)  # e.g., [7484 8173 1556] [15.06257439 15.38533878 16.16935158]

What is the best way to load a large, billion-scale dataset like this?

When I downsampled from 3.3 billion to 1 billion vectors, it didn't fail right away: reconfigure() succeeded with the minimal code above. However, the 1-billion model crashes in the query() step, at self.impl_cpp.query_ivf(q_, topk, tids, L).
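
For reference, the 1.80 TiB figure in the MemoryError above can be checked with some quick arithmetic, and the same arithmetic gives a rough lower bound on what the PQ codes alone occupy. This is only a back-of-the-envelope sketch. It assumes nanopq's default of Ks=256 (one uint8 code per sub-space), and it deliberately ignores the posting lists and any temporary buffers that reconfigure() or query() may allocate, so the real footprint will be larger.

```python
# Rough memory estimate for the sizes used in the snippet above.
# Assumption: Ks=256, so each of the M sub-space codes fits in one byte.
N, D, M = 3_300_000_000, 75, 5


def to_tib(num_bytes):
    """Convert bytes to tebibytes (2**40 bytes)."""
    return num_bytes / 2**40


def to_gib(num_bytes):
    """Convert bytes to gibibytes (2**30 bytes)."""
    return num_bytes / 2**30


raw_f64 = N * D * 8  # np.random.random returns float64 before .astype()
raw_f32 = N * D * 4  # the same vectors stored as float32
pq_codes = N * M * 1  # PQ codes only, one byte per sub-space

print(f"raw float64: {to_tib(raw_f64):.2f} TiB")  # 1.80 TiB, as in the MemoryError
print(f"raw float32: {to_tib(raw_f32):.2f} TiB")  # 0.90 TiB
print(f"PQ codes:    {to_gib(pq_codes):.1f} GiB")  # 15.4 GiB
```

So the compressed codes themselves should fit in memory on a large machine. Whatever is exhausting memory in reconfigure() is more likely to be the additional structures built on top of them, though that is a guess until it is profiled.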