Compute the exact 100 nearest neighbors for the deep1M, deep10M, and deep100M datasets. These neighbors can be used as the ground truth for the nearest neighbor search task on deep{1, 10, 100}M.
Note that the deep{1, 10, 100}M datasets are the top {1, 10, 100}M vectors of the deep1b dataset, respectively.
You can download the precomputed results from https://github.com/matsui528/deep1b_gt/releases/download/v0.1.0/gt.zip:
- deep1M_groundtruth.ivecs
- deep10M_groundtruth.ivecs
- deep100M_groundtruth.ivecs
wget https://github.com/matsui528/deep1b_gt/releases/download/v0.1.0/gt.zip
unzip gt.zip
# The directory structure will be:
# .
# ├── deep100M_groundtruth.ivecs
# ├── deep10M_groundtruth.ivecs
# ├── deep1M_groundtruth.ivecs
# └── gt.zip
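Each .ivecs record is an int32 count (100 here) followed by that many int32 neighbor IDs. Below is a minimal sketch, assuming numpy, of how the ground truth can be loaded and used to evaluate a search result; my_search is a hypothetical placeholder for your own search method, not part of this repository.

```python
# A minimal sketch for loading .ivecs ground truth and computing recall@1.
# `my_search` is a hypothetical placeholder for your own search method.
import numpy as np

def ivecs_read(path):
    """Read an .ivecs file: each record is an int32 count d followed by d int32 values."""
    a = np.fromfile(path, dtype=np.int32)
    d = a[0]
    return a.reshape(-1, d + 1)[:, 1:].copy()

gt = ivecs_read("deep1M_groundtruth.ivecs")  # shape: (num_queries, 100)

# Recall@1: fraction of queries whose top result is the true nearest neighbor
# ids = my_search(queries, k=1)              # shape: (num_queries, 1)
# recall_at_1 = (ids[:, 0] == gt[:, 0]).mean()
```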
git clone https://github.com/matsui528/deep1b_gt.git
cd deep1b_gt
# Download the Deep1B data into ./deep1b. This may take several days. I recommend preparing 2 TB of disk space.
python download_deep1b.py --root ./deep1b
# After downloading the data, the directory structure will be:
# .
# ├── base
# │ ├── base_00
# │ ├── base_01
# │ ...
# │ ├── base_35
# │ └── base_36
# ├── base.fvecs # 388,000,000,000 bytes
# ├── deep1B_groundtruth.ivecs
# ├── deep1B_queries.fvecs
# ├── learn
# │ ├── learn_00
# │ ├── learn_01
# │ ...
# │ ├── learn_12
# │ └── learn_13
# └── learn.fvecs # 139,090,240,000 bytes
# rm -rf ./deep1b/base ./deep1b/learn  # Optionally, you can delete base and learn, which are not needed anymore
# Compute the ground truth. You need faiss:
conda install -c pytorch faiss-cpu
python compute_gt.py --out ./
# You'll get deep1M_groundtruth.ivecs, deep10M_groundtruth.ivecs, and deep100M_groundtruth.ivecs
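For reference, each .fvecs record is an int32 dimension (96 for deep1b) followed by that many float32 values, which is why base.fvecs is 1,000,000,000 × (4 + 96 × 4) = 388,000,000,000 bytes. The following is a rough sketch of how exact 100-NN ground truth can be computed with faiss; it is an illustration, not the actual compute_gt.py, and it assumes the base vectors fit in memory (true for the 1M subset described below, not for the full base.fvecs).

```python
# A rough sketch of exact ground-truth computation with faiss.
# This is an illustration, not the actual compute_gt.py; it assumes the
# base vectors fit in memory (true for the 1M subset described below).
import numpy as np
import faiss

def fvecs_read(path):
    """Read an .fvecs file: each record is an int32 dimension d followed by d float32 values."""
    a = np.fromfile(path, dtype=np.float32)
    d = a[:1].view(np.int32)[0]
    return a.reshape(-1, d + 1)[:, 1:].copy()

queries = fvecs_read("./deep1b/deep1B_queries.fvecs")
base = fvecs_read("./deep1b/deep1M_base.fvecs")  # e.g., the 1M subset (see below)

index = faiss.IndexFlatL2(base.shape[1])  # exact (brute-force) L2 search
index.add(base)
_, ids = index.search(queries, 100)       # exact 100 nearest neighbors per query

# Save in .ivecs format: each row is an int32 count followed by the neighbor IDs
out = np.hstack([np.full((ids.shape[0], 1), ids.shape[1], dtype=np.int32),
                 ids.astype(np.int32)])
out.tofile("deep1M_groundtruth.ivecs")
```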
Because the deep1b dataset is huge, you may want to download only a subset of it (the top 1M vectors). The following script will
- pick up the first 1M vectors from base_00 to construct deep1M_base.fvecs
- pick up the first 100K vectors from learn_00 to construct deep1M_learn.fvecs (a sketch of this extraction is shown after the commands below)
git clone https://github.com/matsui528/deep1b_gt.git
cd deep1b_gt
# Download base_00, learn_00, and query into ./deep1b. This may take a few hours. I recommend preparing 25 GB of disk space.
python download_deep1b.py --root ./deep1b --base_n 1 --learn_n 1 --ops query base learn
# Select the top 1M vectors from base_00 and save them as deep1M_base.fvecs
python pickup_vecs.py --src ./deep1b/base/base_00 --dst ./deep1b/deep1M_base.fvecs --topk 1000000
# Select the top 100K vectors from learn_00 and save them as deep1M_learn.fvecs
python pickup_vecs.py --src ./deep1b/learn/learn_00 --dst ./deep1b/deep1M_learn.fvecs --topk 100000
# After running the above commands, the directory structure will be:
# .
# ├── base
# │ └── base_00
# ├── deep1M_base.fvecs # 388,000,000 bytes
# ├── deep1B_queries.fvecs
# ├── learn
# │ └── learn_00
# └── deep1M_learn.fvecs # 38,800,000 bytes
# rm -rf ./deep1b/base ./deep1b/learn  # Optionally, you can delete base and learn, which are not needed anymore
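For reference, extracting the first N vectors of an .fvecs chunk amounts to copying its first N fixed-size records (a 4-byte int32 dimension plus 96 float32 values each, i.e., 388 bytes per vector). A minimal sketch of that idea, not the actual pickup_vecs.py:

```python
# A minimal sketch of picking up the first `topk` vectors of an .fvecs file.
# This is an illustration, not the actual pickup_vecs.py; it assumes d = 96
# and reads the whole selection (~388 MB for 1M vectors) at once.
def pickup_fvecs(src, dst, topk, d=96):
    record_bytes = 4 + 4 * d  # int32 dimension header + d float32 values
    with open(src, "rb") as f:
        buf = f.read(record_bytes * topk)
    with open(dst, "wb") as f:
        f.write(buf)

# e.g., the same operation as the first pickup_vecs.py command above:
# pickup_fvecs("./deep1b/base/base_00", "./deep1b/deep1M_base.fvecs", topk=1000000)
```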
Several parts of the code are adapted from: