Compute the exact 100 nearest neighbors for the deep1M, deep10M, and deep100M datasets. These neighbors can be used as the ground truth for the nearest neighbor search task on deep{1, 10, 100}M.
Note that the deep{1, 10, 100}M datasets are the top {1, 10, 100}M vectors of the deep1b dataset, respectively.
You can download the precomputed results from https://github.com/matsui528/deep1b_gt/releases/download/v0.1.0/gt.zip:
- deep1M_groundtruth.ivecs
- deep10M_groundtruth.ivecs
- deep100M_groundtruth.ivecs
wget https://github.com/matsui528/deep1b_gt/releases/download/v0.1.0/gt.zip
unzip gt.zip
# The directory structure will be:
# .
# ├── deep100M_groundtruth.ivecs
# ├── deep10M_groundtruth.ivecs
# ├── deep1M_groundtruth.ivecs
# └── gt.zip
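Each .ivecs record is an int32 count (100 here) followed by that many int32 neighbor IDs. Below is a minimal sketch, assuming numpy, of how the ground truth can be loaded and used to evaluate a search result; my_search is a hypothetical placeholder for your own search method, not part of this repository.

```python
# A minimal sketch for loading .ivecs ground truth and computing recall@1.
# `my_search` is a hypothetical placeholder for your own search method.
import numpy as np

def ivecs_read(path):
    """Read an .ivecs file: each record is an int32 count d followed by d int32 values."""
    a = np.fromfile(path, dtype=np.int32)
    d = a[0]
    return a.reshape(-1, d + 1)[:, 1:].copy()

gt = ivecs_read("deep1M_groundtruth.ivecs")  # shape: (num_queries, 100)

# Recall@1: fraction of queries whose top result is the true nearest neighbor
# ids = my_search(queries, k=1)              # shape: (num_queries, 1)
# recall_at_1 = (ids[:, 0] == gt[:, 0]).mean()
```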
git clone https://github.com/matsui528/deep1b_gt.git
cd deep1b_gt
# Download the Deep1B data into ./deep1b. This may take several days. I recommend preparing 2 TB of disk space.
python download_deep1b.py --root ./deep1b
# After downloading the data, the directory structure will be:
# .
# ├── base
# │ ├── base_00
# │ ├── base_01
# │ ...
# │ ├── base_35
# │ └── base_36
# ├── base.fvecs # 388,000,000,000 bytes
# ├── deep1B_groundtruth.ivecs
# ├── deep1B_queries.fvecs
# ├── learn
# │ ├── learn_00
# │ ├── learn_01
# │ ...
# │ ├── learn_12
# │ └── learn_13
# └── learn.fvecs # 139,090,240,000 bytes
# rm -rf ./deep1b/base ./deep1b/learn  # Optionally, you can delete base and learn, which are not needed anymore
# Compute the ground truth. You need faiss:
conda install -c pytorch faiss-cpu
python compute_gt.py --out ./
# You'll get deep1M_groundtruth.ivecs, deep10M_groundtruth.ivecs, and deep100M_groundtruth.ivecs
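For reference, each .fvecs record is an int32 dimension (96 for deep1b) followed by that many float32 values, which is why base.fvecs is 1,000,000,000 × (4 + 96 × 4) = 388,000,000,000 bytes. The following is a rough sketch of how exact 100-NN ground truth can be computed with faiss; it is an illustration, not the actual compute_gt.py, and it assumes the base vectors fit in memory (true for the 1M subset described below, not for the full base.fvecs).

```python
# A rough sketch of exact ground-truth computation with faiss.
# This is an illustration, not the actual compute_gt.py; it assumes the
# base vectors fit in memory (true for the 1M subset described below).
import numpy as np
import faiss

def fvecs_read(path):
    """Read an .fvecs file: each record is an int32 dimension d followed by d float32 values."""
    a = np.fromfile(path, dtype=np.float32)
    d = a[:1].view(np.int32)[0]
    return a.reshape(-1, d + 1)[:, 1:].copy()

queries = fvecs_read("./deep1b/deep1B_queries.fvecs")
base = fvecs_read("./deep1b/deep1M_base.fvecs")  # e.g., the 1M subset (see below)

index = faiss.IndexFlatL2(base.shape[1])  # exact (brute-force) L2 search
index.add(base)
_, ids = index.search(queries, 100)       # exact 100 nearest neighbors per query

# Save in .ivecs format: each row is an int32 count followed by the neighbor IDs
out = np.hstack([np.full((ids.shape[0], 1), ids.shape[1], dtype=np.int32),
                 ids.astype(np.int32)])
out.tofile("deep1M_groundtruth.ivecs")
```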
Because the deep1b dataset is huge, you may want to download only a subset of it (the top 1M vectors). The following script will
- pick up the first 1M vectors from base_00 to construct deep1M_base.fvecs
- pick up the first 100K vectors from learn_00 to construct deep1M_learn.fvecs (a sketch of this extraction is shown after the commands below)
git clone https://github.com/matsui528/deep1b_gt.git
cd deep1b_gt
# Download base_00, learn_00, and query into ./deep1b. This may take a few hours. I recommend preparing 25 GB of disk space.
python download_deep1b.py --root ./deep1b --base_n 1 --learn_n 1 --ops query base learn
# Select the top 1M vectors from base_00 and save them as deep1M_base.fvecs
python pickup_vecs.py --src ./deep1b/base/base_00 --dst ./deep1b/deep1M_base.fvecs --topk 1000000
# Select the top 100K vectors from learn_00 and save them as deep1M_learn.fvecs
python pickup_vecs.py --src ./deep1b/learn/learn_00 --dst ./deep1b/deep1M_learn.fvecs --topk 100000
# After running the above commands, the directory structure will be:
# .
# ├── base
# │ └── base_00
# ├── deep1M_base.fvecs # 388,000,000 bytes
# ├── deep1B_queries.fvecs
# ├── learn
# │ └── learn_00
# └── deep1M_learn.fvecs # 38,800,000 bytes
# rm -rf ./deep1b/base ./deep1b/learn  # Optionally, you can delete base and learn, which are not needed anymore
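For reference, extracting the first N vectors of an .fvecs chunk amounts to copying its first N fixed-size records (a 4-byte int32 dimension plus 96 float32 values each, i.e., 388 bytes per vector). A minimal sketch of that idea, not the actual pickup_vecs.py:

```python
# A minimal sketch of picking up the first `topk` vectors of an .fvecs file.
# This is an illustration, not the actual pickup_vecs.py; it assumes d = 96
# and reads the whole selection (~388 MB for 1M vectors) at once.
def pickup_fvecs(src, dst, topk, d=96):
    record_bytes = 4 + 4 * d  # int32 dimension header + d float32 values
    with open(src, "rb") as f:
        buf = f.read(record_bytes * topk)
    with open(dst, "wb") as f:
        f.write(buf)

# e.g., the same operation as the first pickup_vecs.py command above:
# pickup_fvecs("./deep1b/base/base_00", "./deep1b/deep1M_base.fvecs", topk=1000000)
```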
Several parts of the code are adapted from: