/MLIB

M(ulti-dimensional)L(earned)I(ndex)B(ench)

Primary LanguageJupyter Notebook

Evaluating Mtulit-dimensional Learned Indices

This is the source code repo for our emperical study on multi-dimensional learned indices and the Multi-dimensional Learned Indices Benchmark.

Compared Methods

Learned Indices

We compare six recent multi-dimensional learned indices:

  • ZM-Index [1]
  • ML-Index [2]
  • IF-Index [3]
  • RSMI [4]
  • LISA [5]
  • Flood [6]

Non-learned Baselines

  • FullScan: sequential scan
  • R*-tree and bulk-loading R-tree: we use the implementation from boost::geometry
  • kdtree: we use a header-only kdtree implementation nanoflann https://github.com/jlblancoc/nanoflann
  • ANN: another kntree viriant from ANN project http://www.cs.umd.edu/~mount/ANN/
  • Quad-tree: we use the implementation from GEOS
  • Grid: uniform grid (UG) and equal-depth grid (EDG)

Compilation

Step 1: Setup Dependencies

Step 2: Build RSMI and ANN

Most of the benchmark and indices (except RSMI and ANN) are implemented as header-only libraries.

Compile RSMI:

cd indexes/rsmi
mkdir build && cd build
cmake ..
make
[S]

Compile ANN:

cd indexes/ann_1.1.2
make linux-g++
[Failed,暂时没空搞懂原因,直接从官网下载了一个官方版本作为替代]

Step 3: Build Benchmark

Modify the following variables in CMakeLists.txt:

BOOST_ROOT, Boost_INCLUDE_DIR, Boost_LIBRARY_DIR: path to boost
TORCH_PATH: path to libtorch
EXECUTABLE_OUTPUT_PATH: path to compiled benchmark binaries

Compile RSMI benchmark:

mkdir build && cd build
cmake .. -DRSMI=ON
make
[Failed,没有规则可制作目标“/home/metetor/learnedbench/indexes/rsmi/build/librsmi.a,由“/home/metetor/learnedbench/build/bin/bench_rsmi” 需求]
[Checkout:CMakeLists.txt中RSMI和ANN路径错误,并不在/home文件夹下]
[S]

Compile RSMI benchmark with heap profiling enabled:

rm -rf * # clear cmake cache
cmake .. -DRSMI=ON -DPROFILE=ON
make
[Failed,1.bench_rsmi.cpp:55:31: error: ‘DIM’ was not declared in this scope]
[Checkout:将DIM改成BENCH_DIM]
[S]

Compile benchmark for other indices:

rm -rf * # clear cmake cache
cmake ..
make
[S]

Compile benchmark for other indices with heap profiling enabled:

rm -rf * # clear cmake cache
cmake .. -DPROFILE=ON
make
[S]

Run Experiments

We prepare a script to download the real datasets and prepare synthetic datasets:

cd scripts
bash prepare_data.sh
[Failed,1.mkdir语法错误,无法创建目录]
[Checkout,1.mkdir "$PATH",才会解释为变量]
[S]
[Failed,2.真实文件夹无法下载,疑似没有cookie文件]
[Checkout,先将现在真实数据集代码注释掉,只生成合成数据集]
[TODO]
[Try,下载并使用python预处理nyc-tlc的2016-01-2016-06的数据集][s]
[Try,处理FS数据集]
[Try,处理Tronto3D数据集]
[Try,处理OSM数据集]
[Failed,3.generate synthetic data failed]
[Checkout,文件路径错误,CMakeLists.txt中生成执行文件路径错误,and prepare_data.sh中的BENCH_BIN路径错误]
[S]

now we have part of real dataset(nytaxi 201601-201606 1.03GB),and we test on it

We prepare several scripts to run the experiments.

Run experiments on default settings: bash run_exp.sh

[error:Benchmark mli dataset lognormal_20m_2_1 error terminate called after throwing an instance of 'std::logic_error' what(): Points must be increasing by x. ] [ error:benchmark real dataset failed ] Run experiments by varying N: bash run_exp_n.sh

Run experiments by varying dim: bash run_exp_dim.sh

Run experiments by varying eps: bash run_exp_eps.sh

Run experiments of RSMI: bash rsmi.sh

The results are put in /project_root/results, and the figure drawing Jupyter notebooks are put in /project_root/figures.

Reference

[1] Haixin Wang, Xiaoyi Fu, Jianliang Xu, and Hua Lu. 2019. Learned Index for Spatial Queries. In MDM. IEEE, 569–574.

[2] Angjela Davitkova, Evica Milchevski, and Sebastian Michel. 2020. The ML-Index: A Multidimensional, Learned Index for Point, Range, and Nearest-Neighbor Queries. In EDBT. OpenProceedings.org, 407–410.

[3] Ali Hadian, Ankit Kumar, and Thomas Heinis. 2020. Hands-off Model Integration in Spatial Index Structures. In AIDB@VLDB.

[4] Jianzhong Qi, Guanli Liu, Christian S. Jensen, and Lars Kulik. 2020. Effectively Learning Spatial Indices. Proc. VLDB Endow. 13, 11 (2020), 2341–2354.

[5] Pengfei Li, Hua Lu, Qian Zheng, Long Yang, and Gang Pan. 2020. LISA: A Learned Index Structure for Spatial Data. In SIGMOD Conference. ACM, 2119–2133.

[6] Vikram Nathan, Jialin Ding, Mohammad Alizadeh, and Tim Kraska. 2020. Learning Multi-Dimensional Indexes. In SIGMOD Conference. ACM, 985–1000.