Scardina: Scalable Join Cardinality Estimator

Prerequisites

All experiments can be run in a Docker container. The host needs:

- Docker
- GPU/CUDA environment (for training)

Getting Started

Setup

Dependencies are installed automatically when the Docker image is built.

# on host
git clone https://github.com/OnizukaLab/Scardina.git
cd Scardina
docker build -t scardina .
docker run --rm --gpus all -v `pwd`:/workspaces/scardina -it scardina bash

# in container
poetry shell

# in poetry env in container
./scripts/dowload_imdb.sh

Examples

Training

Train either with hyperparameter search by Optuna or with manually specified parameters.

# train w/ hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp --n-trials=10 -e=20

# train w/o hyperparameter search
python scardina/run.py --train -d=imdb -t=mlp -e=20 --d-word=64 --d-ff=256 --n-ff=4 --lr=5e-4

Evaluation

# evaluation
# Note: With the default schema strategy (-s=cin), the model path should look like:
#       "models/imdb/mlp-cin/yyyyMMddHHmmss/nar-mlp-imdb-{}-yyyyMMddHHmmss.pt".
#       The "{}" is literally "{}", a placeholder string that lets one path specify multiple models.
python scardina/run.py --eval -d=imdb -b=job-light -t=mlp -m={path/to/model.pt}

After a run, the results can be found in results/<benchmark_name>.
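
As a rough illustration of the "{}" placeholder in the model path, the sketch below expands one template into one concrete path per model. It assumes, without confirmation from this README, that the placeholder is filled with a per-subschema name. `expand_model_paths`, the timestamp, and the table names are hypothetical and not part of scardina.

```python
def expand_model_paths(template: str, names: list[str]) -> list[str]:
    """Return one concrete path per name by filling the literal "{}" placeholder."""
    if "{}" not in template:
        # No placeholder: the template already names a single model file.
        return [template]
    return [template.replace("{}", name) for name in names]


# Hypothetical template and subschema names, for illustration only.
template = "models/imdb/mlp-cin/20230101000000/nar-mlp-imdb-{}-20230101000000.pt"
for path in expand_model_paths(template, ["title", "cast_info"]):
    print(path)
```

When passing such a path on the command line, quoting it (for example -m="models/.../nar-mlp-imdb-{}-....pt") keeps the shell from touching the braces.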

Options

Common Options

-d/--dataset: Dataset name
-t/--model-type: Internal model type (mlp for MLP or trm for Transformer)
-s/--schema-strategy: Internal subschema type (cin for Closed In-neighborhood Partitioning (Scardina) or ur for Universal Relation)
--seed: Random seed (Default: 1234)
--n-blocks: The number of blocks (for Transformer)
--n-heads: The number of heads (for Transformer)
--d-word: Embedding dimension
--d-ff: Width of feedforward networks
--n-ff: The number of feedforward networks (for MLP)
--fact-threshold: Column factorization threshold (Default: 2000)
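
To make --fact-threshold more concrete, here is a minimal sketch of column factorization: a column whose domain is larger than the threshold is split into several sub-columns with small domains. The bit-chunking scheme shown is an assumption modeled on the general idea used in neurocard-style estimators; it is not taken from the scardina source, and `factorize` and `defactorize` are hypothetical names.

```python
def chunk_bits(threshold: int) -> int:
    """Widest bit width whose value range (2**bits) does not exceed the threshold."""
    return max(1, threshold.bit_length() - 1)


def factorize(code: int, domain_size: int, threshold: int = 2000) -> list[int]:
    """Split an integer-coded value into sub-column codes, most significant first."""
    if domain_size <= threshold:
        return [code]  # small domains stay as a single column
    bits = chunk_bits(threshold)
    total_bits = max(1, (domain_size - 1).bit_length())
    n_chunks = -(-total_bits // bits)  # ceiling division
    mask = (1 << bits) - 1
    return [(code >> (bits * i)) & mask for i in reversed(range(n_chunks))]


def defactorize(chunks: list[int], threshold: int = 2000) -> int:
    """Recombine sub-column codes into the original integer code."""
    bits = chunk_bits(threshold)
    code = 0
    for chunk in chunks:
        code = (code << bits) | chunk
    return code


# A column with 1,000,000 distinct values becomes two sub-columns of at most 1024 values each.
parts = factorize(123456, domain_size=1_000_000)
print(parts, defactorize(parts))
```

Under this scheme a smaller threshold yields more, narrower sub-columns, which trades embedding table size for a longer autoregressive sequence.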

Options for Training

-e/--epochs: Number of training epochs
--batch-size: Batch size (Default: 1024)

(w/ hyperparameter search)

--n-trials: The number of trials for hyperparameter search

(w/ specified parameters)

--lr: Learning rate
--warmups: Number of warm-up epochs (for Transformer); --lr and --warmups are mutually exclusive
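
One plausible reading of why --lr and --warmups are exclusive is that a warm-up schedule computes the learning rate itself, so a fixed rate has nothing left to set. The sketch below uses the inverse-square-root warm-up schedule from the original Transformer paper, counted in optimizer steps. Whether scardina uses this exact formula, and how it converts warm-up epochs into steps, is an assumption; `warmup_lr` is a hypothetical helper.

```python
def warmup_lr(step: int, d_model: int, warmup_steps: int) -> float:
    """Learning rate that rises linearly for warmup_steps, then decays as 1/sqrt(step)."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# Illustrative values only: embedding dimension 64, 4000 warm-up steps.
for step in (1, 1000, 4000, 16000):
    print(step, warmup_lr(step, d_model=64, warmup_steps=4000))
```

The rate peaks at step == warmup_steps, and a larger embedding dimension lowers the whole curve.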

Options for Evaluation

-m/--model: Path to model
-b/--benchmark: Benchmark name
--eval-sample-size: Sample size for evaluation

Choices

Datasets
- IMDb
  - imdb: (almost) All data of IMDb
  - imdb-job-light: Subset of IMDb for JOB-light benchmark
Benchmarks
- IMDb
  - job-light: 70 real-world queries
  - job-m: 113 real-world queries
  - job-light_subqueries: 70 real-world queries for evaluating P-Error (requires a database)
  - job-m_subqueries: 113 real-world queries for evaluating P-Error (requires a database)
Models
- mlp: MLP-based denoising autoencoder
- trm: Transformer-based denoising autoencoder

Reference

@article{scardina,
    author = {Ito, Ryuichi and Sasaki, Yuya and Xiao, Chuan and Onizuka, Makoto},
    title = {{Scardina: Scalable Join Cardinality Estimation by Multiple Density Estimators}},
    journal = {{arXiv preprint arXiv:2303.18042}},
    year = {2023}
}

Acknowledgement

Parts of the source code are based on naru/neurocard.