
Cardinality Estimation with Local Deep Learning Models

This repository contains the code to reproduce the local deep learning models from [1][2].
It consists of the following Python scripts:

  • prepare_db.py for setting up the local context in your db and gathering meta information about it.
  • query_processing.py for sampling queries, determining their true cardinalities, and vectorizing them.
  • learn_model.py for training and testing the local deep learning model.
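
All three scripts take their parameters from config.json (described below), so a typical end-to-end run could look like this. This is a sketch of the intended order, not a verified invocation; the exact arguments may differ:

python prepare_db.py        # set up the joined table and gather meta information
python query_processing.py  # sample, label, and vectorize queries
python learn_model.py       # train and test the model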

The first script generates a joined table from the chosen base tables (see config.json). It also gathers the value range of each column and dictionary-encodes all columns of type character. The query processing script samples a given number of queries over your db, keeping only queries with a cardinality larger than 0. After sampling, it vectorizes the queries as proposed in [1]; a sketch of the encoding is given below. To vectorize an existing query_file without sampling, call the script with python query_processing.py -v 1. The final script trains a neural network for cardinality estimation. The communication between these files is illustrated by the flow chart in the repository.
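To give an intuition for the encoding from [1]: each column contributes its normalized lower and upper bound to the query vector. The following is a minimal sketch of this idea, not the actual code from query_processing.py; the column ranges and helper names are hypothetical:

import numpy as np

# Sketch of the range-based query encoding from [1]: every column
# contributes its normalized lower and upper bound, so a query over
# n columns becomes a vector of length 2n. Ranges here are made up.
column_ranges = {"production_year": (1874, 2020), "kind_id": (1, 7)}

def vectorize(predicates):
    # predicates maps column -> (lower, upper); columns without a
    # predicate default to their full range.
    vec = []
    for col, (cmin, cmax) in column_ranges.items():
        lo, hi = predicates.get(col, (cmin, cmax))
        span = cmax - cmin
        vec.extend([(lo - cmin) / span, (hi - cmin) / span])
    return np.array(vec, dtype=np.float32)

print(vectorize({"production_year": (1990, 2000)}))
# -> approx. [0.79 0.86 0.   1.  ]; kind_id keeps its full range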

Additionally, there is a configuration file called config.json that contains all parameters for your setup. The comments below explain each parameter; they are not part of the actual JSON file.

{
    "dbname": "imdb", # name of your db, this will be used as a suffix to name all files
    "vector_file": "vectors_imdb.npy", # file name of the vectorized query output incl. cardinalities
    "query_file": "queries_imdb.csv", # file name of the SQL queries incl. cardinalities
    "model_file": "model.h5", # file name for model, supports .h5 and .pb
    "number_of_queries": 105000, # number of queries to  be sampled
    "tables": ["title", "movie_info"], # tables of the local context...
    "join_ids": [["id", "movie_id"]], # ... and their join keys
    "columns": ["imdb_index", "kind_id", "production_year", 
                "phonetic_code", "episode_of_id", "season_nr", 
                "episode_nr", "info_type_id"], # columns you want to use from the original data
    "model_depth": 2, # number of layers in the neural network
    "model_width": 512, # number of neurons in the first layer (funnel architecture)
    "test_size": 5000, # number of queries in the test sets
    "runs": 5, # number of test runs (k-fold cross-validation)
    "view_name": "tmpview", # name for the intermediate table for query sampling
    "optim": 1 # whether the optimization from [2] should be used
}
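
The model_depth and model_width parameters describe a funnel-shaped, fully connected network. Below is a minimal Keras sketch of such a model, assuming the funnel halves the layer width at each step; the actual architecture is defined in learn_model.py, and input_dim here is a made-up placeholder:

import tensorflow as tf

# Minimal sketch of a funnel-shaped network, using model_depth=2 and
# model_width=512 from the config above. The halving of layer widths is
# an assumption implied by "funnel architecture"; learn_model.py defines
# the actual model.
model_depth = 2      # number of hidden layers
model_width = 512    # neurons in the first hidden layer
input_dim = 16       # hypothetical length of the input query vector

model = tf.keras.Sequential()
width = model_width
for i in range(model_depth):
    if i == 0:
        model.add(tf.keras.layers.Dense(width, activation="relu",
                                        input_shape=(input_dim,)))
    else:
        model.add(tf.keras.layers.Dense(width, activation="relu"))
    width //= 2  # funnel: each layer is half as wide as the previous one
model.add(tf.keras.layers.Dense(1))  # output: the estimated cardinality
model.compile(optimizer="adam", loss="mse")
model.save("model.h5")  # .h5 is one of the formats named in config.json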

References

[1] Woltmann et al., Cardinality estimation with local deep learning models, aiDM@SIGMOD 2019
[2] Woltmann et al., Aggregate-based Training Phase for ML-based Cardinality Estimation, Datenbank-Spektrum 2022

Cite

Please cite our papers if you use this code in your own work:

@inproceedings{woltmann2019localdeep,
  title = {Cardinality estimation with local deep learning models},
  author = {Woltmann, Lucas and Hartmann, Claudio and Thiele, Maik and Habich, Dirk and Lehner, Wolfgang},
  booktitle = {Proceedings of the Second International Workshop on Exploiting Artificial Intelligence Techniques for Data Management},
  series = {aiDM@SIGMOD '19},
  year = {2019}
}

@article{woltmann2022aggregate,
  title={Aggregate-based training phase for ML-based cardinality estimation},
  author={Woltmann, Lucas and Hartmann, Claudio and Habich, Dirk and Lehner, Wolfgang},
  journal={Datenbank-Spektrum},
  pages={1--13},
  year={2022},
  publisher={Springer}
}