
smartDB - Efficient DB for ML Queries

An MVP of an approximate query processing engine that supports SQL queries and returns the results of ML models applied to unstructured input data.

Problem

Design an intelligent DB: pick an unstructured dataset of your choice and write an MVP of a query engine that supports one approximate query. It only needs to support that query, but you should be able to write it in SQL. You MUST support an approximate query. Write a few sentences about your design decisions and what could be done in the long term.

Implementation

Following the referenced paper, this project implements an AIDB structure and its component query engine.

Highlights

  1. Used a three-table schema consisting of
    • ‘Base tables’ that identify the input blobs
    • an ‘ML mapping table’, built on the fly with only what the query needs
    • user-defined ‘metadata tables’
  2. Supported an approximate SQL query
  3. Used stratified sampling to handle very large data and fetch results efficiently (see the sketch after this list)
  4. Implemented lazy parsing of the query, so the ML mapping table is created with only the columns the query requires
  5. Implemented an ML interface supporting multiple ML models
  6. Enabled cache storage and retrieval
  7. Provided an input YAML configuration that lets the user define the schema of the required tables and the ML models available
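
To make the stratified-sampling approach in highlight 3 concrete, here is a minimal sketch (not the project's actual code) of how sampling can drive an approximate aggregate: the ML model is run only on a per-stratum sample, and each stratum's mean is weighted by its population share. The `text` and `class` column names, the `frac` parameter, and the `spam_classifier` callable are assumptions for illustration.

    # Minimal sketch: approximate AVG of an ML model's output via stratified sampling.
    import pandas as pd

    def approx_avg(df: pd.DataFrame, model, strata_col: str = "class",
                   frac: float = 0.1, seed: int = 0) -> float:
        """Estimate AVG(model(text)) by sampling each stratum and weighting
        the per-stratum means by the stratum's share of the population."""
        estimate = 0.0
        for _, stratum in df.groupby(strata_col):
            n = max(1, int(frac * len(stratum)))           # never sample zero rows
            sample = stratum.sample(n=n, random_state=seed)
            preds = sample["text"].map(model)              # run the ML model on the sample only
            estimate += (len(stratum) / len(df)) * preds.mean()
        return estimate

    # Usage (hypothetical): approx_avg(sms_df, spam_classifier)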

Components of the design

  1. Input configuration yaml
  2. CLI:
    • Code that reads the input yaml, stores its components, and triggers query execution.
  3. ComputeStatistics:
    • Interface to compute and support linear aggregation queries.
  4. InputParser:
    • Module to parse and store information from the input sources.
  5. QueryParser:
    • Module to parse and execute queries by calling functions from the ML interface, and finally display the result.
  6. QueryOptimizer:
    • Module that implements the lazy parsing and stratified sampling techniques (see the sketch after this list).
  7. QueryCacher:
    • Module to store and retrieve cached results of recently executed queries. It can also clear the cache.
  8. MLExecutor:
    • An API that runs the ML models (available per config.yaml) on the input data tables as required by the given query and returns their predictions.
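
As an illustration of how QueryOptimizer's lazy parsing and MLExecutor could interact, the rough sketch below extracts only the ML-derived columns a query references and materializes just those into an AUX_ML_MAPPING-style table. The regex-based column extraction, the `blob_id`/`text` columns, and the `models` dictionary are simplifications for illustration, not the actual module code.

    # Rough sketch of lazy parsing: compute only the ML columns the query references.
    import re
    import pandas as pd

    def required_columns(sql: str, ml_columns: set) -> set:
        """Naively collect identifiers in the query that name ML-derived columns."""
        tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql))
        return tokens & ml_columns

    def build_ml_mapping(base: pd.DataFrame, sql: str, models: dict) -> pd.DataFrame:
        """`models` maps an ML column name to a callable applied to the input blob.
        Only the columns required by the query are populated."""
        mapping = base[["blob_id"]].copy()
        for col in required_columns(sql, set(models)):
            mapping[col] = base["text"].map(models[col])   # one model call per needed column
        return mapping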

Inputs and Outputs

  1. config.yaml: the input file supplied by the user

  2. cache.json: created and maintained so that repeated queries can be answered efficiently (see the sketch after this list)

  3. Detailed command-line outputs can be captured with the commands below:

    python .\CLI.py approx > approx_mode_output.txt

    python .\CLI.py testcache > cache_mode_output.txt
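
The sketch below shows one way cache.json could store and retrieve results keyed by the query text; the actual keys and layout used by QueryCacher may differ.

    # Minimal sketch of query-result caching in cache.json, keyed by the query text.
    import json
    from pathlib import Path

    CACHE_FILE = Path("cache.json")

    def load_cache() -> dict:
        return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

    def get_cached(query: str):
        return load_cache().get(query)                     # None on a cache miss

    def store_result(query: str, result) -> None:
        cache = load_cache()
        cache[query] = result
        CACHE_FILE.write_text(json.dumps(cache, indent=2))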

Scope for long-term improvement

  1. QueryParser can be extended to support multiple queries using the ‘sqlparse’ library (see the sketch after this list).

    ComputeStatistics can be extended to support more statistical functions, enabling the DB to support more kinds of queries.

  2. The cache structure can be improved to store more useful AUX_ML_MAPPING tables for better performance; currently, caching is based only on the most recently executed query.

  3. InputParser and MLExecutor can be made more versatile to support multiple input sources.

  4. QueryOptimizer can be extended to support more sampling techniques.
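
As a taste of the first point, the ‘sqlparse’ library already splits and tokenizes multiple statements; the query text below is only an example, not one the current engine supports.

    # Sketch of how sqlparse could back a multi-query QueryParser.
    import sqlparse

    sql = "SELECT AVG(spam) FROM sms WHERE class = 'work'; SELECT COUNT(*) FROM sms;"
    for statement in sqlparse.parse(sql):                  # one Statement object per query
        print(statement.get_type())                        # e.g. 'SELECT'
        print(sqlparse.format(str(statement), reindent=True, keyword_case="upper"))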

Dataset used

Tweaked the SMS data obtained from Kaggle and added a ‘class’ column to be used as metadata (https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset).
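
A hedged sketch of that tweak: the ‘v1’/‘v2’ column names are assumed from the Kaggle download, and the ‘class’ values below are placeholders, not the labels actually used.

    # Sketch of adding a 'class' metadata column to the Kaggle SMS spam CSV.
    import pandas as pd

    df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]   # label + message text
    df.columns = ["label", "text"]
    df["class"] = ["work" if i % 2 else "personal" for i in range(len(df))]  # placeholder metadata
    df.to_csv("sms_with_class.csv", index=False)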

Steps to run on any terminal

  1. Clone the repo into your local workspace
  2. Install the packages listed in requirements.txt
  3. Use:
    • To test the complete execution of the APPROXIMATE query under consideration, run the following on the command line:

      python ./CLI.py approx

    • To test the cache, use cache mode by running the following on the command line:

      python ./CLI.py testcache