
smartDB - Efficient DB for ML Queries

An MVP of an approximate query processing engine that supports SQL queries and returns the results of ML models applied to unstructured input data.

Problem

Design an intelligent DB: pick an unstructured dataset of your choice and write an MVP of a query engine that supports one approximate query. It only needs to support that query, but you should be able to write it in SQL. You MUST support an approximate query. Write a few sentences about your design decisions and what could be done in the long term.

Implementation

Following the referenced paper, this project implements an AIDB structure and its component query engine.

Highlights

  1. Used a three-table schema consisting of
    • ‘Base tables’ that identify the input blobs
    • an ‘ML mapping table’, built on the fly with only what the query needs
    • user-defined ‘metadata tables’
  2. Supported an approximate SQL query
  3. Used stratified sampling to handle very large data and fetch results efficiently (see the sketch after this list)
  4. Implemented lazy parsing of the query, so the ML mapping table is created with only the columns the query requires
  5. Implemented an ML interface supporting multiple ML models
  6. Enabled cache storage and retrieval
  7. Provided an input YAML configuration that lets the user define the schema of the required tables and the ML models available
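
To make the stratified-sampling approach in highlight 3 concrete, here is a minimal sketch (not the project's actual code) of how sampling can drive an approximate aggregate: the ML model is run only on a per-stratum sample, and each stratum's mean is weighted by its population share. The `text` and `class` column names, the `frac` parameter, and the `spam_classifier` callable are assumptions for illustration.

    # Minimal sketch: approximate AVG of an ML model's output via stratified sampling.
    import pandas as pd

    def approx_avg(df: pd.DataFrame, model, strata_col: str = "class",
                   frac: float = 0.1, seed: int = 0) -> float:
        """Estimate AVG(model(text)) by sampling each stratum and weighting
        the per-stratum means by the stratum's share of the population."""
        estimate = 0.0
        for _, stratum in df.groupby(strata_col):
            n = max(1, int(frac * len(stratum)))           # never sample zero rows
            sample = stratum.sample(n=n, random_state=seed)
            preds = sample["text"].map(model)              # run the ML model on the sample only
            estimate += (len(stratum) / len(df)) * preds.mean()
        return estimate

    # Usage (hypothetical): approx_avg(sms_df, spam_classifier)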

Components of the design

  1. Input configuration yaml
  2. CLI:
    • Code that reads the input yaml, stores its components, and triggers query execution.
  3. ComputeStatistics:
    • Interface to compute and support linear aggregation queries.
  4. InputParser:
    • Module to parse and store information from the input sources.
  5. QueryParser:
    • Module to parse and execute queries by calling functions from the ML interface, and finally display the result.
  6. QueryOptimizer:
    • Module that implements the lazy parsing and stratified sampling techniques (see the sketch after this list).
  7. QueryCacher:
    • Module to store and retrieve cached results of recently executed queries. It can also clear the cache.
  8. MLExecutor:
    • An API that runs the ML models (available per config.yaml) on the input data tables as required by the given query and returns their predictions.
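
As an illustration of how QueryOptimizer's lazy parsing and MLExecutor could interact, the rough sketch below extracts only the ML-derived columns a query references and materializes just those into an AUX_ML_MAPPING-style table. The regex-based column extraction, the `blob_id`/`text` columns, and the `models` dictionary are simplifications for illustration, not the actual module code.

    # Rough sketch of lazy parsing: compute only the ML columns the query references.
    import re
    import pandas as pd

    def required_columns(sql: str, ml_columns: set) -> set:
        """Naively collect identifiers in the query that name ML-derived columns."""
        tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql))
        return tokens & ml_columns

    def build_ml_mapping(base: pd.DataFrame, sql: str, models: dict) -> pd.DataFrame:
        """`models` maps an ML column name to a callable applied to the input blob.
        Only the columns required by the query are populated."""
        mapping = base[["blob_id"]].copy()
        for col in required_columns(sql, set(models)):
            mapping[col] = base["text"].map(models[col])   # one model call per needed column
        return mapping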

Inputs and Outputs

  1. config.yaml: the input file supplied by the user

  2. cache.json: created and maintained so that repeated queries can be answered efficiently (see the sketch after this list)

  3. Detailed command-line outputs can be captured with the commands below:

    python .\CLI.py approx > approx_mode_output.txt

    python .\CLI.py testcache > cache_mode_output.txt
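
The sketch below shows one way cache.json could store and retrieve results keyed by the query text; the actual keys and layout used by QueryCacher may differ.

    # Minimal sketch of query-result caching in cache.json, keyed by the query text.
    import json
    from pathlib import Path

    CACHE_FILE = Path("cache.json")

    def load_cache() -> dict:
        return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

    def get_cached(query: str):
        return load_cache().get(query)                     # None on a cache miss

    def store_result(query: str, result) -> None:
        cache = load_cache()
        cache[query] = result
        CACHE_FILE.write_text(json.dumps(cache, indent=2))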

Scope for long-term improvement

  1. QueryParser can be extended to support multiple queries using the ‘sqlparse’ library (see the sketch after this list).

    ComputeStatistics can be extended to support more statistical functions, enabling the DB to support more kinds of queries.

  2. The cache structure can be improved to store more useful AUX_ML_MAPPING tables for better performance; currently, caching is based only on the most recently executed query.

  3. InputParser and MLExecutor can be made more versatile to support multiple input sources.

  4. QueryOptimizer can be extended to support more sampling techniques.
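
As a taste of the first point, the ‘sqlparse’ library already splits and tokenizes multiple statements; the query text below is only an example, not one the current engine supports.

    # Sketch of how sqlparse could back a multi-query QueryParser.
    import sqlparse

    sql = "SELECT AVG(spam) FROM sms WHERE class = 'work'; SELECT COUNT(*) FROM sms;"
    for statement in sqlparse.parse(sql):                  # one Statement object per query
        print(statement.get_type())                        # e.g. 'SELECT'
        print(sqlparse.format(str(statement), reindent=True, keyword_case="upper"))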

Dataset used

Tweaked the SMS data obtained from Kaggle and added a ‘class’ column to be used as metadata (https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset).
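
A hedged sketch of that tweak: the ‘v1’/‘v2’ column names are assumed from the Kaggle download, and the ‘class’ values below are placeholders, not the labels actually used.

    # Sketch of adding a 'class' metadata column to the Kaggle SMS spam CSV.
    import pandas as pd

    df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]   # label + message text
    df.columns = ["label", "text"]
    df["class"] = ["work" if i % 2 else "personal" for i in range(len(df))]  # placeholder metadata
    df.to_csv("sms_with_class.csv", index=False)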

Steps to run on any terminal

  1. Clone the repo into your local workspace
  2. Install the packages listed in requirements.txt
  3. Use:
    • To test the complete execution of the APPROXIMATE query under consideration, run the following on the command line:

      python ./CLI.py approx

    • To test the cache, use cache mode by running the following on the command line:

      python ./CLI.py testcache