hsql

An SQL engine on top of Hadoop

There are 4 files:

  1. lexer.py : tokenizes (lexes) the SQL queries
  2. parser.py : parses the SQL queries
  3. main.py : runs the interactive interpreter (like python!), in which you can run SQL queries
  4. implemenation.py : contains the implementations of the SQL queries and some helper functions (must be completed)

The SQL queries currently supported:

  1. select
  2. use
  3. drop
  4. load
  5. create database
  6. schema (not part of standard SQL; added to view the schema of databases and tables)
  7. current database (also not part of standard SQL; added to show the currently selected database)
  8. exit() or quit() (to quit the interpreter)
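
To give a rough idea of what lexing these statements involves, here is a minimal sketch of a tokenizer. It uses only the standard `re` module as a stand-in, not ply, so it is not the code in lexer.py. The token names, the keyword set and the `tokenize` function are assumptions made for illustration.

```python
import re

# Keywords taken from the list of supported queries above.
KEYWORDS = {"select", "use", "drop", "load", "create",
            "database", "schema", "current", "exit", "quit"}

TOKEN_RE = re.compile(r"""
    (?P<NAME>[A-Za-z_][A-Za-z0-9_]*)   # keywords and identifiers
  | (?P<NUMBER>\d+)                    # integer literals
  | (?P<STRING>'[^']*')                # single-quoted strings
  | (?P<SYMBOL>[(),*;=/.])             # punctuation
  | (?P<SKIP>\s+)                      # whitespace, dropped
""", re.VERBOSE)


def tokenize(query):
    """Return a list of (kind, text) pairs for one query string."""
    tokens = []
    pos = 0
    while pos < len(query):
        match = TOKEN_RE.match(query, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {query[pos]!r} at position {pos}")
        kind = match.lastgroup
        text = match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        # Keywords are matched case-insensitively and normalized to lowercase.
        if kind == "NAME" and text.lower() in KEYWORDS:
            kind = "KEYWORD"
            text = text.lower()
        tokens.append((kind, text))
    return tokens
```

For example, `tokenize("use mydb")` yields a `KEYWORD` token for `use` followed by a `NAME` token for `mydb`. A parser would then decide which statement the token sequence represents.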

What are the requirements?

hadoop, python3, ply

How to install ply?

pip3 install ply

How to run the interpreter?

python3 main.py

What's currently working?

  1. use
  2. create database
  3. load (partially)
  4. drop
  5. schema
  6. current database

What must be done?

  1. Make it work on Hadoop (create and delete files/folders in Hadoop). It currently works on the local file system, not on Hadoop. remove() in implementation.py may have to change. This can probably be done last.
  2. Implement load completely. It currently only writes the metadata (schema info) into database_name.schema. It must split the CSV file into columns and store each column as a separate file. All addresses are already passed into the load function; it must be completed.
  3. Implement select
  4. Implement aggregate functions MAX, COUNT, SUM
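
The missing part of load (item 2) could look roughly like the sketch below, which works on the local file system only. The function name `split_csv_into_columns`, its parameters, the absence of a header row in the CSV, and the one-value-per-line format of the column files are all assumptions, not the existing code in the repository.

```python
import csv
import os
from contextlib import ExitStack


def split_csv_into_columns(csv_path, table_dir, column_names):
    """Write each CSV column to its own file in table_dir, one value per line.

    Assumes the CSV has no header row and that column_names (taken from the
    schema) lists the columns in the same order as the CSV fields.
    """
    os.makedirs(table_dir, exist_ok=True)
    paths = [os.path.join(table_dir, name) for name in column_names]
    with ExitStack() as stack:
        source = stack.enter_context(open(csv_path, newline=""))
        outputs = [stack.enter_context(open(path, "w")) for path in paths]
        for row_number, row in enumerate(csv.reader(source), start=1):
            if not row:
                continue  # skip blank lines
            if len(row) != len(column_names):
                raise ValueError(
                    f"row {row_number} has {len(row)} fields, "
                    f"expected {len(column_names)}")
            for output, value in zip(outputs, row):
                output.write(value + "\n")
    return paths
```

Values that contain newlines would break the one-value-per-line format, so a real implementation would need to escape them or pick a different encoding.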

Note: The mappers/reducers may have to be written in separate files and invoked through a system call from the wrapper functions select, load, MAX, COUNT and SUM, using the Hadoop streaming API.
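
As a sketch of that idea, the code below shows a map step and a reduce step for MAX, plus a helper that builds the argument list for a Hadoop streaming job. The names `max_mapper`, `max_reducer` and `streaming_command` are hypothetical, the path to the streaming jar depends on the Hadoop installation, and the column is assumed to be numeric with one value per line. In a real job the mapper and reducer would live in separate scripts that read `sys.stdin` and print to standard output.

```python
import os


def max_mapper(lines):
    """Map step: emit one 'max<TAB>value' pair per non-empty input line."""
    for line in lines:
        value = line.strip()
        if value:
            yield f"max\t{value}"


def max_reducer(lines):
    """Reduce step: return the largest value among 'key<TAB>value' lines."""
    best = None
    for line in lines:
        _, _, value = line.rstrip("\n").partition("\t")
        if not value:
            continue
        number = float(value)  # assumes a numeric column
        if best is None or number > best:
            best = number
    return best


def streaming_command(streaming_jar, input_path, output_path, mapper, reducer):
    """Build the argv list for a Hadoop streaming job.

    streaming_jar is the path to the hadoop-streaming jar; its location
    varies between installations, so it is passed in by the caller.
    The generic option -files must come before the streaming options.
    """
    return [
        "hadoop", "jar", streaming_jar,
        "-files", f"{mapper},{reducer}",
        "-input", input_path,
        "-output", output_path,
        "-mapper", f"python3 {os.path.basename(mapper)}",
        "-reducer", f"python3 {os.path.basename(reducer)}",
    ]
```

A wrapper function such as MAX could pass the returned list to `subprocess.run(..., check=True)` and then read the result from the job's output directory.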

The Directory Organization

DATABASE_ROOT/
    database_name.schema
    dblist.db
    database_name/
        table_name/
            column_name

Note

  1. dblist.db is the file that contains the list of all the databases (there is only one such file)
  2. There is one schema file per database
  3. There is one directory for each database
  4. Column files contain the data of one column (a column name cannot repeat within a table, but the same name can appear in another table in the same db)
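
The layout above can be expressed as a few path helpers. The sketch below also shows how rows might be reassembled from column files, for example for a select. All function names are hypothetical, and it assumes that each column file stores one value per line, in the same row order across the columns of a table.

```python
import os
from contextlib import ExitStack


def schema_path(root, database):
    """Path of the schema file for a database: DATABASE_ROOT/database_name.schema."""
    return os.path.join(root, database + ".schema")


def column_path(root, database, table, column):
    """Path of a column file: DATABASE_ROOT/database_name/table_name/column_name."""
    return os.path.join(root, database, table, column)


def read_columns(root, database, table, columns):
    """Yield rows as tuples by reading the requested column files in lockstep."""
    with ExitStack() as stack:
        files = [
            stack.enter_context(open(column_path(root, database, table, name)))
            for name in columns
        ]
        for values in zip(*files):
            yield tuple(value.rstrip("\n") for value in values)
```

Because each column lives in its own file, a query that needs only some of the columns has to open only those files.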

I have commented as many of the important lines as possible. If you have any doubts, call me.