A small SQL engine for CSV files.
This project is an OCaml implementation of a SQL engine for CSV files.
It was written by Pierre Oechsel and Romain Liautaud for the DBDM course of the ENS de Lyon.
This project depends on a few packages, which can be installed using OPAM:

```shell
opam switch 4.05.0
opam install ocamlbuild
opam install ppx_deriving
opam install menhir
opam install fpath
opam install csv
```

Then, building the project is as simple as running `make`.
For a better experience using the REPL, we recommend that you install `rlwrap`.
Prequel is covered by 38 unit tests and 5 integration tests. To run them, you have to install OUnit2 using `opam install ounit2`, and then run `make test`.
To play around with the engine, just run `./prequel` to get a SQLite-like REPL.
Here are a few examples of queries:

```sql
SELECT * FROM "tests/sources/projets.csv" p;
SELECT * FROM "tests/sources/projets.csv" p WHERE p.idp > p.responsable;
SELECT * FROM "tests/sources/projets.csv" p ORDER BY p.idp DESC;
SELECT * FROM "tests/sources/projets.csv" p GROUP BY p.responsable;
```
You can also use the engine from the command line:

```
Prequel version 1.1.
Usage: ./prequel [path]
When path is not specified, runs in REPL mode.

  --output                A file in which to write the output.
  --graph                 A file in which to save a graph of the term.
  --use-caching           Enable caching optimization.
  --no-select-push-down   Disable push down of selections.
  --no-projection-opti    Disable optimisations of projections.
  --no-joins              Disable the use of joins.
  --big-data              Use algorithms suited for large datasets.
  -help                   Display this list of options
  --help                  Display this list of options
```
Prequel supports the following constructs:

- Using the `UNION` and `MINUS` set operators.
- Projecting on some or all columns (using `SELECT * FROM ◽`).
- Computing arbitrary arithmetic expressions (e.g. `SELECT 1 + 1 FROM ◽`).
- Renaming columns using `AS` (e.g. `SELECT foo.bar AS baz FROM ◽`).
- Filtering rows using the `WHERE` clause (e.g. `SELECT * FROM ◽ WHERE foo.bar + foo.baz < 10`).
- Using subqueries:
  - In the `FROM` clause (e.g. `SELECT table.foo FROM (SELECT * FROM ◽) AS table`).
  - In the `WHERE` clause, using `WHERE IN` and `WHERE NOT IN`.
- Ordering rows by arbitrary expressions (e.g. `SELECT * FROM ◽ ORDER BY foo.bar ASC, foo.baz + foo.bam DESC`).
- Grouping rows by arbitrary expressions (e.g. `SELECT * FROM ◽ GROUP BY foo.baz + foo.bam`).
- Using the `MAX`, `MIN`, `AVG` and `COUNT` aggregate functions:
  - In the `SELECT` clause (e.g. `SELECT MAX(foo.bar) FROM ◽ GROUP BY foo.baz`).
  - In the `HAVING` clause (e.g. `SELECT * FROM ◽ GROUP BY foo.baz HAVING MAX(foo.bar) > 5`).

Query examples for all these features can be found in the `tests/` folder.
Under the hood:

- Prequel parses the SQL query into an instance of `Ast.query`.
- It then applies several checking and optimization phases to that query (see `AstChecker` and `AstTransform`):
  - It gives each table a unique name to avoid attribute shadowing.
  - It normalizes the query, putting all the conditions in disjunctive normal form.
- It compiles that query into a relational term. During compilation, it tries to replace expressions of the form `SELECT * FROM a, b WHERE a.foo = b.bar` with joins.
- This term is then further optimized, by trying to:
  - push `Select`s down;
  - optimize the placement of projections (removing and adding some if needed);
  - substitute expressions of the form `Select_{a.foo = b.bar}(Product(α, β))`, where `α` contains `a` and `β` contains `b`, with a `Join`.
- Finally, it converts the term to a physical plan, and evaluates it lazily.
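The Select-over-Product rewrite mentioned above can be sketched as follows. This is a minimal illustration with simplified stand-ins for Prequel's actual term and expression types, not its real API:

```ocaml
(* Simplified stand-ins for Prequel's term and expression types. *)
type expr = Eq of string * string          (* a.foo = b.bar *)

type term =
  | Input of string                        (* a CSV file, named by its table alias *)
  | Product of term * term
  | Select of expr * term
  | Join of expr * term * term

(* Does this subtree read from the given table? *)
let rec mentions table = function
  | Input t -> t = table
  | Product (l, r) | Join (_, l, r) -> mentions table l || mentions table r
  | Select (_, t) -> mentions table t

(* Table prefix of a qualified attribute, e.g. "a" for "a.foo". *)
let table_of attr =
  match String.index_opt attr '.' with
  | Some i -> String.sub attr 0 i
  | None -> attr

(* Rewrite Select_{a.foo = b.bar}(Product(l, r)) into a Join when
   l mentions table a and r mentions table b. *)
let rec optimize = function
  | Select ((Eq (x, y) as e), Product (l, r))
    when mentions (table_of x) l && mentions (table_of y) r ->
      Join (e, optimize l, optimize r)
  | Select (e, t) -> Select (e, optimize t)
  | Product (l, r) -> Product (optimize l, optimize r)
  | Join (e, l, r) -> Join (e, optimize l, optimize r)
  | Input _ as t -> t

let () =
  let t = Select (Eq ("a.foo", "b.bar"), Product (Input "a", Input "b")) in
  match optimize t with
  | Join _ -> print_endline "rewritten to a join"
  | _ -> print_endline "kept as is"
```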
Since the start of the project, we've decided to use lazy evaluation (in the form of iterators, or *feeds* as we call them) when executing the physical plan. This allows us to have algorithms that are efficient in both speed and memory.
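The feed idea can be sketched as a pull-based iterator. This is an illustration of the approach, not Prequel's actual feed signature:

```ocaml
(* A pull-based iterator ("feed"): calling it yields the next row,
   or None when the feed is exhausted. *)
type 'a feed = unit -> 'a option

(* A feed over the rows of a list (standing in for a CSV reader). *)
let of_list rows =
  let remaining = ref rows in
  fun () ->
    match !remaining with
    | [] -> None
    | r :: rest -> remaining := rest; Some r

(* A lazy Select: rows are filtered as they are pulled, never materialized. *)
let filter p feed =
  let rec next () =
    match feed () with
    | Some r when p r -> Some r
    | Some _ -> next ()
    | None -> None
  in
  next

(* A lazy Projection. *)
let map f feed = fun () ->
  match feed () with
  | Some r -> Some (f r)
  | None -> None

(* Drain a feed into a list, for inspection only. *)
let to_list feed =
  let rec go acc =
    match feed () with
    | Some r -> go (r :: acc)
    | None -> List.rev acc
  in
  go []

let () =
  let rows = of_list [1; 2; 3; 4; 5] in
  to_list (map (fun x -> x * 10) (filter (fun x -> x mod 2 = 1) rows))
  |> List.iter (Printf.printf "%d ")  (* prints: 10 30 50 *)
```

Chaining `filter` and `map` this way never holds more than one row at a time, which is what keeps the physical plan memory-efficient.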
One bottleneck to this approach was sorting. Sorting is a cornerstone of our algorithms, but it is impossible to sort a feed without storing it. We designed an external sort which allows us to sort big files with a low memory footprint. We also took care of optimizing it until we couldn't think of further improvements. As a result, we manage to sort a 400 MB CSV file in roughly 70-75 seconds using less than 100 MB of RAM.
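The chunk-and-merge structure of such an external sort might look like this. Here the sorted chunks live in memory for illustration, whereas the real implementation writes each sorted chunk to a temporary file and merges them back while streaming:

```ocaml
(* Split a list into chunks of at most [size] elements. In a real external
   sort, each chunk is what fits in RAM at once. *)
let chunks size l =
  let rec go acc cur n = function
    | [] -> List.rev (List.rev cur :: acc)
    | x :: rest when n = size -> go (List.rev cur :: acc) [x] 1 rest
    | x :: rest -> go acc (x :: cur) (n + 1) rest
  in
  match l with [] -> [] | l -> go [] [] 0 l

(* Merge two sorted lists; on disk, this would stream two chunk files. *)
let rec merge2 cmp a b =
  match a, b with
  | [], l | l, [] -> l
  | x :: xs, y :: ys ->
      if cmp x y <= 0 then x :: merge2 cmp xs (y :: ys)
      else y :: merge2 cmp (x :: xs) ys

(* Sort each chunk independently, then fold the merges together. *)
let external_sort ?(chunk_size = 3) cmp rows =
  chunks chunk_size rows
  |> List.map (List.sort cmp)
  |> List.fold_left (merge2 cmp) []

let () =
  external_sort compare [5; 1; 4; 2; 8; 3; 7; 6]
  |> List.iter (Printf.printf "%d ")  (* prints: 1 2 3 4 5 6 7 8 *)
```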
We also designed two versions of union, minus and join. The first, used when the entire file fits in memory, relies on hash tables. The other is used when the file can't be stored in RAM; in this case, we rely extensively on external sorting.
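The in-memory variant of MINUS, for instance, can be sketched with a hash table; plain lists stand in for feeds here:

```ocaml
(* In-memory MINUS: load the right-hand side into a hash table, then
   stream the left-hand side through it. Only works when the right-hand
   side fits in RAM. *)
let minus_hash left right =
  let seen = Hashtbl.create 64 in
  List.iter (fun row -> Hashtbl.replace seen row ()) right;
  List.filter (fun row -> not (Hashtbl.mem seen row)) left

let () =
  minus_hash [1; 2; 3; 4] [2; 4]
  |> List.iter (Printf.printf "%d ")  (* prints: 1 3 *)
```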
The computation of aggregate functions was a little tricky. Because the `Group` relational operator only outputs one row per group, operators that come after it (e.g. a `Projection` when using `SELECT` with an expression containing an aggregate function, or a `Select` when using a `HAVING` clause) would lose necessary information about the inside of the group.

To work around this, Prequel first extracts all the calls to aggregate functions in the `SELECT` and `HAVING` clauses (in `AstChecker.extract_aggregates`), and replaces them with a reference to an attribute with a uniquely generated name. Then, when compiling the query, the `Group` operator receives a list of `(name, aggregate function call)` couples to compute for each group. This way, the `Group` operator can output extra columns for each function call, which can be reused by the operators that come after it.
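The extraction pass can be sketched as a walk over a simplified expression type. The types and names below are illustrative, not Prequel's actual `Ast`:

```ocaml
(* A simplified expression type standing in for Prequel's Ast. *)
type expr =
  | Attr of string
  | Const of int
  | Add of expr * expr
  | Agg of string * expr          (* e.g. Agg ("MAX", Attr "foo.bar") *)

let counter = ref 0

let fresh_name () =
  incr counter;
  Printf.sprintf "_agg%d" !counter

(* Replace each aggregate call with a freshly named attribute, and collect
   the (name, call) couples the Group operator will compute per group. *)
let rec extract = function
  | Agg (_, _) as call ->
      let name = fresh_name () in
      (Attr name, [(name, call)])
  | Add (l, r) ->
      let l', al = extract l in
      let r', ar = extract r in
      (Add (l', r'), al @ ar)
  | (Attr _ | Const _) as e -> (e, [])

let () =
  (* SELECT MAX(foo.bar) + 1 ... becomes SELECT _agg1 + 1 ... *)
  let _rewritten, aggs = extract (Add (Agg ("MAX", Attr "foo.bar"), Const 1)) in
  List.iter (fun (name, _) -> print_endline name) aggs
```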
Because of our preprocessing pass, subquery caching is not trivial in our implementation. In fact, a subtree `a` and a subtree `b` represent the same "expression" if the naming of the tables is "coherent", that is, if we can rename the tables of `a` to obtain `b`.

So, at the moment, caching is not well tested and even buggy in some cases: we mostly cache nodes of the form `Input("foo.csv")`, which is quite dumb. Because of this, it is disabled by default, and can be enabled using the `--use-caching` option.
Things that remain to be done:

- Further debugging for subqueries caching.
- A simple cost model.
  In order to choose which implementation of an algorithm to use (`UnionHash` or `UnionSort`, for example), we thought of adding a cost model which would estimate the number of elements produced by every subtree. To do so, we would have computed, for each input CSV file:
  - the number of rows;
  - for each attribute, the maximum number of identical values.
  Using this data, we can over-approximate the number of elements produced by a `Select`, a `Projection`, a `Product`, etc. Then, if this estimate is greater than a fixed threshold, we pick an algorithm which is able to perform on huge datasets.
- A join sort algorithm scaling to big tables.
  For now, the `JoinSort` algorithm keeps a lot of things in memory. This could easily be modified to store intermediate results on the hard drive: in the same way as with the external sort, we can split the sorted results into "chunks" that fit into memory. Then, we can identify which chunk contains the first occurrence of the sibling's row (using a dichotomy on the first row of each chunk). Finally, we can iterate from this first row onwards until we reach a row that can't be joined.
- Implementing `JoinProjectRename` and `ReadSelectProjectRename`.
  This can easily be done inside the `feed_from_query` function using pattern matching. We didn't think this optimisation was really useful, as our algorithms are already memory-efficient thanks to the feed approach. We lacked the time to add placeholders for these two "meta" relations.
- A `DISTINCT` operator.
  We didn't have time to implement it, but it is really easy to do using either a `GroupBy` or a `Sort`.
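The over-approximation behind the cost model idea could be sketched like this. The `stats` record, the operators and the threshold below are hypothetical stand-ins, not Prequel's actual types:

```ocaml
(* Per-file statistics we would gather for each input CSV. *)
type stats = {
  rows : int;      (* number of rows in the file *)
  max_dup : int;   (* maximum number of identical values for any attribute *)
}

(* A simplified relational term, annotated with input statistics. *)
type term =
  | Input of stats
  | SelectEq of term         (* equality selection on one attribute *)
  | Projection of term
  | Product of term * term

(* Over-approximate (row count, max duplicates) bottom-up. *)
let rec estimate = function
  | Input s -> (s.rows, s.max_dup)
  | SelectEq t ->
      let rows, dup = estimate t in
      (* an equality selection keeps at most max_dup rows *)
      (min rows dup, dup)
  | Projection t -> estimate t
  | Product (l, r) ->
      let rl, dl = estimate l in
      let rr, dr = estimate r in
      (* each duplicate on one side is repeated once per row of the other *)
      (rl * rr, max (dl * rr) (dr * rl))

(* Above this estimated size, prefer the external (sort-based) algorithm. *)
let big_data_threshold = 1_000_000

let pick_union t =
  let rows, _ = estimate t in
  if rows > big_data_threshold then "UnionSort" else "UnionHash"

let () =
  let t = Product (Input { rows = 2_000; max_dup = 10 },
                   Input { rows = 3_000; max_dup = 5 }) in
  print_endline (pick_union t)  (* prints: UnionSort *)
```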