aquery: A Scala repository from josepablocam

AQuery to Q (A2Q)

A2Q is a project that explores order-based optimizations for a simple query language. Optimizations are mainly targeted as AST rewrites and, though simple, yield empirical advantages at runtime.

It compiles AQuery into q, providing users with two advantages:

Allows users to query kdb+ databases with a syntax similar to SQL
Heuristic, order-based optimizations, which provide performance advantages, while maintaining readability in the source code
Transparent compilation into q code (allowing experienced users to inspect the code executed)

The transformations on queries provide the same semantics at a fraction of the cost. These involve: pushing selections below sorting, removing unnecessary sorts, maintaining order across local tables, amongst others.

Updates

This is a clean, Scala-based, rewrite of the AQuery system. The original implementation was developed in C. Note that the C implementation is no longer being supported and we encourage users to switch to the Scala version.

Installation/Building

If you are interested in using AQuery, the steps below should provide you with all the information needed to build on your system. Note that A2Q is meant for *nix. There are no plans on making this available for windows.

As AQuery currently compiles into q code, it is essential to have the kdb+ executable installed. You can visit kx systems to download the appropriate version for your machine. For licensing reasons, we cannot include this for you. However, the process is simple.

Once you install q, make sure you add its installation location to your path.

export PATH=$PATH:/my/q/installation/location/

Calling which q should echo back the appropriate location, if added succesfully.

Most uses of q in examples/demos will assume you have this appropriately setup.

Compilation of the AQuery translator is straightforward and handled exclusively by sbt (the Scala build tool used in the project). Installing sbt is straightforward and there are plenty of resources online. Once you have sbt installed it is a simple as

The first step is to clone this git repository

git clone git@github.com:josepablocam/aquery.git

You can then build by running

cd aquery/
sbt assembly

This will save the AQuery jar as target/scala-2.11/aquery.jar.

To get basic usage information you can run

java -jar aquery.jar -h

For convenience, we recommend creating a simple bash script that wraps the call to java and adding this to your path.

cat ./a2q
#!/bin/bash
java -jar target/scala-2.11/aquery.jar $*

For following examples, we assume that we have this script available by the name a2q and you have added the location of the script to your PATH variable (so that you can call as simply a2q).

The most common command will likely be translating AQuery to q code, with optimizations. This can be achieved with

a2q -a 1 -s -c -o my_output_file.q my_input_file.a

or equally

a2q -a 1 -s -c my_input_file.a > my_output_file.q

(Note that the file endings are not significant).

Basic Grammar

The grammar for AQuery is simple and should be familiar to most people familiar with SQL. A good way to get a thorough overview of the grammar is to explore the parser rules in src/parser/aquery.y. These are written for Bison, and should be fairly readable. We provide a simple summary here of the most frequently used constructs: queries, data creation, loading/saving data, and user-defined functions. We use ID to stand for identifier, and epsilon to stand for the standard notation.

program : [full-query | create | insert | load | dump | udf | verbatim-q]*

// aquery files allow standard q code within special markers
verbatim-q: <q> q code </q>

/********* Queries *********/
full-query : [local_queries] query

local_queries: WITH local_query+

local_query: ID ['('columns')'] AS '(' query ')'

query : SELECT projections FROM source order-clause where-clause groupby-clause

projections: [val as ID | val] (, [val as ID | val])*

source : ID [ID | AS ID] |
  ID INNER JOIN source USING (columns) |
  ID FULL OUTER JOIN source USING (columns) |
  fun_call

// fun can be verbatim-q function constructing table
// or FLATTEN/DISTINCT/CONCATENATE
fun_call: fun'('val (,val)*')'

order-clause: ASSUMING ([ASC|DESC] ID)+

where-clause: WHERE search-conditions;

groupby-clause: GROUP BY val [as ID] (, val [as ID])* [HAVING search-conditions]

// where search condition is a boolean-yielding expression
search-conditions: search-condition (AND search-condition)*;

/********* Creating data *********/
create: CREATE TABLE ID [AS query | '(' schema ')']
schema: ID type (, ID type)*

insert: INSERT INTO ID [query | VALUES '(' vals ')']
vals: val (, val)*;

/********* Loading/Saving data *********/
load: LOAD DATA INFILE str INTO TABLE ID FIELDS TERMINATED BY str

save: query INTO OUTFILE str FIELDS TERMINATED BY str

/********* User defined functions *********/
udf: FUNCTION ID '(' arg-list ')' '{' fun-body '}'
arg_list: ID (, ID)*
fun_body: val | ID := val;

/********* Expressions *********/
val: val binop val | fun_call | -val | ID | int |
  float | datetime | str | date | hex
fun: ID | abs| avg[s] | count | deltas | distinct | drop
  | fill | first | last | max[s] | min[s] | mod | next
  | prev | sum[s] | stddev | ratios | vars | moving

binop: +|-|=|*|!=|<|>|>=|<=|&&|||

Basic Examples

Following the commonly user grammar above, we provide some basic examples.

Queries

AQuery does not allow nested queries, but you can easily use WITH to construct as many intermediate, query-local tables. These tables are only available in subsequent queries within the same WITH statement.

WITH t1(c1, c2, c3) AS (
  SELECT c10, c20, c30 FROM t ASSUMING ASC c4
  WHERE f(c3) > 10 AND c5 != c6
  )
SELECT c1, sum(c2) FROM t1 GROUP BY c1

WITH nested_t1(c1, c2, c3) AS (
  SELECT c10, c20, c30 FROM t ASSUMING ASC c4
  WHERE f(c3) > 10 AND c5 != c6
  GROUP BY c10
  )
// must flatten out nested arrable using built-in FLATTEN
SELECT c1, sum(c2) FROM FLATTEN(nested_t1) GROUP BY c1

You are also free to have function calls that return a table as part of the from-clause. For example

SELECT * FROM FLATTEN(CONCATENATE(nested_1, nested_2, nested_3))

which uses the built-ins FLATTEN and CONCATENATE (DISTINCT is also available as a built-in). Users are also free to define functions (in verbatim-q for now) that return tables and use these in their queries.

<q>f:{([]c1: 1 2 3 1; c2:10 20 30 40)}</q>
SELECT * FROM f() WHERE c1 = 1

The only restrictions on these functions currently is that you cannot use dot notation to access table columns in the call. So for example

<q>
h:{([]c1:x * 2)};
t:([]c1:1 2 3);
</q>
SELECT * FROM h(t.c1)

is not valid. However, there is a simple workaround

SELECT * FROM h(t("c1"))

Creating Data

You can create data directly in AQuery as done below, but often the use of verbatim q <q> q commands here </q> allows for simple data creation for experiments.

CREATE TABLE my_table (c1 INT, c2 INT, c3 STRING)
INSERT INTO my_table VALUES(10, 20, "example")
INSERT INTO my_table SELECT * FROM my_table

You can also create tables using a query. For example:

CREATE TABLE my_table_derived
AS
  SELECT c1, c2 * 2 as twice_c2 FROM my_table

Note that the query can include any valid query-construct, such as a WITH statement.

Loading and Saving Data

For this example, we'll create a simple csv file from the shell command line

josecambronero demo$ echo "ID,val" > my_table.csv;\
 for i in {1..10}; do echo "$i,$((i * 10))" >> my_table.csv; done

We must first declare the schema of our table, as this is required to parse values to the appropriate type.

CREATE TABLE my_table (ID INT, val INT)

We can now parse in the csv file and insert the records into our predefined table.

LOAD DATA INFILE "my_table.csv"
INTO TABLE my_table
FIELDS TERMINATED BY ","

We can now perform a query and save the results into a new file. In this case, we use a pipe-delimited file instead.

SELECT ID, val, val * 2 as derived_val FROM my_table
INTO OUTFILE "new_my_table.csv"
FIELDS TERMINATED BY "|"

josecambronero demo$ cat new_my_table.csv
ID|val|derived_val
1|10|20
2|20|40
3|30|60
4|40|80
5|50|100
6|60|120
7|70|140
8|80|160
9|90|180
10|100|200

User-Defined Functions

Users can define their own aggregates using a simple syntax, described briefly in the grammar section. The main idea is: you can define local variables using :=, consecutive commands must be concatenated with ;, and the final result of the function corresponds to the last expression. If there is a ; on the last expression, the function produces no result. This closely mirrors function definitions in q, with some minor changes.

In the example below, we define our own covariance aggregate, for use in our queries. You are free to use other aggregates in the definition. In this case we use built-ins such as avg, sum, and sqrt, which are self-explanatory.

FUNCTION myCov(x, y) {
  center_x := x - avg(x);
  center_y := y - avg(y);
  num := sum(center_x * center_y);
  denom := sqrt(sum(center_x * center_x)) * sqrt(sum(center_y * center_y));
  num / denom
  }

Function Semantics

Arrables, the main data structure in AQuery, are designed to address values column-wise, as vectors. So for example, given a function and query as

FUNCTION my_avg(v) {sum(v) / count(v)}
SELECT my_avg(c1) FROM my_table

the call to my_avg passes c1 as a vector to my_avg, in contrast to passing each value of c1 separately. This vector-oriented mentality is critical to AQuery (and the underlying q/kdb+ systems), and results in more intuitive expressions.

(As a side-note, there is already a built-in called avg that calculates the average, but we reproduce our own here as an exercise in UDF writing).

As highlighted before, arrables can be nested. Consider the simple table shown below:

>SELECT * FROM my_table
>
c1 c2
-----
1  10
1  20
2  30
2  40
3  50

We can create a nested-arrable by using a GROUP-BY.

>CREATE TABLE nested_my_table AS SELECT c1, c2 FROM my_table GROUP BY c1
>
c1 c2    
---------
1  10 20i
2  30 40i
3  ,50i

Note that c2 now consists of nested vectors.

Now if you want to apply an aggregate on each of the vectors constituting the nested c2, you need to use a higher-order function that modifies the application of the function. These are currently not exposed directly in AQuery, as they constitute a more advanced function. However, users can directly access them using verbatim-q. In this case, they can use q's adverbs.

So our my_avg can be modified to

<q>my_avg_each:my_avg'</q>

We can now apply the aggregate on a query on nested_my_table and obtain the expected results:

>SELECT c1, my_avg_each(c2) as avg_val FROM nested_my_table
>
c1 avg_val
----------
1  15     
2  35     
3  50

Note that if an aggregate is used in the same query that has the GROUP BY clause, then AQuery automatically introduces the necessary modifiers in the aggregate application. So the following is equally valid:

>SELECT c1, my_avg(c2) as avg_val
>FROM my_table GROUP BY c1
>
c1 avg_val
----------
1  15     
2  35     
3  50

It is also worth highlighting that one level of nesting can be removed by applying FLATTEN (currently the only function application allowed in a FROM clause).

>SELECT * FROM FLATTEN(nested_my_table)
>
c1 c2
-----
1  10
1  20
2  30
2  40
3  50

We consider another example, in which we have two levels of nesting. We create a very simple table meant to reflect data collected in a timed experiment.

>CREATE TABLE t(indiv INT, grp STRING, val INT)
>INSERT INTO t VALUES(1, "A", 1)
>INSERT INTO t VALUES(1, "A", 2)
>INSERT INTO t VALUES(1, "A", 3)
>INSERT INTO t VALUES(1, "A", 4)
>INSERT INTO t VALUES(2, "A", 2)
>INSERT INTO t VALUES(2, "A", 2)
>INSERT INTO t VALUES(2, "A", 4)
>INSERT INTO t VALUES(2, "A", 8)
>INSERT INTO t VALUES(3, "B", 10)
>INSERT INTO t VALUES(3, "B", 20)
>INSERT INTO t VALUES(3, "B", 30)
>INSERT INTO t VALUES(3, "B", 40)
>INSERT INTO t VALUES(4, "B", 20)
>INSERT INTO t VALUES(4, "B", 20)
>INSERT INTO t VALUES(4, "B", 40)
>INSERT INTO t VALUES(4, "B", 80)
>

>SELECT * FROM t
>
indiv grp val
-------------
1     A   1  
1     A   2  
1     A   3  
1     A   4  
2     A   2  
2     A   2  
2     A   4  
2     A   8  
3     B   10
3     B   20
3     B   30
3     B   40
4     B   20
4     B   20
4     B   40
4     B   80

We create our first level of nesting by grouping by individual and group (indiv and grp, respectively).

>CREATE TABLE nested_1_t AS
>SELECT indiv, grp, val as timeseries
>FROM t GROUP BY indiv, grp
>
indiv grp timeseries
---------------------
1     A   1  2  3  4
2     A   2  2  4  8
3     B   10 20 30 40
4     B   20 20 40 80

This resembles our prior example. We could apply an aggregate on each row of timeseries by using a modifier as described before.

But consider another experiment, in which we fuse nested arrays, showing that we can create say a global time series that is the average of the given ones in a group.

>SELECT grp, avg(timeseries) as avg_time_series
>FROM nested_1_t GROUP BY grp
>
grp avg_time_series
-------------------
A   1.5 2  3.5 6   
B   15  20 35  60

Let's look at the nested arrable, before we apply our aggregate.

>SELECT grp, timeseries FROM nested_1_t GROUP BY grp
>
grp timeseries             
---------------------------
A   1 2 3 4     2 2 4 8    
B   10 20 30 40 20 20 40 80

Note that there are now two levels of nesting. Each row consists of two separate vectors. If there were n individuals per group, these would then be n-vectors and so forth.

So once again, the query with GROUP BY works as expected because AQuery added the necessary modifiers to the aggregate application behind the scenes, but if you want to do this in two separate steps, you need to use our modified aggregate.

>WITH
>temp AS (SELECT grp, timeseries FROM nested_1_t GROUP BY grp)
>SELECT grp, my_avg_each(timeseries) as avg_time_series FROM temp
>
grp avg_time_series
-------------------
A   1.5 2  3.5 6   
B   15  20 35  60

Filling in missing values

We now consider a common case in timeseries analysis: filling in missing values. Consider the the table defined below:

<q>
 t1:([]indiv:1 1 1 1 1; ts:0 1 2 3 4; val:10 20 30 40 50);
 t2:([]indiv:2 2 2; ts:2 3 4; val:30 40 50);
 t:t1,t2;
 </q>

Individual two is missing observations for timestamp (represented here as an integer for simplicity) zero and one.

In order to calculate correlation between the values for both individuals, we extend the timeseries for individual two.

<q>corEach:cor';</q>
FUNCTION fill_backward(x) {reverse(fills(reverse(x)))}

WITH
  ts AS (SELECT ts FROM t)
  indiv AS (SELECT indiv FROM t)
  complete AS (SELECT * FROM distinct(ts), distinct(indiv))
  filled AS (SELECT   
    indiv, ts,
    fill_backward(val) as filled_val
    FROM
    complete FULL OUTER JOIN t USING (indiv, ts)
    ASSUMING ASC ts
    GROUP BY indiv)
   SELECT
   f1.indiv as indiv1, f2.indiv as indiv2,
   corEach(f1.filled_val, f2.filled_val) as filled_corr
   FROM filled f1, filled f2 WHERE f1.indiv != f2.indiv

ts and indiv are single column tables representing all timestamps and individuals, respectively. complete is a cartesian-product of the distinct values in the two tables, generating a complete history for each individual. filled uses a UDF fill_backward to fill any missing values with the following non-missing value (hence the name). This is used in complete by joining our original table with our extended history table and applying the UDF by individual. Finally, our resulting query uses the filled data to calculate the correlation between the two individuals' timeseries.

If you have any doubts/issues/feedback, please do not hesitate to contact Jose Cambronero