OpenGen/GenSQL.query

Make `DENSITY` keyword optional and parallel to `GIVEN`

Schaechtle opened this issue · 0 comments

Overview

Why are we doing this?

This is the second "sprint" towards IQL-permissive (this issue describes the first). It will allow us to solve simpler sub-problems first.

Technical approach

We want IQL-permissive to work by translating from permissive ASTs to strict ASTs. If necessary, IQL-permissive can consult the schema and data tables and knows which column-variable is modeled by which model. Runtime errors are 100% acceptable.

The current implementation of IQL includes two different probability expressions: PROBABILITY OF expressions take distribution events and a model expression, while PROBABILITY DENSITY OF expressions take density events and a model expression. IQL-permissive doesn't require the DENSITY keyword: there, PROBABILITY OF expressions take (i) either a distribution event or a density event and (ii) a model expression.
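To make the AST-to-AST idea concrete, here is a minimal Python sketch. It is not the GenSQL.query implementation: the class names (`Event`, `PermissiveProbability`, `StrictProbability`) and the functions `is_density_event` and `to_strict` are hypothetical. It assumes that all events in one expression are joined with AND, and that an event counts as a density event exactly when its operator is `=`, which is inferred from the examples in this issue rather than from a spec.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Event:
    """One comparison, e.g. VAR foo = foo or VAR foo > foo."""
    variable: str
    operator: str  # "=", ">", "<", ...
    value: str


@dataclass(frozen=True)
class PermissiveProbability:
    """PROBABILITY OF <events> UNDER <model> in IQL-permissive (hypothetical node)."""
    events: Tuple[Event, ...]
    model: str


@dataclass(frozen=True)
class StrictProbability:
    """PROBABILITY [DENSITY] OF <events> UNDER <model> in IQL-strict (hypothetical node)."""
    density: bool  # True stands for PROBABILITY DENSITY OF
    events: Tuple[Event, ...]
    model: str


def is_density_event(event: Event) -> bool:
    # Assumption: "=" marks a density event, any other operator a distribution event.
    return event.operator == "="


def to_strict(node: PermissiveProbability) -> StrictProbability:
    # Use the DENSITY form only when every event can be read as a density event.
    density = all(is_density_event(e) for e in node.events)
    return StrictProbability(density=density, events=node.events, model=node.model)
```

The translation only decides which of the two strict expressions to emit; the events and the model expression pass through unchanged.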

We've reached a point where a keyword in IQL-permissive means something different from the same keyword in IQL-strict/current. That means we need to consider whether this change merits a split (e.g. running two different grammars). ⚠️ TODO: link to full permissive spec doc here for reference and include a design of the split.⚠️

Examples

For the sake of readability, I am translating model expressions from query segments in permissive to query segments in strict and not ASTs to ASTs. We assume the following environment:

  • m is a model
  • d is a data table
  • foo, bar and baz are columns in d and also column variables in m.
  • the schema records that foo, bar are numerical, while baz is nominal.

The example model expressions below should be viewed as part of complete PROBABILITY OF queries, i.e. queries like SELECT [probability expression] UNDER m FROM d.

Example model expressions

Below, ➡️ means "translate AST of sub-query in IQL-permissive to IQL-strict", for example:

PROBABILITY OF VAR foo = foo ➡️ PROBABILITY DENSITY OF VAR foo = foo

Note: when an event could be either a distribution event or a density event, then it should be treated as a density event, as these can be more efficiently handled by the backend.
PROBABILITY OF VAR baz = baz ➡️ PROBABILITY DENSITY OF VAR baz = baz

PROBABILITY OF can take a comma (`,`) between events. A comma always means AND. While this may seem redundant, having both is necessary to ensure that certain permissive queries read clearly to outside users with little background in either programming or probability:
PROBABILITY OF VAR foo = foo, VAR baz = baz ➡️ PROBABILITY DENSITY OF VAR foo = foo AND VAR baz = baz
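The comma rule can be sketched on query segments as follows. `commas_to_and` is a hypothetical helper, not an existing GenSQL.query function, and it assumes that commas never appear inside an event (for example inside a quoted value).

```python
def commas_to_and(events: str) -> str:
    """Rewrite comma-separated events as events joined by AND."""
    # Assumption: every comma is an event separator, never part of a value.
    parts = [part.strip() for part in events.split(",")]
    return " AND ".join(parts)


print(commas_to_and("VAR foo = foo, VAR baz = baz"))
```

A segment without commas is returned unchanged, so the helper can be applied to every permissive expression.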

With events where = is the only operator, mixing nominal and numerical variables is trivial:
PROBABILITY OF VAR foo = foo, VAR bar = bar ➡️ PROBABILITY DENSITY OF VAR foo = foo AND VAR bar = bar

Distribution events as inputs in IQL-permissive PROBABILITY OF are always translated to PROBABILITY OF in IQL-strict:
PROBABILITY OF VAR foo > foo, VAR bar = bar ➡️ PROBABILITY OF VAR foo > foo AND VAR bar = bar

The following will result in a runtime error in IQL-permissive:
PROBABILITY OF VAR foo > foo AND VAR bar = bar ➡️ PROBABILITY OF VAR foo > foo AND VAR bar = bar 💥ERROR💥

OR is supported, too:
PROBABILITY OF VAR foo > foo OR VAR baz = baz ➡️ PROBABILITY OF VAR foo > foo OR VAR baz = baz
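The translations listed above can be reproduced with a small string-level sketch. It works on query segments rather than ASTs, mirroring the examples, and rests on assumptions that are not confirmed by a spec: `translate` is a hypothetical function, `<` or `>` anywhere in the events marks a distribution event, and expressions consisting only of `=` events joined by OR are not covered by the examples, so the sketch refuses them. It does not model the runtime error case.

```python
import re

PREFIX = "PROBABILITY OF "
# Assumption: any "<" or ">" means the expression contains a distribution event.
DISTRIBUTION_OPERATOR = re.compile(r"[<>]")


def translate(permissive: str) -> str:
    """Translate a permissive PROBABILITY OF segment into a strict segment."""
    if not permissive.startswith(PREFIX):
        raise ValueError("expected a segment starting with PROBABILITY OF")
    # A comma always means AND.
    parts = permissive[len(PREFIX):].split(",")
    events = " AND ".join(part.strip() for part in parts)
    if DISTRIBUTION_OPERATOR.search(events):
        # Distribution events stay in PROBABILITY OF.
        return PREFIX + events
    if " OR " in events:
        # Not covered by the examples in this issue; left undecided here.
        raise NotImplementedError("OR over equality-only events is unspecified")
    # Only "=" events joined by AND: treat the whole expression as a density event.
    return "PROBABILITY DENSITY OF " + events


for segment in (
    "PROBABILITY OF VAR baz = baz",
    "PROBABILITY OF VAR foo > foo, VAR bar = bar",
    "PROBABILITY OF VAR foo > foo OR VAR baz = baz",
):
    print(segment, "->", translate(segment))
```

Keeping the decision in one function makes it easy to swap the operator check for a schema lookup later, should the stattype of a variable need to influence the choice.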

Non-goals

The following features for IQL-permissive will be tackled during later sprints:

  • Removing the VAR keyword.
  • Translating GIVEN foo into GIVEN VAR foo = foo.
  • Changing the order of GIVEN - i.e. the ability to write PROBABILITY OF foo GIVEN bar UNDER model instead of PROBABILITY OF foo UNDER model GIVEN bar
  • Nesting of GIVEN is not strictly required.

Other non-goals for now (which might become important later)

  • IQL-permissive does not need to ensure useful error messages are thrown.
  • We'll assume one schema (i.e. a single mapping from column to stattype). In the future, different models may support different schemas.

Open issues

I (Ulli) should create a complete spec for IQL-permissive that Zane can work from and that can be extended to cover issues like this one.