OpenGen/GenSQL.query

Replace `CONSTRAINED BY` and `CONDITIONED BY` with `GIVEN`

Overview

Why are we doing this?

This is the first "sprint" towards IQL-permissive. It will allow us to solve simpler sub-problems first.

Technical approach

We want IQL-permissive to work by translating permissive ASTs to strict ASTs. To do that, IQL-permissive will not be context-free: it can consult the schema and data tables, and it knows which column variable is modeled by which model.

This is necessary for this PR, too.

Sometimes, GIVEN statements will have to be translated into strict ASTs that encode a model expression with both CONSTRAINED BY and CONDITIONED BY. Another way to think about this: CONDITIONED BY takes density events and CONSTRAINED BY takes distribution events, whereas GIVEN takes both kinds, as well as conjunctions (i.e. AND-linked lists) that mix the two.

Examples

For the sake of readability, I am translating model expressions from query segments in permissive to query segments in strict, rather than from ASTs to ASTs. We assume the following environment:

  • m is a model
  • d is a data table
  • foo, bar and baz are columns in d and also column variables in m.
  • the schema records that foo, bar are numerical, while baz is nominal.

The example model expressions below should be viewed as part of PROBABILITY OF queries, i.e. queries like SELECT PROBABILITY OF VAR foo = foo UNDER [model expression] FROM data. The logic applies equally to GENERATE.

Example model expressions

Below, ➡️ means "translate AST of sub-query in IQL-permissive to IQL-strict", for example:

UNDER m GIVEN VAR foo = foo ➡️ under m CONDITIONED BY VAR foo = foo

Note: when an event could be either a distribution event or a density event, then it should be treated as a density event, as these can be more efficiently handled by the backend.
UNDER m GIVEN VAR baz = baz ➡️ under m CONDITIONED BY VAR baz = baz
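The note above can be sketched as a small classification step. This is only an illustration in Python, not the project's implementation: the `SCHEMA` dict, the `Kind` enum and both function names are hypothetical, and the rule that equality on a nominal variable can be read either way (while equality on a numerical variable is density-only) is an assumption inferred from the examples in this issue.

```python
from enum import Enum

# Hypothetical schema mirroring the environment assumed in this issue.
SCHEMA = {"foo": "numerical", "bar": "numerical", "baz": "nominal"}


class Kind(Enum):
    DENSITY = "density"
    DISTRIBUTION = "distribution"


def possible_kinds(column, operator, schema=SCHEMA):
    """Return every kind of event that `VAR column <operator> ...` could be."""
    if operator == "=":
        # Assumption: equality on a nominal variable can be read either way;
        # equality on a numerical variable is only a density event.
        if schema[column] == "nominal":
            return {Kind.DENSITY, Kind.DISTRIBUTION}
        return {Kind.DENSITY}
    # Assumption: inequalities such as > or < are distribution events.
    return {Kind.DISTRIBUTION}


def preferred_kind(column, operator, schema=SCHEMA):
    """Prefer the density reading whenever it is available (see the note)."""
    kinds = possible_kinds(column, operator, schema)
    return Kind.DENSITY if Kind.DENSITY in kinds else Kind.DISTRIBUTION
```

With this sketch, `preferred_kind("baz", "=")` picks the density reading, which matches the CONDITIONED BY translation shown above.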

GIVEN can take a comma (,) as a separator. A comma always means AND. While this may seem redundant, having both is necessary so that certain permissive queries read clearly to outside users with little background in either programming or probability:
UNDER m GIVEN VAR foo = foo, VAR bar = bar ➡️ under m CONDITIONED BY VAR foo = foo AND VAR bar = bar

With pure density events, mixing nominal and numerical variables is trivial:
UNDER m GIVEN VAR foo = foo, VAR baz = baz ➡️ under m CONDITIONED BY VAR foo = foo AND VAR baz = baz

AND also works with GIVEN:
UNDER m GIVEN VAR foo = foo AND VAR bar = bar ➡️ under m CONDITIONED BY VAR foo = foo AND VAR bar = bar
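The equivalence of the comma and AND can be illustrated at the string level, in the same spirit as the query segments used in these examples. This is a hedged sketch only: the real translation is meant to operate on ASTs, and the helper names `split_given` and `to_strict_conjunction` are hypothetical.

```python
import re


def split_given(clause):
    """Split the text following GIVEN into individual event strings.

    Both "," and the keyword AND (surrounded by whitespace) are treated
    as the same separator.
    """
    parts = re.split(r"\s*,\s*|\s+AND\s+", clause.strip())
    return [part for part in parts if part]


def to_strict_conjunction(clause):
    """Join the events with AND, the only form used on the strict side here."""
    return " AND ".join(split_given(clause))
```

Under these assumptions, `to_strict_conjunction("VAR foo = foo, VAR bar = bar")` and `to_strict_conjunction("VAR foo = foo AND VAR bar = bar")` produce the same conjunction.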

Distribution events in GIVEN expressions are translated to CONSTRAINED BY expressions:
UNDER m GIVEN VAR foo > foo ➡️ under m CONSTRAINED BY VAR foo > foo

These events can be mixed and matched, but may require chaining of CONSTRAINED BY and CONDITIONED BY:
UNDER m GIVEN VAR foo > foo AND VAR bar = bar ➡️ (under m CONDITIONED BY VAR bar = bar) CONSTRAINED BY VAR foo > foo

It's preferred to avoid chaining where possible, though. Here baz is nominal, so VAR baz = baz can also be read as a distribution event:
UNDER m GIVEN VAR foo > foo AND VAR baz = baz ➡️ under m CONSTRAINED BY VAR foo > foo AND VAR baz = baz
This last model expression could also be translated into an equivalent expression chaining CONSTRAINED BY and CONDITIONED BY.
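Taken together, the examples in this section suggest a translation rule along the following lines. This is a minimal sketch under stated assumptions, not the actual implementation: it works on strings rather than ASTs, events are hypothetical `(column, operator, right-hand side)` tuples, `SCHEMA` and all function names are invented for illustration, and the density/distribution rules are inferred from the examples.

```python
# Hypothetical schema mirroring the environment assumed in this issue.
SCHEMA = {"foo": "numerical", "bar": "numerical", "baz": "nominal"}


def can_be_density(event, schema):
    # Assumption: only equality events can be density events.
    _column, operator, _rhs = event
    return operator == "="


def can_be_distribution(event, schema):
    # Assumption: inequalities, and equality on nominal variables,
    # can be distribution events.
    column, operator, _rhs = event
    return operator != "=" or schema[column] == "nominal"


def render(events):
    return " AND ".join(f"VAR {c} {op} {rhs}" for c, op, rhs in events)


def translate_given(model, events, schema=SCHEMA):
    """Translate `UNDER model GIVEN events` into a strict model expression."""
    # 1. Prefer a single CONDITIONED BY when every event can be a density event.
    if all(can_be_density(e, schema) for e in events):
        return f"under {model} CONDITIONED BY {render(events)}"
    # 2. Otherwise avoid chaining if every event can be a distribution event.
    if all(can_be_distribution(e, schema) for e in events):
        return f"under {model} CONSTRAINED BY {render(events)}"
    # 3. Fall back to chaining: density events inside, the rest outside.
    density = [e for e in events if can_be_density(e, schema)]
    distribution = [e for e in events if not can_be_density(e, schema)]
    return (
        f"(under {model} CONDITIONED BY {render(density)}) "
        f"CONSTRAINED BY {render(distribution)}"
    )
```

For example, `translate_given("m", [("foo", ">", "foo"), ("bar", "=", "bar")])` falls through to the chained form, while replacing bar with the nominal baz yields a single CONSTRAINED BY. The order of the three checks encodes the preferences stated above: density first, then no chaining, then chaining only when unavoidable.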

Non-goals

The following features for IQL-permissive will be tackled during later sprints:

  • Removing the VAR keyword.
  • Translating GIVEN foo into GIVEN VAR foo = foo.
  • Making DENSITY optional. This is related to this PR, though, because of the parallelism: PROBABILITY and CONSTRAINED BY both take a distribution event, while PROBABILITY DENSITY and CONDITIONED BY both take a density event.
  • Changing the order of GIVEN, i.e. the ability to write PROBABILITY OF foo GIVEN bar UNDER model instead of PROBABILITY OF foo UNDER model GIVEN bar.
  • Nesting of GIVEN is not strictly required.

Other non-goals for now (which might become important later)

  • IQL-permissive does not need to ensure useful error messages are thrown.
  • We'll assume one schema (i.e. a single mapping from column to stattype). In the future, different models may support different schemas.

Open issues

I (Ulli) should create a complete spec for IQL-permissive that Zane can work from and that can be extended to cover issues like this one.

Reminder for me: this needs to talk about OR!