probcomp/bayeslite

SIMILARITY should require exactly 1 WITH RESPECT TO variable

Closed this issue · 1 comments

fsaad commented

From @versar

@vkm tells me that `ESTIMATE SIMILARITY FROM PAIRWISE <population_name>` is not
a valid expression, and that `ESTIMATE SIMILARITY WITH RESPECT TO
<multiple_variable_names> FROM PAIRWISE <population_name>`  is also not
correct. He says that the only valid expression is `ESTIMATE SIMILARITY
WITH RESPECT TO <one_variable_name> FROM PAIRWISE <population_name>`.
However, many people, like me, run the first two expression(s) assuming
they are calculating some similarity metric with respect to the entire set
of dependencies, or with respect to a subset of variables.

From @vkmvkmvkmvkm

IIRC the CrossCat paper actually doesn't define ``similarity'' except WRT
exactly 1 variable, and the heuristics implemented in bayeslite are all fairly
arbitrary. the actual problem is we've seen plots of outputs from these
arbitrary heuristics that don't have a principled rationale in terms of a query
against a GPM, but seem suggestive or tempting to build on. and that creates all
kinds of problems.
fsaad commented

Originally, the goal was for the parser to fail in:
https://github.com/probcomp/bayeslite/blob/master/src/parse.py#L612-L613

but that won't work, since the WITH RESPECT TO (expression) columns can be the output of a subquery expression which the parser does not know how many rows it includes.

The compiler happens to execute the the expression in and we can determine the number of columns over there, but it's a bad idea because compile_column_lists is a general function which may be used by clients other than SIMILARITY.
https://github.com/probcomp/bayeslite/blob/master/src/compiler.py#L1255-L1280

So the simplest solution is to perform the check in bqlfn.bql_row_similarity.