SIMILARITY should require exactly 1 WITH RESPECT TO variable
Closed this issue · 1 comments
fsaad commented
From @versar
@vkm tells me that `ESTIMATE SIMILARITY FROM PAIRWISE <population_name>` is not
a valid expression, and that `ESTIMATE SIMILARITY WITH RESPECT TO
<multiple_variable_names> FROM PAIRWISE <population_name>` is also not
correct. He says that the only valid expression is `ESTIMATE SIMILARITY
WITH RESPECT TO <one_variable_name> FROM PAIRWISE <population_name>`.
However, many people, like me, run the first two expression(s) assuming
they are calculating some similarity metric with respect to the entire set
of dependencies, or with respect to a subset of variables.
From @vkmvkmvkmvkm
IIRC the CrossCat paper actually doesn't define ``similarity'' except WRT
exactly 1 variable, and the heuristics implemented in bayeslite are all fairly
arbitrary. the actual problem is we've seen plots of outputs from these
arbitrary heuristics that don't have a principled rationale in terms of a query
against a GPM, but seem suggestive or tempting to build on. and that creates all
kinds of problems.
fsaad commented
Originally, the goal was for the parser to fail in:
https://github.com/probcomp/bayeslite/blob/master/src/parse.py#L612-L613
but that won't work, since the WITH RESPECT TO (expression)
columns can be the output of a subquery expression
which the parser does not know how many rows it includes.
The compiler happens to execute the the expression
in and we can determine the number of columns over there, but it's a bad idea because compile_column_lists
is a general function which may be used by clients other than SIMILARITY
.
https://github.com/probcomp/bayeslite/blob/master/src/compiler.py#L1255-L1280
So the simplest solution is to perform the check in bqlfn.bql_row_similarity
.