Support INSERT and UPDATE with SELECT
jsign opened this issue · 7 comments
We've talked about supporting INSERT and UPDATEs that rely on subqueries.
Leaving here some things I've talked to multiple people to be aware of:
- All the subqueries (if we want to support more than one?) should be deterministic, i.e: probably have an
ORDER BY
by withINTEGER PRIMARY KEY AUTOINCREMENT
. This is needed to know that all the INSERT/UPDATE execution is deterministic. - We'd have to think about limiting the number of sub-queries and where they can appear.
- We need to define which tabes you can do SELECTs (i.e: it seems pretty obvious to me that can only be tables from your same chain)
- We need to think about what's the worst thing that could happen regarding DoS/security. (e.g: what if someone INSERTs in a table all the data from all tables in that chain?)
- We need to think about if SELECTs can be nested.
- We need to think about how flexible we want SELECTs to be, or focus on a really concrete structure (e.g: having clauses? group by?, etc)
The list above is probably incomplete, I'm just leaving some thoughts here.
Note that the INTEGER PRIMARY KEY AUTOINCREMENT
enforcement can't be done in the parser, since the parser doesn't know about the table schema. This is done at the execution step by validators.
Cross-reference: tablelandnetwork/go-tableland#362
Going to start working on this. Given that we solved the rowid
non-determinism issue, I'm going to share my current thinking about this feature and try to offer some thoughts to @jsign questions.
Determinism
The first thing is how we guarantee a deterministic execution of an INSERT + SELECT
. I'm going to assume an UPDATE + SELECT
is deterministic by the nature of the statement. Here's a simple example to illustrate the problem with INSERT + SELECT
:
sqlite> create table t (a int);
sqlite> insert into t SELECT * from (select 1 UNION select 2) order by random();
sqlite> insert into t SELECT * from (select 1 UNION select 2) order by random();
sqlite> insert into t SELECT * from (select 1 UNION select 2) order by random();
sqlite> select rowid, a from t;
1|2
2|1
3|2
4|1
5|1
6|2
You can see that if we don't have a deterministic select we can insert rows out of order. That means we need to force a deterministic ORDER BY
in the SELECT
.
Now that we are forcing AUTOINCREMENT
on every INTEGER PRIMARY KEY
column and that column is an alias to rowid
, we can force an ORDER BY rowid
in the SELECT of an INSERT + SELECT
.
Ok, that sounds like a good solution. However, when you take into account subqueries and compound selects, a rowid
may not be present (even if it is, you can have duplicated rowids). For example, SELECT * FROM (SELECT a FROM t)
or SELECT a FROM t UNION SELECT a FROM t
does not have a rowid
, which means we cannot allow subqueries, nested queries, and compound queries. We can only allow flattened selects that fetch data directly from a table. That answers some of @jsign points.
GROUP BY
and HAVING
GROUP BY
and HAVING
is a bit tricky. I'm not sure if they behave deterministically. Example:
sqlite> create table t2 (a int, b text);
sqlite> insert into t2 values (1, 'a'), (2, 'a'), (4, 'b');
sqlite> select b, sum(a) FROM t2 group by b;
a|3
b|4
sqlite> select b, sum(a) FROM t2 group by b;
a|3
b|4
sqlite> select b, sum(a) FROM t2 group by b order by rowid;
a|3
b|4
Not sure what the order by rowid
in that case means.
We can disallow them until we learn more about them.
From which tables can we SELECT
?
I think the main issue is if the SELECT
references tables that others have write access. Suppose I will execute an INSERT + SELECT
on table X that others have write access, and right before I execute that statement someone else execute a DELETE
or UPDATE
. I will get a different result from the one I expected, I think. Makes sense? Not sure we want to restrict that to only tables the user has access to.
For safety, we can start by restricting only to tables of the same chain. And we learn from there.
About DoS and JOINs
I think we could restrict the number of JOINs
one can make. That would be good in general. SQLite
already has a limit:
SQLite does not support joins containing more than 64 tables. This limit arises from the fact that the SQLite code generator uses bitmaps with one bit per join-table in the query optimizer.
But maybe we can restrict even more. Maybe 10? I think the need for a lot of JOINs come from analytical queries. I don't think we are focusing on analytical use cases yet in Tableland.
By restricting JOINs
and disallowing UNION
we are able to limit the radius of a DoS attack.
lmk your thoughts @carsonfarmer @jsign
We can only allow flattened selects that fetch data directly from a table. That answers some of @jsign points.
Agree.
Not sure what the
order by rowid
in that case means.We can disallow them until we learn more about them.
Huh, that's very weird. My gut reaction is that order by rowid
shouldn't work there since it isn't part of the group by
, but it looks like it does.
Agree on avoiding group by
here.
From which tables can we
SELECT
?I think the main issue is if the
SELECT
references tables that others have write access. Suppose I will execute anINSERT + SELECT
on table X that others have write access, and right before I execute that statement someone else execute aDELETE
orUPDATE
. I will get a different result from the one I expected, I think. Makes sense? Not sure we want to restrict that to only tables the user has access to.
I think I understand your point, but I'm not sure we should consider that a problem. The same happens in a simple situation like a write-query UPDATE xxxxx SET a=1
. Technically speaking, any other person having to write access to xxxx
could have sent another write-query that would make the intention of UPDATE xxxx SET a=1
not be what that person expected.
For safety, we can start by restricting only to tables of the same chain. And we learn from there.
To be clear, I'm not sure that conclusion relates to your previous point (please correct me if I misunderstood). But in any case, this is an invariant that we must enforce. We can't ever allow a write-query to mix tables from different chains since validators can sync them at different speeds.
About DoS and
JOINs
I think we could restrict the number of
JOINs
one can make. That would be good in general.SQLite
already has a limit:SQLite does not support joins containing more than 64 tables. This limit arises from the fact that the SQLite code generator uses bitmaps with one bit per join-table in the query optimizer.
But maybe we can restrict even more. Maybe 10? I think the need for a lot of JOINs come from analytical queries. I don't think we are focusing on analytical use cases yet in Tableland.
By restricting
JOINs
and disallowingUNION
we are able to limit the radius of a DoS attack.
Uhm, I'll derail from your point a bit. If we allow joins, the rowid
to be used in the order by
might be a bit complicated.
If we join 3 tables, we'd have to be sure that the rowid
used in the order by
doesn't have repeated rows in the JOIN result. If that's the case, we could still have nondeterministic order if we don't force more rowids
.
Makes some sense?
I think I understand your point, but I'm not sure we should consider that a problem. The same happens in a simple situation like a write-query UPDATE xxxxx SET a=1. Technically speaking, any other person having to write access to xxxx could have sent another write-query that would make the intention of UPDATE xxxx SET a=1 not be what that person expected.
Oh yeah. True.
To be clear, I'm not sure that conclusion relates to your previous point (please correct me if I misunderstood). But in any case, this is an invariant that we must enforce. We can't ever allow a write-query to mix tables from different chains since validators can sync them at different speeds.
Yeah. It was not related.
Uhm, I'll derail from your point a bit. If we allow joins, the rowid to be used in the order by might be a bit complicated.
If we join 3 tables, we'd have to be sure that the rowid used in the order by doesn't have repeated rows in the JOIN result. If that's the case, we could still have nondeterministic order if we don't force more rowids.
Makes some sense?
Good point! Looks like the JOIN
falls in the same case as the UNION
and subqueries (you lose access to rowid
along the way):
sqlite> select t.* from t, t2 ORDER BY rowid;
Parse error: no such column: rowid
select t.* from t, t2 ORDER BY rowid;
error here ---^
Going to rectify something I said above "I'm going to assume an UPDATE + SELECT
is deterministic by the nature of the statement.". That is only true if there is a one-to-one mapping between the target table and the FROM clause results. From docs:
If the join between the target table and the FROM clause results in multiple output rows for the same target table row, then only one of those output rows is used for updating the target table. The output row selected is arbitrary and might change from one release of SQLite to the next, or from one run to the next.
That would be very hard to check in the parser. So, not sure how we would move forward with UPDATE + SELECT
.
Summary of the above discussion
INSERT
+ SELECT
- Cannot support
UNION
s,JOIN
s, and subqueries (only flattenedSELECT
s with direct table access) - Only allow references to tables in the same chain
- For now, block
HAVING
andGROUP BY
until we understand how they behave better - Gotta force an
ORDER BY
rowid
This is pretty limiting but we still would be able to support the data migration/copy use case, which is a big win:
INSERT INTO table2 (column1, column2, column3, ...)
SELECT column1, column2, column3, ...
FROM table1
WHERE condition;
UPDATE
+ SELECT
I'm afraid we cannot support it because of the following problem
If the join between the target table and the FROM clause results in multiple output rows for the same target table row, then only one of those output rows is used for updating the target table. The output row selected is arbitrary and might change from one release of SQLite to the next, or from one run to the next.
Checking if the query would fall in that case would be pretty hard.
Ok, awesome, and thanks for this summary. Yes this still seems worth it, especially for the ability to copy table content etc. This is actually a bit of a super power that is actually a bit of a differentiator for Tableland over other blockchain storage solutions in some respect.