tablelandnetwork/go-sqlparser

Support INSERT and UPDATE with SELECT

jsign opened this issue · 7 comments

jsign commented

We've talked about supporting INSERT and UPDATEs that rely on subqueries.

Leaving here some things I've talked to multiple people to be aware of:

  • All the subqueries (if we want to support more than one?) should be deterministic, i.e: probably have an ORDER BY by with INTEGER PRIMARY KEY AUTOINCREMENT. This is needed to know that all the INSERT/UPDATE execution is deterministic.
  • We'd have to think about limiting the number of sub-queries and where they can appear.
  • We need to define which tabes you can do SELECTs (i.e: it seems pretty obvious to me that can only be tables from your same chain)
  • We need to think about what's the worst thing that could happen regarding DoS/security. (e.g: what if someone INSERTs in a table all the data from all tables in that chain?)
  • We need to think about if SELECTs can be nested.
  • We need to think about how flexible we want SELECTs to be, or focus on a really concrete structure (e.g: having clauses? group by?, etc)

The list above is probably incomplete, I'm just leaving some thoughts here.

cc @sanderpick @carsonfarmer @brunocalza

jsign commented

Note that the INTEGER PRIMARY KEY AUTOINCREMENT enforcement can't be done in the parser, since the parser doesn't know about the table schema. This is done at the execution step by validators.

Cross-reference: tablelandnetwork/go-tableland#362

Going to start working on this. Given that we solved the rowid non-determinism issue, I'm going to share my current thinking about this feature and try to offer some thoughts to @jsign questions.

Determinism

The first thing is how we guarantee a deterministic execution of an INSERT + SELECT. I'm going to assume an UPDATE + SELECT is deterministic by the nature of the statement. Here's a simple example to illustrate the problem with INSERT + SELECT:

sqlite> create table t (a int);
sqlite> insert into t SELECT * from (select 1 UNION select 2) order by random();
sqlite> insert into t SELECT * from (select 1 UNION select 2) order by random();
sqlite> insert into t SELECT * from (select 1 UNION select 2) order by random();
sqlite> select rowid, a from t;
1|2
2|1
3|2
4|1
5|1
6|2

You can see that if we don't have a deterministic select we can insert rows out of order. That means we need to force a deterministic ORDER BY in the SELECT.

Now that we are forcing AUTOINCREMENT on every INTEGER PRIMARY KEY column and that column is an alias to rowid, we can force an ORDER BY rowid in the SELECT of an INSERT + SELECT.

Ok, that sounds like a good solution. However, when you take into account subqueries and compound selects, a rowid may not be present (even if it is, you can have duplicated rowids). For example, SELECT * FROM (SELECT a FROM t) or SELECT a FROM t UNION SELECT a FROM t does not have a rowid, which means we cannot allow subqueries, nested queries, and compound queries. We can only allow flattened selects that fetch data directly from a table. That answers some of @jsign points.

GROUP BY and HAVING

GROUP BY and HAVING is a bit tricky. I'm not sure if they behave deterministically. Example:

sqlite> create table t2 (a int, b text);
sqlite> insert into t2 values (1, 'a'), (2, 'a'), (4, 'b');
sqlite> select b, sum(a) FROM t2 group by b;
a|3
b|4
sqlite> select b, sum(a) FROM t2 group by b;
a|3
b|4
sqlite> select b, sum(a) FROM t2 group by b order by rowid;
a|3
b|4

Not sure what the order by rowid in that case means.

We can disallow them until we learn more about them.

From which tables can we SELECT?

I think the main issue is if the SELECT references tables that others have write access. Suppose I will execute an INSERT + SELECT on table X that others have write access, and right before I execute that statement someone else execute a DELETE or UPDATE. I will get a different result from the one I expected, I think. Makes sense? Not sure we want to restrict that to only tables the user has access to.

For safety, we can start by restricting only to tables of the same chain. And we learn from there.

About DoS and JOINs

I think we could restrict the number of JOINs one can make. That would be good in general. SQLite already has a limit:

SQLite does not support joins containing more than 64 tables. This limit arises from the fact that the SQLite code generator uses bitmaps with one bit per join-table in the query optimizer.

But maybe we can restrict even more. Maybe 10? I think the need for a lot of JOINs come from analytical queries. I don't think we are focusing on analytical use cases yet in Tableland.

By restricting JOINs and disallowing UNION we are able to limit the radius of a DoS attack.

lmk your thoughts @carsonfarmer @jsign

jsign commented

We can only allow flattened selects that fetch data directly from a table. That answers some of @jsign points.

Agree.

Not sure what the order by rowid in that case means.

We can disallow them until we learn more about them.

Huh, that's very weird. My gut reaction is that order by rowid shouldn't work there since it isn't part of the group by, but it looks like it does.
Agree on avoiding group by here.

From which tables can we SELECT?

I think the main issue is if the SELECT references tables that others have write access. Suppose I will execute an INSERT + SELECT on table X that others have write access, and right before I execute that statement someone else execute a DELETE or UPDATE. I will get a different result from the one I expected, I think. Makes sense? Not sure we want to restrict that to only tables the user has access to.

I think I understand your point, but I'm not sure we should consider that a problem. The same happens in a simple situation like a write-query UPDATE xxxxx SET a=1. Technically speaking, any other person having to write access to xxxx could have sent another write-query that would make the intention of UPDATE xxxx SET a=1 not be what that person expected.

For safety, we can start by restricting only to tables of the same chain. And we learn from there.

To be clear, I'm not sure that conclusion relates to your previous point (please correct me if I misunderstood). But in any case, this is an invariant that we must enforce. We can't ever allow a write-query to mix tables from different chains since validators can sync them at different speeds.

About DoS and JOINs

I think we could restrict the number of JOINs one can make. That would be good in general. SQLite already has a limit:

SQLite does not support joins containing more than 64 tables. This limit arises from the fact that the SQLite code generator uses bitmaps with one bit per join-table in the query optimizer.

But maybe we can restrict even more. Maybe 10? I think the need for a lot of JOINs come from analytical queries. I don't think we are focusing on analytical use cases yet in Tableland.

By restricting JOINs and disallowing UNION we are able to limit the radius of a DoS attack.

Uhm, I'll derail from your point a bit. If we allow joins, the rowid to be used in the order by might be a bit complicated.
If we join 3 tables, we'd have to be sure that the rowid used in the order by doesn't have repeated rows in the JOIN result. If that's the case, we could still have nondeterministic order if we don't force more rowids.
Makes some sense?

I think I understand your point, but I'm not sure we should consider that a problem. The same happens in a simple situation like a write-query UPDATE xxxxx SET a=1. Technically speaking, any other person having to write access to xxxx could have sent another write-query that would make the intention of UPDATE xxxx SET a=1 not be what that person expected.

Oh yeah. True.

To be clear, I'm not sure that conclusion relates to your previous point (please correct me if I misunderstood). But in any case, this is an invariant that we must enforce. We can't ever allow a write-query to mix tables from different chains since validators can sync them at different speeds.

Yeah. It was not related.

Uhm, I'll derail from your point a bit. If we allow joins, the rowid to be used in the order by might be a bit complicated.
If we join 3 tables, we'd have to be sure that the rowid used in the order by doesn't have repeated rows in the JOIN result. If that's the case, we could still have nondeterministic order if we don't force more rowids.
Makes some sense?

Good point! Looks like the JOIN falls in the same case as the UNION and subqueries (you lose access to rowid along the way):

sqlite> select t.* from t, t2 ORDER BY rowid;
Parse error: no such column: rowid
  select t.* from t, t2 ORDER BY rowid;
                   error here ---^

Going to rectify something I said above "I'm going to assume an UPDATE + SELECT is deterministic by the nature of the statement.". That is only true if there is a one-to-one mapping between the target table and the FROM clause results. From docs:

If the join between the target table and the FROM clause results in multiple output rows for the same target table row, then only one of those output rows is used for updating the target table. The output row selected is arbitrary and might change from one release of SQLite to the next, or from one run to the next.

That would be very hard to check in the parser. So, not sure how we would move forward with UPDATE + SELECT.

Summary of the above discussion

INSERT + SELECT

  • Cannot support UNIONs, JOINs, and subqueries (only flattened SELECTs with direct table access)
  • Only allow references to tables in the same chain
  • For now, block HAVING and GROUP BY until we understand how they behave better
  • Gotta force an ORDER BY rowid

This is pretty limiting but we still would be able to support the data migration/copy use case, which is a big win:

INSERT INTO table2 (column1, column2, column3, ...)
SELECT column1, column2, column3, ...
FROM table1
WHERE condition;

UPDATE+ SELECT

I'm afraid we cannot support it because of the following problem

If the join between the target table and the FROM clause results in multiple output rows for the same target table row, then only one of those output rows is used for updating the target table. The output row selected is arbitrary and might change from one release of SQLite to the next, or from one run to the next.

Checking if the query would fall in that case would be pretty hard.

Ok, awesome, and thanks for this summary. Yes this still seems worth it, especially for the ability to copy table content etc. This is actually a bit of a super power that is actually a bit of a differentiator for Tableland over other blockchain storage solutions in some respect.