Arbitrary dependency expression
Matts966 opened this issue · 4 comments
The current AlphaSQL (v0.6.0) deliberately keeps dependency resolution simple: it only tracks dependencies between table references and CREATE TABLE statements, so it cannot express arbitrary dependencies. #2
For example, we can't simply insert values into a table before selecting from it if there is no CREATE TABLE statement for that table.
There are 4 ways to solve this problem.
- We can use a table as a dependency marker: create the table in the SQL file that should be processed first, and select from it in the SQL file that should be processed later. Functions can serve the same purpose once #4 is implemented.
- We can split the DAG into several DAGs and run them serially.
- We can serialize all the queries in a single SQL script, but that is inefficient.
- We can implement a new flag to resolve arbitrary dependencies, but that would be complicated.
The first approach already works today, and I think it is simple.
An example implementation is in ./samples/sample-arbitrary-dependency-graph-with-drop-statement.
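To make the first approach concrete, here is a minimal sketch in Python. It is a toy model, not AlphaSQL's implementation: the file names, the `dataset.dummy_dependency` table, and the regex-based extraction are all assumptions made for illustration. It only shows that a table created in one file and selected from in another is enough to force an ordering, while a bare INSERT creates no edge.

```python
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical SQL files. `dataset.dummy_dependency` exists only to force ordering.
files = {
    "insert_first.sql": """
        INSERT INTO dataset.events VALUES (1);
        CREATE OR REPLACE TABLE dataset.dummy_dependency AS SELECT 1 AS done;
    """,
    "select_after.sql": """
        SELECT * FROM dataset.dummy_dependency;
        SELECT * FROM dataset.events;
    """,
}

# Simplified patterns (assumption): real SQL needs a proper parser.
CREATE_RE = re.compile(r"CREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+([\w.]+)", re.I)
REF_RE = re.compile(r"\bFROM\s+([\w.]+)", re.I)

def extract_order(files):
    """Order files so that each table is created before it is referenced."""
    # Map each created table to the file that creates it.
    creators = {}
    for name, sql in files.items():
        for table in CREATE_RE.findall(sql):
            creators[table] = name
    # A file depends on the creator of every table it references.
    graph = {name: set() for name in files}
    for name, sql in files.items():
        for table in REF_RE.findall(sql):
            creator = creators.get(table)
            if creator and creator != name:
                graph[name].add(creator)
    return list(TopologicalSorter(graph).static_order())

order = extract_order(files)
print(order)  # ['insert_first.sql', 'select_after.sql']
```

Note that `dataset.events` contributes no edge here because no file creates it; the ordering comes entirely from the dummy table.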
If someone wants to use the dependency resolution described at the top of #2, the commands below install AlphaSQL v0.5.2.
# Install on macOS
temp=$(mktemp -d)
# Download the release archive into the temporary directory, then unpack it into /usr/local/bin
wget -P "$temp" https://github.com/Matts966/alphasql/releases/download/v0.5.2/alphasql_darwin_amd64.tar.gz \
&& tar -zxvf "$temp/alphasql_darwin_amd64.tar.gz" -C /usr/local/bin --strip=1
Also, you can use Docker.
docker run --rm -v "$(pwd)":/home matts966/alphasql:v0.5.2 [command]
Note that users must follow the rather complex rules below to prevent cycles in the extracted graphs.
- When adding a SQL file, users should write the statements for each table in the order CREATE, INSERT, UPDATE, and then other references to the table. Users who want a different order should write a script or use the dummy tables described above.
- Users should avoid creating a table with the same name more than once.
- When modifying SQL files, users should watch for cycles in the extracted graphs. Detected cycles produce a warning and the type checker reports an error, but other kinds of cycles, such as cycles between INSERT statements, are still possible.
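As an illustration of the last rule, the sketch below shows how a cycle between two files that insert into each other's tables surfaces once INSERT statements contribute edges. The dependency map and file names are hypothetical, and `graphlib` is used here only as a stand-in for whatever graph check the tool performs.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+

# Hypothetical dependency map: file -> files that must run before it.
graph = {
    "a.sql": {"b.sql"},  # a.sql reads a table that b.sql inserts into
    "b.sql": {"a.sql"},  # b.sql reads a table that a.sql inserts into
}

def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError as err:
        # The second argument of CycleError is the detected cycle,
        # with the first node repeated at the end.
        return err.args[1]
    return None

cycle = find_cycle(graph)
if cycle is not None:
    print("cycle detected:", " -> ".join(cycle))
```

Writing the statements for each table in a fixed order, as the first rule asks, is what keeps edges like these from pointing in both directions.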
We are now planning a new warning for INSERT and UPDATE statements whose target table is not created in the same script file. By limiting those side effects to a single script, we can prevent undefined behavior, make it clearer how each target table is built, and use more aggressive caching.
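The planned check could look roughly like the following sketch. It is only an assumption about the behavior described above, not the actual implementation: the function name `side_effect_warnings`, the warning text, and the regex patterns are invented for illustration, and the patterns ignore forms such as INSERT without INTO.

```python
import re

# Simplified patterns (assumption): a real check would use the SQL parser.
CREATE_RE = re.compile(
    r"\bCREATE\s+(?:OR\s+REPLACE\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?([\w.]+)",
    re.I,
)
SIDE_EFFECT_RE = re.compile(r"\b(INSERT\s+INTO|UPDATE)\s+([\w.]+)", re.I)

def side_effect_warnings(sql):
    """Warn about INSERT/UPDATE targets that this script does not create."""
    created = {table.lower() for table in CREATE_RE.findall(sql)}
    warnings = []
    for kind, target in SIDE_EFFECT_RE.findall(sql):
        if target.lower() not in created:
            verb = kind.split()[0].upper()  # "INSERT INTO" -> "INSERT"
            warnings.append(
                f"{verb} targets {target}, which is not created in this file"
            )
    return warnings

script = """
CREATE TABLE dataset.t1 (x INT64);
INSERT INTO dataset.t1 VALUES (1);
UPDATE dataset.lake SET x = 2 WHERE TRUE;
"""
found = side_effect_warnings(script)
print(found)
```

In this example the INSERT is accepted because `dataset.t1` is created in the same script, while the UPDATE on `dataset.lake` produces a warning.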
Also, we plan an option, --warning-as-error, that turns warnings into errors.
If we limit side effects to the same script, we can't update or insert into existing external tables. However, those tables (often called lake tables) should not be updated or inserted into anyway, so that processes stay idempotent. We therefore think --warning-as-error is the right approach and may enable it by default in the future; users can still express arbitrary dependencies as noted above.
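For readers wondering what --warning-as-error would change in practice, here is a small sketch of the intended semantics. The parser, the `toy-checker` program name, and the `exit_code` helper are hypothetical; only the flag name comes from the plan above.

```python
import argparse

def build_parser():
    # Hypothetical CLI; only the flag name is taken from the plan above.
    parser = argparse.ArgumentParser(prog="toy-checker")
    parser.add_argument(
        "--warning-as-error",
        action="store_true",
        help="treat warnings as errors",
    )
    return parser

def exit_code(warnings, errors, warning_as_error):
    """Return a non-zero status for errors, and for warnings in strict mode."""
    if errors:
        return 1
    if warnings and warning_as_error:
        return 1
    return 0

args = build_parser().parse_args(["--warning-as-error"])
sample_warnings = ["INSERT targets dataset.lake, which is not created in this file"]

strict = exit_code(sample_warnings, [], args.warning_as_error)  # 1
lenient = exit_code(sample_warnings, [], False)                 # 0
print(strict, lenient)
```

Making the strict mode the default later would only flip the flag's default value; pipelines that rely on dummy tables for ordering would be unaffected.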