Non-local variables from let clauses are missing in subsequent for clauses

Question

wzrain opened this issue 3 years ago · 3 comments

Issue:
As described in the title. For example, the following query:

let $i := parallelize((
     { "commits" : [ { "author" : "Einstein" } ], "repo": "r2" },
     { "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1" }
))
for $j in parallelize(({"id": 1}, {"id": 2}))
where $j.id < count($i.repo)
return $j

results in

[ERROR] An error has occurred: There was an error on line 1 in none:


^

Code: [RBST0004]
Message: Expecting full variable dependency on i but column not found in the data frame.
Metadata: none:LINE:1:COLUMN:0:
This code can also be looked up in the documentation and specifications for more information.

We should investigate this 🙈. Please contact us or file an issue on GitHub with your query. 
Link: https://github.com/RumbleDB/rumble/issues

Possible reason:
The execution should flow through the getDataFrameFromUnion() function here: https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/runtime/flwor/clauses/ForClauseSparkIterator.java#L498. Inside that function, however, it seems that only local variables from child clauses are preserved: https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/runtime/flwor/clauses/ForClauseSparkIterator.java#L533. It looks like RDD/DataFrame variables are omitted. Could you check whether this is the case? Thanks!

Answer 1 · 2022-02-07T12:48:41.000Z

I investigated. This query is known not to work, but the error message needed to be clearer.

RumbleDB now detects more proactively when a nested job attempts a parallel execution.

#1170

Answer 2 · 2022-02-07T15:34:55.000Z

@ghislainfourny Thanks for the comments. Technically this is indeed a "job in a job" query, so it makes sense to throw an error directly. Still, I think some expressions, such as count in the query above, could avoid the nested job, because they return a single long value rather than a potentially gigantic RDD/DataFrame variable.

Answer 3 · 2022-02-08T10:03:51.000Z

@ghislainfourny Actually you are right, since the query can be rewritten into a working one:

let $i := count(parallelize((
     { "commits" : [ { "author" : "Einstein" } ], "repo": "r2" },
     { "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1" }
)).repo)
for $j in parallelize(({"id": 1}, {"id": 2})) 
where $j.id < $i
return $j

So a clearer error message, as in your fix, is indeed enough, I think. I will close the issue. :)