Non-local variables from let clauses are missing in subsequent for clauses
wzrain opened this issue · 3 comments
Issue:
As described in the title. For example, the query
let $i := parallelize((
{ "commits" : [ { "author" : "Einstein" } ], "repo":"r2"},
{ "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1"}
))
for $j in parallelize(({"id": 1}, {"id": 2}))
where $j.id < count($i.repo)
return $j
results in
[ERROR] An error has occurred: There was an error on line 1 in none:
^
Code: [RBST0004]
Message: Expecting full variable dependency on i but column not found in the data frame.
Metadata: none:LINE:1:COLUMN:0:
This code can also be looked up in the documentation and specifications for more information.
We should investigate this 🙈. Please contact us or file an issue on GitHub with your query.
Link: https://github.com/RumbleDB/rumble/issues
Possible reason:
The execution should flow through the getDataFrameFromUnion() function here: https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/runtime/flwor/clauses/ForClauseSparkIterator.java#L498. Inside that function, however, it seems that only local variables from child clauses are preserved: https://github.com/RumbleDB/rumble/blob/master/src/main/java/org/rumbledb/runtime/flwor/clauses/ForClauseSparkIterator.java#L533. It looks like RDD/DataFrame variables are omitted. It would be great if you could check whether this is the case. Thanks!
I investigated. This query is known not to work, but the error message needed to be clearer.
RumbleDB now detects more proactively when a nested job attempts a parallel execution.
@ghislainfourny Thanks for the comments. Indeed, technically this is a "job in a job" query, so it does make sense to throw an error directly. Still, I think some expressions, like count in the query I mentioned above, could help avoid the nested job, because they return a single long value instead of a gigantic RDD/DataFrame variable.
@ghislainfourny Actually, you are right: the query can be rewritten into a working one:
let $i := count(parallelize((
{ "commits" : [ { "author" : "Einstein" } ], "repo":"r2"},
{ "commits" : [ { "author" : "Goedel" }, { "author" : "Ramanujan" } ], "repo": "r1"})).repo)
for $j in parallelize(({"id": 1}, {"id": 2}))
where $j.id < $i
return $j
So a clearer error message, as in your fix, is enough, I think. I will close the issue. :)
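For readers less familiar with the "job in a job" restriction, the rewrite above can be modelled in plain Python. This is only an illustrative sketch: it uses ordinary lists rather than RumbleDB or Spark, and the function name `filter_ids` is hypothetical. The point it shows is that the aggregate is reduced to a single number once, outside the loop, so the per-item filter only ever captures a scalar and never the whole collection.

```python
# Plain-Python model of the rewritten query (no Spark or RumbleDB involved).
repos = [
    {"commits": [{"author": "Einstein"}], "repo": "r2"},
    {"commits": [{"author": "Goedel"}, {"author": "Ramanujan"}], "repo": "r1"},
]
ids = [{"id": 1}, {"id": 2}]


def filter_ids(ids, repos):
    # Step 1: reduce the first collection to a scalar up front,
    # mirroring `let $i := count(parallelize(...).repo)`.
    repo_count = sum(1 for r in repos if "repo" in r)

    # Step 2: the per-item predicate refers only to that scalar,
    # mirroring `where $j.id < $i`.
    return [j for j in ids if j["id"] < repo_count]


result = filter_ids(ids, repos)
```

With the data from this issue, the count is 2, so only the object with id 1 passes the filter.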