[AUDIT] [SPARK-49743][SQL] OptimizeCsvJsonExpr should not change schema fields when pruning GetArrayStructFields
Closed this issue · 1 comments
amahussein commented
This PR affects the from_json operator, and at least we need to test the behavior on the plugin.
SELECT
from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
range(3) as t
Earlier, the result would have been:
Array([ArraySeq(0),ArraySeq(null)], [ArraySeq(1),ArraySeq(null)], [ArraySeq(2),ArraySeq(null)])
The new result (verified through spark-shell) is:
Array([ArraySeq(0),ArraySeq(0)], [ArraySeq(1),ArraySeq(1)], [ArraySeq(2),ArraySeq(2)])
revans2 commented
I just looked at this a bit more deeply, and this is a bug in a logical plan optimization in Spark. What is more, we don't support top-level arrays in from_json yet, so this does not impact us at all.
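For anyone trying to picture how a schema-pruning optimization could turn `.A` into null, here is a minimal toy model in plain Python. It is not Spark code. It rests on two assumptions that this thread does not confirm: that the JSON parser matches keys by exact (case-sensitive) name, and that the old rule built the pruned schema from the field name as written in the query instead of the name declared in the schema. All function names below are hypothetical.

```python
import json

def parse_with_schema(json_text, schema_fields):
    """Toy stand-in for from_json on array<struct<...>>.

    Keeps only the fields named in the schema and matches JSON keys by
    exact name (ASSUMPTION: case-sensitive matching)."""
    rows = json.loads(json_text)
    return [{f: row.get(f) for f in schema_fields} for row in rows]

def resolve(schema_fields, requested):
    """Case-insensitive lookup of the requested name in the declared schema."""
    for f in schema_fields:
        if f.lower() == requested.lower():
            return f
    raise KeyError(requested)

def get_field_buggy(json_text, schema_fields, requested):
    # ASSUMED old behavior: prune the schema to one field, but name that
    # field after the reference in the query ("A"), not the schema ("a").
    resolve(schema_fields, requested)  # the reference itself still resolves
    pruned = [requested]
    return [row[requested] for row in parse_with_schema(json_text, pruned)]

def get_field_fixed(json_text, schema_fields, requested):
    # ASSUMED new behavior: prune to one field, keeping the schema's own name.
    name = resolve(schema_fields, requested)
    return [row[name] for row in parse_with_schema(json_text, [name])]

schema = ["a", "b"]
for i in range(3):
    text = '[{"a": %d, "b": %d}]' % (i, 2 * i)
    print(get_field_buggy(text, schema, "A"), get_field_fixed(text, schema, "A"))
# prints:
# [None] [0]
# [None] [1]
# [None] [2]
```

Under those assumptions the toy output lines up with the two results quoted above: the pruned schema carrying `A` finds no matching key and yields null, while the one carrying `a` returns the value.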