NVIDIA/spark-rapids

[AUDIT] [SPARK-49743][SQL] OptimizeCsvJsonExpr should not change schema fields when pruning GetArrayStructFields

Closed this issue · 1 comments

apache/spark@a4fb6cbfda2

This PR affects the from_json operator, so at a minimum we need to test the behavior on the plugin.

SELECT
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').a,
  from_json('[{"a": '||id||', "b": '|| (2*id) ||'}]', 'array<struct<a: INT, b: INT>>').A
FROM
  range(3) as t

Earlier, the result would have been:

Array([ArraySeq(0),ArraySeq(null)], [ArraySeq(1),ArraySeq(null)], [ArraySeq(2),ArraySeq(null)])

whereas the new result is (verified through spark-shell):

Array([ArraySeq(0),ArraySeq(0)], [ArraySeq(1),ArraySeq(1)], [ArraySeq(2),ArraySeq(2)])

I just looked at this a bit more deeply, and this is a bug in a logical plan optimization in Spark. What is more, we don't support top-level arrays in from_json yet, so this does not impact us at all.
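
For anyone skimming this later, here is a rough toy model in plain Python of why the old result had nulls for `.A`. It is not Spark code. It assumes (my reading of the upstream title, not something verified against the optimizer source) that the pruned schema used to take the field name as written in the query, and that JSON keys are matched against schema field names exactly. All function names below are hypothetical.

```python
import json

def parse_with_schema(json_text, field_names):
    # Toy stand-in for from_json on array<struct<...>>.
    # Assumption: JSON keys are matched against schema field names exactly;
    # a field with no matching key becomes None.
    rows = json.loads(json_text)
    return [{name: row.get(name) for name in field_names} for row in rows]

def resolve(field_names, requested):
    # Case-insensitive field resolution that returns the schema's own spelling.
    for name in field_names:
        if name.lower() == requested.lower():
            return name
    raise KeyError(requested)

def extract_before_fix(json_text, schema, requested):
    # Assumed old behavior: the pruned schema keeps the name as written in
    # the query ("A"), so the parser looks for a key the JSON does not have.
    resolve(schema, requested)  # the field must still exist in the schema
    pruned = [requested]
    return [row[requested] for row in parse_with_schema(json_text, pruned)]

def extract_after_fix(json_text, schema, requested):
    # Assumed new behavior: pruning keeps the original schema field ("a").
    original = resolve(schema, requested)
    return [row[original] for row in parse_with_schema(json_text, [original])]

doc = '[{"a": 1, "b": 2}]'   # the row produced for id = 1 in the query above
schema = ["a", "b"]

print(extract_before_fix(doc, schema, "a"))  # [1]
print(extract_before_fix(doc, schema, "A"))  # [None]
print(extract_after_fix(doc, schema, "A"))   # [1]
```

The `[None]` versus `[1]` difference for `"A"` mirrors the `ArraySeq(null)` versus `ArraySeq(1)` difference in the second column of the results above.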