delta-io/connectors

Question about AddFile dataChange flag.

horizonzy opened this issue · 2 comments

I have a question about the dataChange flag in AddFile. In the transaction log the flag is true, but when I read the AddFile in the program, the value is false. Is this a bug?

In the transaction log:

~: cat 00000000000000000000.json
{"protocol":{"minReaderVersion":1,"minWriterVersion":2}}
{"metaData":{"id":"91a0d9a7-952a-42a8-abdf-73cbf00b1849","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"id\",\"type\":\"long\",\"nullable\":true,\"metadata\":{}}]}","partitionColumns":[],"configuration":{},"createdTime":1662000777202}}
{"add":{"path":"part-00000-f7f493e9-7155-4171-91ac-7e046e31c269-c000.snappy.parquet","partitionValues":{},"size":296,"modificationTime":1662000778449,"dataChange":true,"stats":"{\"numRecords\":0,\"minValues\":{},\"maxValues\":{},\"nullCount\":{}}"}}
{"add":{"path":"part-00001-6d4b2c01-6d87-4aa3-a9a5-bb315cd665e8-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":0},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"part-00003-2bfd96d1-aaf5-4cc5-930b-b59d12d17ea9-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":1},\"maxValues\":{\"id\":1},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"part-00005-873e9693-7c39-4419-87de-0c27d9f64b37-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":2},\"maxValues\":{\"id\":2},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"part-00007-8154cb0e-84b9-4009-9eb2-17ed532a8c82-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":3},\"maxValues\":{\"id\":3},\"nullCount\":{\"id\":0}}"}}
{"add":{"path":"part-00009-34c9adf2-b652-4450-8b26-dee491ae1ab5-c000.snappy.parquet","partitionValues":{},"size":478,"modificationTime":1662000778615,"dataChange":true,"stats":"{\"numRecords\":1,\"minValues\":{\"id\":4},\"maxValues\":{\"id\":4},\"nullCount\":{\"id\":0}}"}}
{"commitInfo":{"timestamp":1662000778641,"operation":"WRITE","operationParameters":{"mode":"ErrorIfExists","partitionBy":"[]"},"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"6","numOutputRows":"5","numOutputBytes":"2686"},"engineInfo":"Apache-Spark/3.2.1 Delta-Lake/1.2.0","txnId":"eb52cd3e-4e5b-4d1f-8c91-27197f59f74f"}}

In the program:
[Screenshot: the AddFile as read in the program, with dataChange = false]

Looking at the code, I noticed that it forcibly sets dataChange to false. Is this intended for some particular cases?
[Screenshot: the code that sets dataChange to false]

I don't think it's a bug. Since you're referencing the Standalone library, can you please open this issue in the connectors repo?

FWIW the same behavior is here too so we can leave this open and resolve this when there's a solid answer.

Hi @horizonzy, the dataChange flag is only meaningful when looking at the actions added in a specific version (that is, the actions within a single commit), not when looking at all the AddFiles in a snapshot. A snapshot merges the adds from many commits into one set of live files, so the per-commit flag no longer describes anything. AFAIK, here we just set dataChange=false to canonicalize the actions.
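
To illustrate the distinction, here is a minimal Python sketch of log replay. It is not the Delta Standalone implementation; the file names, the two sample commits, and the helper names `adds_in_commit` and `snapshot_adds` are all hypothetical. It only shows why a per-commit view keeps the original flag while a snapshot view can normalize it to false.

```python
import json

# Two hypothetical commits, each a list of JSON action lines
# (same shape as the transaction log shown in the question).
commit_0 = [
    '{"add":{"path":"a.parquet","size":478,"dataChange":true}}',
    '{"add":{"path":"b.parquet","size":478,"dataChange":true}}',
]
commit_1 = [
    # A compaction-style rewrite: same rows, new file layout.
    '{"remove":{"path":"a.parquet","dataChange":false}}',
    '{"remove":{"path":"b.parquet","dataChange":false}}',
    '{"add":{"path":"ab.parquet","size":900,"dataChange":false}}',
]


def adds_in_commit(lines):
    """Per-commit view: return the add actions exactly as written."""
    actions = [json.loads(line) for line in lines]
    return [action["add"] for action in actions if "add" in action]


def snapshot_adds(commits):
    """Snapshot view: replay commits in order and keep only live files.

    dataChange is forced to False here (an assumption that mirrors the
    behavior described in this thread) so that every add in the snapshot
    has one canonical form regardless of which commit introduced it.
    """
    live = {}
    for lines in commits:
        for line in lines:
            action = json.loads(line)
            if "add" in action:
                add = dict(action["add"])
                add["dataChange"] = False
                live[add["path"]] = add
            elif "remove" in action:
                live.pop(action["remove"]["path"], None)
    return list(live.values())
```

Calling `adds_in_commit(commit_0)` returns adds with `dataChange` still true, which is the view to use when asking what a given version changed. Calling `snapshot_adds([commit_0, commit_1])` returns only the live file, with the flag normalized.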