sadikovi/riff

Potentially corrupt file level statistics for UTF8

Closed this issue · 2 comments

Log in terminal showed some weird min/max values for file statistics. Looks like this is only shown when dataset is repartitioned, statistics look correct when dataset is not repartitioned.

This is strange since statistics are updated on Spark internal rows, not our indexed representation. It is worth exploring.

There are also problems when file statistics discard scanning file that contains valid record, but when searching on integer id, it works; it also works when disabling predicate pushdown.

Looks like only the first stripe contains wrong statistics.

17/05/09 14:08:29 INFO FileReader: Read file statistics content of 334 bytes
17/05/09 14:08:29 INFO FileReader: Read file statistics INT[hasNulls=false, min=9, max=99999010]
17/05/09 14:08:29 INFO FileReader: Read file statistics LONG[hasNulls=false, min=9, max=99999010]
17/05/09 14:08:29 INFO FileReader: Read file statistics UTF8[hasNulls=false, min=abc10099454 abc10099454 abc10099454, max=abc99999010 abc99999010 abc99999010]
17/05/09 14:08:29 INFO FileReader: Read file statistics INT[hasNulls=false, min=9, max=99999010]
17/05/09 14:08:29 INFO FileReader: Read file statistics LONG[hasNulls=false, min=9, max=99999010]
17/05/09 14:08:29 INFO FileReader: Read file statistics UTF8[hasNulls=false, min=9 abc99009, max=abc99999010 abc99999010 abc99999010]
17/05/09 14:08:29 INFO FileReader: Read file statistics UTF8[hasNulls=false, min=abc10099454 abc10099454 abc10099454, max=abc99999010 abc99999010 abc99999010]
17/05/09 14:08:29 INFO FileReader: Read file statistics UTF8[hasNulls=false, min=9 abc99009, max=abc99999010 abc99999010 abc99999010]

17/05/09 14:08:29 INFO FileReader: Stripe statistics: INT[hasNulls=false, min=9, max=9999582]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: LONG[hasNulls=false, min=9, max=9999582]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: UTF8[hasNulls=false, min=abc1099256 abc1099256 abc1099256, max=abc9999582 abc9999582 abc9999582]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: INT[hasNulls=false, min=9, max=9999582]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: LONG[hasNulls=false, min=9, max=9999582]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: UTF8[hasNulls=false, min=9 abc99009, max=abc9999582 abc9999582 abc9999582]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: UTF8[hasNulls=false, min=abc1099256 abc1099256 abc1099256, max=abc9999582 abc9999582 abc9999582]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: UTF8[hasNulls=false, min=9 abc99009, max=abc9999582 abc9999582 abc9999582]
17/05/09 14:08:29 INFO FileReader: ===
17/05/09 14:08:29 INFO FileReader: Stripe statistics: INT[hasNulls=false, min=10000454, max=19999168]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: LONG[hasNulls=false, min=10000454, max=19999168]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: UTF8[hasNulls=false, min=abc10099454 abc10099454 abc10099454, max=abc19999168 abc19999168 abc19999168]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: INT[hasNulls=false, min=10000454, max=19999168]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: LONG[hasNulls=false, min=10000454, max=19999168]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: UTF8[hasNulls=false, min=abc10099454 abc10099454 abc10099454, max=abc19999168 abc19999168 abc19999168]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: UTF8[hasNulls=false, min=abc10099454 abc10099454 abc10099454, max=abc19999168 abc19999168 abc19999168]
17/05/09 14:08:29 INFO FileReader: Stripe statistics: UTF8[hasNulls=false, min=abc10099454 abc10099454 abc10099454, max=abc19999168 abc19999168 abc19999168]
17/05/09 14:08:29 INFO FileReader: ===

It looks like UTF8String can mutate its underlying byte array, therefore we should copy value when checking in statistics on min/max.

--- a/format/src/main/java/com/github/sadikovi/riff/Statistics.java
+++ b/format/src/main/java/com/github/sadikovi/riff/Statistics.java
@@ -360,8 +360,8 @@ public abstract class Statistics extends GenericInternalRow {
     @Override
     protected void updateState(InternalRow row, int ordinal) {
       UTF8String value = row.getUTF8String(ordinal);
-      min = (min == null) ? value : (min.compareTo(value) > 0 ? value : min);
-      max = (max == null) ? value : (max.compareTo(value) < 0 ? value : max);
+      min = (min == null) ? value.clone() : (min.compareTo(value) > 0 ? value.clone() : min);
+      max = (max == null) ? value.clone() : (max.compareTo(value) < 0 ? value.clone() : max);
     }

Actually there must be a better way of doing this, for example, making update while checking the record.