aloneguid/parquet-dotnet

Large float column leads to incorrect values being read

Closed this issue · 4 comments

Version: Parquet.Net 3.7.7
Runtime Version: .Net Core 3.1
OS: Windows 10 2004

Expected behavior

I have written a single-column parquet file with Arrow using Snappy compression. It has 270K rows using a single row group with the float values {1, 2, 3, ..., 270000}. Reading the entire column should return those values.
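For reference, a roughly equivalent file can be generated with Parquet.Net itself. This is only a sketch (the original file was written with Arrow, whose writer may pick different encodings), so the attached file remains the authoritative repro:

using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;

var field = new DataField<float>("Value");
var column = new DataColumn(field, Enumerable.Range(1, 270_000).Select(i => (float)i).ToArray());

using var fileStream = File.Create("floats.parquet");
using var writer = new ParquetWriter(new Schema(field), fileStream);
writer.CompressionMethod = CompressionMethod.Snappy;
using var rowGroup = writer.CreateRowGroup(); // single row group, matching the attached file
rowGroup.WriteColumn(column);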

Actual behavior

When reading the column with Parquet.Net, large slices of the Data array contain zeroes instead of the expected values. The zeroes start fairly early in the array, for example at index 38.

Steps to reproduce the behavior

Use the attached parquet file (floats.parquet.zip) and read using the following code:

using System;
using System.IO;
using Parquet;

using var stream = File.OpenRead("floats.parquet");
using var parquetReader = new ParquetReader(stream);
var dataColumns = parquetReader.ReadEntireRowGroup();
var values = (float[]) dataColumns[0].Data;

Console.WriteLine($"values {{{string.Join(", ", values[0..50])}}}");

Observe that a lot of the values will be zeroes:

values {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0}

General thoughts

Reading the file in Python using PyArrow works fine.

import pandas as pd
import pyarrow.parquet as pq

table = pq.read_table('floats.parquet').to_pandas()

pd.set_option('display.min_rows', 100)
pd.set_option('display.max_rows', 100)

print(len(table["Value"]))
print(table["Value"])

The problem does not manifest itself when using a smaller number of rows. Also, each value is unique and dictionary encoding has not been disabled. So it could be caused by a bug in the dictionary decoding.
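As a rough sanity check of that theory (my own back-of-the-envelope reasoning, not taken from the library code): with 270,000 distinct values the dictionary indices need 19 bits each, whereas a small file only needs a handful of bits per index, so the large file exercises a much wider bit-packing path.

using System;

// Rough estimate of the bit width needed for dictionary indices 0..N-1.
static int IndexBitWidth(int distinctValues) =>
    distinctValues <= 1 ? 0 : (int)Math.Ceiling(Math.Log2(distinctValues));

Console.WriteLine(IndexBitWidth(270_000)); // 19 bits per index
Console.WriteLine(IndexBitWidth(1_000));   // 10 bits per index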

Are you able to read the same parquet file with ParquetSharp?

It seems the problem is in creating the indexes list in RunLengthBitPackingHybridValuesReader.ReadRleBitpackedHybrid/ReadBitpacked.

Hi @felipepessoto, yes the file reads fine with ParquetSharp (it uses the same C++ implementation as PyArrow under the hood).

I think I found the problem. When the sign bit of the "accumulator" (I don't know what to call it; this code is new to me) becomes 1 — which happens when rawBytes[i] > 127 — we have a problem: the number becomes negative, and Int32 performs an arithmetic shift at both places where it is shifted.

I changed the type to uint, which performs a logical shift.
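A minimal standalone sketch of the difference (not the actual decoder code): when the high byte of the accumulator comes from a raw byte above 127, the sign bit is set, and right-shifting an int drags that sign bit into the extracted value, while the same shift on a uint zero-fills.

using System;

// Simplified illustration, not parquet-dotnet's actual code.
// The top byte of the accumulator came from rawBytes[i] > 127, so the sign bit is set.
int signedAcc = unchecked((int)0xF0000001);
uint unsignedAcc = 0xF0000001u;

// We want the top 4 bits as a small value; the expected result is 15 (0xF).
Console.WriteLine(signedAcc >> 28);   // -1: arithmetic shift replicates the sign bit
Console.WriteLine(unsignedAcc >> 28); // 15: logical shift zero-fills

Once the sign bit leaks in like that, the extracted dictionary indices are wrong, which would presumably explain the zeroes in the decoded column.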

I'm curious how this problem hasn't shown up before, because it is very likely to occur, and it is critical, since it can cause data loss (if you read data and save it back).