jorgecarleitao/arrow2

Reading Parquet int96 timestamps into arrow timestamps overflows


I am running into an issue where reading Parquet int96 timestamps into arrow2 timestamp[ns] arrays can overflow: in debug builds the conversion panics, and in release builds the multiplication wraps silently and produces wrong results.

This issue was also noted in pyarrow/arrow-cpp ARROW-12096.
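For context: a Parquet int96 timestamp stores a Julian day number in its last 4 bytes and the nanoseconds elapsed within that day in its first 8 bytes. Converting that pair into a single i64 of nanoseconds since the Unix epoch involves multiplying the day offset by the number of nanoseconds per day, which only has headroom for dates between roughly 1677-09-21 and 2262-04-11. A standalone sketch of the arithmetic (the function name and constants are mine for illustration, not parquet2's actual code):

```rust
// Julian day number of the Unix epoch, 1970-01-01.
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
// Nanoseconds per day: 86_400 seconds * 1e9.
const NANOS_PER_DAY: i64 = 86_400_000_000_000;

/// Converts an int96 (Julian day, nanoseconds-in-day) pair to nanoseconds
/// since the Unix epoch, returning None instead of overflowing i64.
fn int96_to_ns(julian_day: i32, nanos_in_day: i64) -> Option<i64> {
    let days_since_epoch = julian_day as i64 - JULIAN_DAY_OF_EPOCH;
    days_since_epoch
        .checked_mul(NANOS_PER_DAY)?
        .checked_add(nanos_in_day)
}

fn main() {
    // 2000-01-01 (Julian day 2451545) fits comfortably in timestamp[ns]:
    // 10957 days * 86_400 s = 946_684_800 s since the epoch.
    assert_eq!(int96_to_ns(2_451_545, 0), Some(946_684_800_000_000_000));
    // 3000-01-01 (Julian day 2816788) does not fit: performed with plain
    // `*` instead of `checked_mul`, this multiplication is exactly the
    // kind that panics in debug builds and wraps in release builds.
    assert_eq!(int96_to_ns(2_816_788, 0), None);
    println!("ok");
}
```

This is why only values outside the ~1677–2262 window (the 1000 and 3000 timestamps in the example below) trigger the problem, while the 2000 timestamp converts fine.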


Here is a quick example:

First, write a Parquet file with int96 timestamps, where some timestamps are out of range for the timestamp[ns] type:

import pyarrow as pa
import pyarrow.parquet as papq
import datetime

# Use PyArrow to write Parquet files with int96 timestamps
table = pa.Table.from_pydict({
    "timestamps": pa.array([
        datetime.datetime(1000, 1, 1),
        datetime.datetime(2000, 1, 1),
        datetime.datetime(3000, 1, 1),
    ], pa.timestamp("ms"))
})
papq.write_table(table, "timestamps.parquet", use_deprecated_int96_timestamps=True, store_schema=False)

Reading this file back (inlined below as its raw bytes) in a unit test results in an overflow panic:

use std::io::Cursor;

use arrow2::error::Result;
use arrow2::io::parquet::read::{infer_schema, read_metadata, FileReader};

#[test]
fn read_int96_timestamps() -> Result<()> {
    let timestamp_data = &[
        0x50, 0x41, 0x52, 0x31, 0x15, 0x04, 0x15, 0x48, 0x15, 0x3c, 0x4c, 0x15, 0x06, 0x15, 0x00,
        0x12, 0x00, 0x00, 0x24, 0x00, 0x00, 0x0d, 0x01, 0x08, 0x9f, 0xd5, 0x1f, 0x0d, 0x0a, 0x44,
        0x00, 0x00, 0x59, 0x68, 0x25, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x14,
        0xfb, 0x2a, 0x00, 0x15, 0x00, 0x15, 0x14, 0x15, 0x18, 0x2c, 0x15, 0x06, 0x15, 0x10, 0x15,
        0x06, 0x15, 0x06, 0x1c, 0x00, 0x00, 0x00, 0x0a, 0x24, 0x02, 0x00, 0x00, 0x00, 0x06, 0x01,
        0x02, 0x03, 0x24, 0x00, 0x26, 0x9e, 0x01, 0x1c, 0x15, 0x06, 0x19, 0x35, 0x10, 0x00, 0x06,
        0x19, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61, 0x6d, 0x70, 0x73, 0x15, 0x02,
        0x16, 0x06, 0x16, 0x9e, 0x01, 0x16, 0x96, 0x01, 0x26, 0x60, 0x26, 0x08, 0x29, 0x2c, 0x15,
        0x04, 0x15, 0x00, 0x15, 0x02, 0x00, 0x15, 0x00, 0x15, 0x10, 0x15, 0x02, 0x00, 0x00, 0x00,
        0x15, 0x04, 0x19, 0x2c, 0x35, 0x00, 0x18, 0x06, 0x73, 0x63, 0x68, 0x65, 0x6d, 0x61, 0x15,
        0x02, 0x00, 0x15, 0x06, 0x25, 0x02, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61,
        0x6d, 0x70, 0x73, 0x00, 0x16, 0x06, 0x19, 0x1c, 0x19, 0x1c, 0x26, 0x9e, 0x01, 0x1c, 0x15,
        0x06, 0x19, 0x35, 0x10, 0x00, 0x06, 0x19, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74,
        0x61, 0x6d, 0x70, 0x73, 0x15, 0x02, 0x16, 0x06, 0x16, 0x9e, 0x01, 0x16, 0x96, 0x01, 0x26,
        0x60, 0x26, 0x08, 0x29, 0x2c, 0x15, 0x04, 0x15, 0x00, 0x15, 0x02, 0x00, 0x15, 0x00, 0x15,
        0x10, 0x15, 0x02, 0x00, 0x00, 0x00, 0x16, 0x9e, 0x01, 0x16, 0x06, 0x26, 0x08, 0x16, 0x96, 
        0x01, 0x14, 0x00, 0x00, 0x28, 0x20, 0x70, 0x61, 0x72, 0x71, 0x75, 0x65, 0x74, 0x2d, 0x63,
        0x70, 0x70, 0x2d, 0x61, 0x72, 0x72, 0x6f, 0x77, 0x20, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f,
        0x6e, 0x20, 0x31, 0x32, 0x2e, 0x30, 0x2e, 0x30, 0x19, 0x1c, 0x1c, 0x00, 0x00, 0x00, 0x95,
        0x00, 0x00, 0x00, 0x50, 0x41, 0x52, 0x31
    ];
    let mut reader = Cursor::new(timestamp_data);

    let metadata = read_metadata(&mut reader)?;
    let schema = infer_schema(&metadata)?;
    let reader = FileReader::new(reader, metadata.row_groups, schema, Some(5), None, None);

    let x = reader.collect::<Result<Vec<_>>>().unwrap();
    println!("{:?}", x);
    Ok(())
}
---- io::parquet::read::read_int96_timestamps stdout ----
thread 'io::parquet::read::read_int96_timestamps' panicked at 'attempt to multiply with overflow', /Users/jaychia/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.17.2/src/types.rs:112:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Solution

I would like to propose a two-part solution:

  1. When reading int96 data, convert according to the requested TimeUnit instead of always defaulting to Nanoseconds (see PR #1532).
  2. When inferring a schema from Parquet, allow clients to pass in a ParquetSchemaInferenceOptions, which lets users specify which Arrow type arrow2 should infer for Parquet int96 columns (see PR #1533):

     arrow2::io::parquet::read::schema::infer_schema_with_options(..., options: ParquetSchemaInferenceOptions)