Reading Parquet int96 timestamps into arrow timestamps overflows
jaychia commented
I am running into an issue where reading Parquet int96 timestamps into arrow2 timestamp[ns]
arrays can silently overflow, producing incorrect results.
This issue was also noted in pyarrow/arrow-cpp ARROW-12096.
Here is a quick example:
First, write a Parquet file with int96 timestamps, where some timestamps are out of range for the timestamp[ns]
type (an i64 of nanoseconds since the Unix epoch can only represent dates from roughly 1677-09-21 to 2262-04-11):
import pyarrow as pa
import pyarrow.parquet as papq
import datetime

# Use PyArrow to write a Parquet file with int96 timestamps
table = pa.Table.from_pydict({
    "timestamps": pa.array([
        datetime.datetime(1000, 1, 1),
        datetime.datetime(2000, 1, 1),
        datetime.datetime(3000, 1, 1),
    ], pa.timestamp("ms"))
})
papq.write_table(table, "timestamps.parquet", use_deprecated_int96_timestamps=True, store_schema=False)
Reading this file in a unit test results in an overflow panic:
use std::io::Cursor;

use arrow2::error::Result;
use arrow2::io::parquet::read::{infer_schema, read_metadata, FileReader};

#[test]
fn read_int96_timestamps() -> Result<()> {
    // Raw bytes of timestamps.parquet as written by the Python snippet above
    let timestamp_data = &[
        0x50, 0x41, 0x52, 0x31, 0x15, 0x04, 0x15, 0x48, 0x15, 0x3c, 0x4c, 0x15, 0x06, 0x15, 0x00,
        0x12, 0x00, 0x00, 0x24, 0x00, 0x00, 0x0d, 0x01, 0x08, 0x9f, 0xd5, 0x1f, 0x0d, 0x0a, 0x44,
        0x00, 0x00, 0x59, 0x68, 0x25, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x14,
        0xfb, 0x2a, 0x00, 0x15, 0x00, 0x15, 0x14, 0x15, 0x18, 0x2c, 0x15, 0x06, 0x15, 0x10, 0x15,
        0x06, 0x15, 0x06, 0x1c, 0x00, 0x00, 0x00, 0x0a, 0x24, 0x02, 0x00, 0x00, 0x00, 0x06, 0x01,
        0x02, 0x03, 0x24, 0x00, 0x26, 0x9e, 0x01, 0x1c, 0x15, 0x06, 0x19, 0x35, 0x10, 0x00, 0x06,
        0x19, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61, 0x6d, 0x70, 0x73, 0x15, 0x02,
        0x16, 0x06, 0x16, 0x9e, 0x01, 0x16, 0x96, 0x01, 0x26, 0x60, 0x26, 0x08, 0x29, 0x2c, 0x15,
        0x04, 0x15, 0x00, 0x15, 0x02, 0x00, 0x15, 0x00, 0x15, 0x10, 0x15, 0x02, 0x00, 0x00, 0x00,
        0x15, 0x04, 0x19, 0x2c, 0x35, 0x00, 0x18, 0x06, 0x73, 0x63, 0x68, 0x65, 0x6d, 0x61, 0x15,
        0x02, 0x00, 0x15, 0x06, 0x25, 0x02, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74, 0x61,
        0x6d, 0x70, 0x73, 0x00, 0x16, 0x06, 0x19, 0x1c, 0x19, 0x1c, 0x26, 0x9e, 0x01, 0x1c, 0x15,
        0x06, 0x19, 0x35, 0x10, 0x00, 0x06, 0x19, 0x18, 0x0a, 0x74, 0x69, 0x6d, 0x65, 0x73, 0x74,
        0x61, 0x6d, 0x70, 0x73, 0x15, 0x02, 0x16, 0x06, 0x16, 0x9e, 0x01, 0x16, 0x96, 0x01, 0x26,
        0x60, 0x26, 0x08, 0x29, 0x2c, 0x15, 0x04, 0x15, 0x00, 0x15, 0x02, 0x00, 0x15, 0x00, 0x15,
        0x10, 0x15, 0x02, 0x00, 0x00, 0x00, 0x16, 0x9e, 0x01, 0x16, 0x06, 0x26, 0x08, 0x16, 0x96,
        0x01, 0x14, 0x00, 0x00, 0x28, 0x20, 0x70, 0x61, 0x72, 0x71, 0x75, 0x65, 0x74, 0x2d, 0x63,
        0x70, 0x70, 0x2d, 0x61, 0x72, 0x72, 0x6f, 0x77, 0x20, 0x76, 0x65, 0x72, 0x73, 0x69, 0x6f,
        0x6e, 0x20, 0x31, 0x32, 0x2e, 0x30, 0x2e, 0x30, 0x19, 0x1c, 0x1c, 0x00, 0x00, 0x00, 0x95,
        0x00, 0x00, 0x00, 0x50, 0x41, 0x52, 0x31
    ];
    let mut reader = Cursor::new(timestamp_data);
    let metadata = read_metadata(&mut reader)?;
    let schema = infer_schema(&metadata)?;
    let reader = FileReader::new(reader, metadata.row_groups, schema, Some(5), None, None);
    let x = reader.collect::<Result<Vec<_>>>().unwrap();
    println!("{:?}", x);
    Ok(())
}
---- io::parquet::read::read_int96_timestamps stdout ----
thread 'io::parquet::read::read_int96_timestamps' panicked at 'attempt to multiply with overflow', /Users/jaychia/.cargo/registry/src/github.com-1ecc6299db9ec823/parquet2-0.17.2/src/types.rs:112:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
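For context, the panic comes from the int96-to-nanoseconds conversion: int96 encodes a Julian day plus the nanoseconds within that day, and mapping it to a single i64 of nanoseconds since the Unix epoch requires a multiplication that overflows for dates far from the epoch. Here is a minimal sketch of that conversion, modeled loosely on parquet2's types.rs (names and layout here are illustrative, not the library's exact code):

// Illustrative int96 -> i64 nanoseconds conversion (not parquet2's exact code).
fn int96_to_i64_ns(value: [u32; 3]) -> i64 {
    const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // 1970-01-01 as a Julian day
    const SECONDS_PER_DAY: i64 = 86_400;
    const NANOS_PER_SECOND: i64 = 1_000_000_000;

    // int96 layout: two u32s of nanoseconds within the day, one u32 Julian day
    let nanos_in_day = (value[0] as i64) | ((value[1] as i64) << 32);
    let day = value[2] as i64;
    let seconds = (day - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY;

    // For a date like 3000-01-01, `seconds * NANOS_PER_SECOND` exceeds
    // i64::MAX: an "attempt to multiply with overflow" panic in debug
    // builds, and a silent wraparound in release builds.
    seconds * NANOS_PER_SECOND + nanos_in_day
}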
Solution
I would like to propose a two-part solution:
- When reading int96 data, we should take into account the requested TimeUnit instead of always defaulting to Nanoseconds (see PR: #1532)
- When inferring a schema from Parquet, we should allow clients to pass in ParquetSchemaInferenceOptions, which will let users specify how they want arrow2 to infer the Arrow types for Parquet Int96 types (see PR: #1533, and the sketch after this list):
arrow2::io::parquet::read::schema::infer_schema_with_options(..., options: ParquetSchemaInferenceOptions)
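To illustrate, here is a hedged sketch of how the proposed API could be used to read the file above without overflowing. The ParquetSchemaInferenceOptions field name (int96_coerce_to_timeunit) and the exact infer_schema_with_options signature are assumptions for illustration, not the final arrow2 API:

use std::fs::File;

use arrow2::datatypes::TimeUnit;
use arrow2::error::Result;
use arrow2::io::parquet::read::{read_metadata, FileReader};
// Proposed additions; the names below are hypothetical.
use arrow2::io::parquet::read::schema::{infer_schema_with_options, ParquetSchemaInferenceOptions};

fn main() -> Result<()> {
    let mut reader = File::open("timestamps.parquet")?;
    let metadata = read_metadata(&mut reader)?;

    // Ask arrow2 to map Parquet Int96 to timestamp[ms] instead of the
    // default timestamp[ns], so out-of-range dates don't overflow.
    let options = ParquetSchemaInferenceOptions {
        int96_coerce_to_timeunit: TimeUnit::Millisecond,
    };
    let schema = infer_schema_with_options(&metadata, options)?;

    let reader = FileReader::new(reader, metadata.row_groups, schema, None, None, None);
    for chunk in reader {
        println!("{:?}", chunk?);
    }
    Ok(())
}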