Support Time Columns in Parquet Record API
pacman82 opened this issue · 6 comments
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
As the author of the upstream `odbc2parquet`, I would like `parquet` to support writing Time columns with MicroSeconds precision, so I can offer more precise types to my users. This is especially helpful if `odbc2parquet` is used as part of a data pipeline in which the next step relies on legacy converted types, since these have no counterpart for NanoSeconds precision.

Currently, if I try to write a column of that type, `parquet` emits a panic:

```
not implemented: Conversion for physical type INT64, converted type TIME_MICROS, value 334000000
```
Describe the solution you'd like
If I am not mistaken, the interface should just be analogous to writing Time columns with NanoSeconds precision.
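For illustration, a minimal sketch of what "analogous" means at the schema level, assuming the logical-type builder API (the column name is made up, and import paths vary slightly between `parquet` versions): declaring the MicroSeconds column should differ from the NanoSeconds one only in the time unit.

```rust
use parquet::basic::{LogicalType, Repetition, Type as PhysicalType};
use parquet::format::{MicroSeconds, TimeUnit};
use parquet::schema::types::Type;

// Sketch: a TIME(MICROS) column declaration, mirroring the NANOS case.
let time_micros = Type::primitive_type_builder("t_us", PhysicalType::INT64)
    .with_repetition(Repetition::REQUIRED)
    .with_logical_type(Some(LogicalType::Time {
        is_adjusted_to_u_t_c: false,
        unit: TimeUnit::MICROS(MicroSeconds {}),
    }))
    .build()
    .unwrap();
```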
Describe alternatives you've considered
Lacking support in the `parquet` crate, I am unlikely to offer that improvement.
Additional context
Here is the original issue that triggered support for Time in `odbc2parquet`: pacman82/odbc2parquet#285. Sadly, so far I can only offer it with NanoSeconds precision.
The record API does not appear to support time types at all - https://docs.rs/parquet/latest/parquet/record/enum.Field.html
As a workaround, I would suggest using either the arrow APIs or the lower-level writer APIs, as they support this and are more actively maintained. Otherwise, I would also be happy to review a PR adding support for this, but I am unlikely to have time to work on it myself.
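As a rough illustration of the arrow route (a sketch under assumptions: the file name, column name, and values are made up), writing a `Time64MicrosecondArray` through `ArrowWriter` yields a TIME(MICROS) column without involving the record API:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::{ArrayRef, Time64MicrosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn write_time_micros() -> Result<(), Box<dyn std::error::Error>> {
    // One non-nullable TIME(MICROS) column; name and values are arbitrary.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "t",
        DataType::Time64(TimeUnit::Microsecond),
        false,
    )]));
    let array: ArrayRef = Arc::new(Time64MicrosecondArray::from(vec![334_000_000_i64]));
    let batch = RecordBatch::try_new(schema.clone(), vec![array])?;

    let file = File::create("time_micros.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```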
Hello @tustvold,
> I would suggest using either the arrow APIs or lower-level writer API
I do not use the record API, but am using the lower-level writer API, as returned by:

```rust
// Obtain the typed writer for an INT64 column.
let cw = Int64Type::get_column_writer_mut(column_writer).unwrap();
// ...
cw.write_batch(values, Some(def_levels), None)?;
```
Do you have a reproducer? The following works correctly:
```rust
#[test]
fn test_time() {
    let schema = Arc::new(
        types::Type::group_type_builder("schema")
            .with_fields(&mut vec![Arc::new(
                types::Type::primitive_type_builder("col1", Type::INT64)
                    .with_repetition(Repetition::REQUIRED)
                    .with_converted_type(ConvertedType::TIME_MICROS)
                    .build()
                    .unwrap(),
            )])
            .build()
            .unwrap(),
    );

    let mut out = Vec::with_capacity(1024);
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(&mut out, schema, props).unwrap();
    let mut row_group = writer.next_row_group().unwrap();
    let mut column = row_group.next_column().unwrap().unwrap();
    column
        .typed::<Int64Type>()
        .write_batch(&[1, 2, 3], None, None)
        .unwrap();
    column.close().unwrap();
    row_group.close().unwrap();
    writer.close().unwrap();
}
```
> not implemented: Conversion for physical type INT64, converted type TIME_MICROS, value 334000000

This error originates from the `Field::convert_*` methods in `parquet::record`, which are only used by the record API.
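For context, a minimal sketch of how that path gets exercised on the read side (the file name and function are hypothetical; this mirrors what a row-based reader such as `parquet-read` does):

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn dump_rows() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("time_micros.parquet")?;
    let reader = SerializedFileReader::new(file)?;
    // Materializing each Row funnels every INT64 value through
    // Field::convert_*, which is where the TIME_MICROS panic fires.
    for row in reader.get_row_iter(None)? {
        println!("{row:?}");
    }
    Ok(())
}
```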
Hello @tustvold,

thanks for your response. I apologize for stating previously that I did not use the record API. While the production code of `odbc2parquet` does not, it uses `parquet-read` in its integration tests, which throws the error. I am sorry for the time you had to spend chasing this down.

Best, Markus
No worries at all, glad we got to the bottom of it
Thank you a lot and a merry Christmas!