apache/arrow-rs

Support Time Columns in Parquet Record API

pacman82 opened this issue · 6 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

As the author of odbc2parquet, a downstream user of this crate, I would like parquet to support writing Time columns with MicroSeconds precision, so I can offer more precise types to my users. This is especially helpful when odbc2parquet is used as part of a data pipeline in which the next step relies on legacy converted types, since these have no counterpart for NanoSeconds precision.

Currently, if I try to write a column of that type, parquet panics:

not implemented: Conversion for physical type INT64, converted type TIME_MICROS, value 334000000

Describe the solution you'd like

If I am not mistaken, the interface should just be analogous to writing Time columns with NanoSeconds precision.

Describe alternatives you've considered

Lacking support in the parquet crate, I am unlikely to offer that improvement.

Additional context

Here is the original issue that triggered support for Time in odbc2parquet: pacman82/odbc2parquet#285
Sadly, so far I can only offer it with NanoSeconds precision.

The record API does not appear to support time types at all - https://docs.rs/parquet/latest/parquet/record/enum.Field.html

As a workaround I would suggest using either the arrow APIs or the lower-level writer APIs, as they support this and are more actively maintained. Otherwise I would also be happy to review a PR adding support for this, but am unlikely to have time to work on it myself.
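For illustration, here is a minimal sketch of the arrow-based workaround: declaring a Time64 microseconds column and writing it through ArrowWriter. The file name, column name, and values are placeholders, and it assumes a reasonably recent arrow/parquet pair.

use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Time64MicrosecondArray};
use arrow::datatypes::{DataType, Field, Schema, TimeUnit};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn write_time_micros() -> Result<(), Box<dyn std::error::Error>> {
    // Single TIME(MICROS) column; "col1" is a placeholder name.
    let schema = Arc::new(Schema::new(vec![Field::new(
        "col1",
        DataType::Time64(TimeUnit::Microsecond),
        false,
    )]));

    // Values are microseconds since midnight.
    let values = Time64MicrosecondArray::from(vec![334_000_000_i64, 1, 2]);
    let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(values) as ArrayRef])?;

    let file = File::create("time_micros.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}

ArrowWriter should store Time64(Microsecond) as an INT64 physical column annotated as TIME(MICROS), which is what readers limited to legacy converted types expect.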

Hello @tustvold ,

I would suggest using either the arrow APIs or lower-level writer API

I do not use the record API, but am using the lower-level writer API, as returned by:

let cw = Int64Type::get_column_writer_mut(column_writer).unwrap();
// ...
cw.write_batch(values, Some(def_levels), None)?;

Do you have a reproducer? The following works correctly:

#[test]
fn test_time() {
    let schema = Arc::new(
        types::Type::group_type_builder("schema")
            .with_fields(&mut vec![Arc::new(
                types::Type::primitive_type_builder("col1", Type::INT64)
                    .with_repetition(Repetition::REQUIRED)
                    .with_converted_type(ConvertedType::TIME_MICROS)
                    .build()
                    .unwrap(),
            )])
            .build()
            .unwrap(),
    );

    let mut out = Vec::with_capacity(1024);
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(&mut out, schema, props).unwrap();

    let mut row_group = writer.next_row_group().unwrap();
    let mut column = row_group.next_column().unwrap().unwrap();
    column
        .typed::<Int64Type>()
        .write_batch(&[1, 2, 3], None, None)
        .unwrap();
    column.close().unwrap();
    row_group.close().unwrap();

    writer.close().unwrap();
}

not implemented: Conversion for physical type INT64, converted type TIME_MICROS, value 334000000

This error originates from the Field::convert_ methods, which are only used by the record API.
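For completeness, here is a sketch of the read path that triggers it; the path is a placeholder, and it assumes a recent parquet release in which the row iterator yields Result<Row, ParquetError>.

use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

fn dump_rows(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    // get_row_iter materializes each value through the record API's Field
    // conversion; for INT64 + TIME_MICROS this is where the panic fires.
    for row in reader.get_row_iter(None)? {
        println!("{}", row?);
    }
    Ok(())
}

The write path itself is fine, as the test above shows; the panic only appears once the file is read back through this API.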

Hello @tustvold ,

thanks for your response. I apologize for previously stating that I do not use the record API. While the production code of odbc2parquet does not, its integration tests use parquet-read, which is what throws the error. I am sorry for the time you had to spend chasing this down.

Best, Markus

No worries at all, glad we got to the bottom of it 👍

Thank you very much, and merry Christmas!