r-lib/nanoparquet

Read & write dates, times, related types

Write:

  • Date
  • POSIXct
  • POSIXlt
  • hms
  • difftime

Read:

  • Date
  • POSIXct
  • POSIXlt
  • hms
  • difftime

Read Parquet logical types:

  • DATE
  • TIME
    • UTC?
    • MILLIS
    • MICROS
    • NANOS
  • TIMESTAMP
    • UTC?
    • MILLIS
    • MICROS
    • NANOS
  • INTERVAL

Read Parquet converted types:

  • DATE
  • TIME_MILLIS
  • TIME_MICROS
  • TIMESTAMP_MILLIS
  • TIMESTAMP_MICROS

The old convention for timestamps:

  • INT96

There should be no POSIXlt in the data frame; I think it should be converted to POSIXct.
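
A minimal sketch of that conversion, using base R only. The helper name `normalize_posixlt()` is hypothetical and not a nanoparquet function; it only illustrates turning POSIXlt columns into POSIXct before writing.

```r
# Hypothetical helper (not part of nanoparquet): convert every POSIXlt
# column of a data frame to POSIXct and leave all other columns untouched.
normalize_posixlt <- function(df) {
  for (nm in names(df)) {
    if (inherits(df[[nm]], "POSIXlt")) {
      df[[nm]] <- as.POSIXct(df[[nm]])
    }
  }
  df
}

# data.frame() itself converts POSIXlt, so the column is added with `$<-`.
d <- data.frame(id = 1:2)
d$t <- as.POSIXlt(c("2024-01-01 00:00:00", "2024-06-01 12:30:00"), tz = "UTC")
d <- normalize_posixlt(d)
class(d$t)
```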

Arrow seems to handle difftime specially:

❯ d <- data.frame(x = as.difftime(10 + 1/9, units = "mins"))
❯ arrow::write_parquet(d, "/tmp/difftime.parquet")
❯ parquet_schema("/tmp/difftime.parquet")
# A data frame: 2 × 11
  file_name             name   type  type_length repetition_type converted_type logical_type num_children scale precision field_id
  <chr>                 <chr>  <chr>       <int> <chr>           <chr>          <I<list>>           <int> <int>     <int>    <int>
1 /tmp/difftime.parquet schema NA             NA REQUIRED        NA             <NULL>                  1    NA        NA       NA
2 /tmp/difftime.parquet x      INT64          NA OPTIONAL        NA             <NULL>                 NA    NA        NA       NA
❯ d2 <- arrow::read_parquet("/tmp/difftime.parquet")
❯ d2
# A tibble: 1 × 1
  x       
  <drtn>  
1 606 secs
❯ attributes(d2$x)
$class
[1] "difftime"

$units
[1] "secs"

It is stored in the Arrow schema:

❯ kv <- parquet_metadata("/tmp/difftime.parquet")$file_meta_data$key_value_metadata
❯ parse_arrow_schema(kv[[1]]$value[2])
$columns
# A data frame: 1 × 6
  name  type_type type             nullable dictionary custom_metadata 
  <chr> <chr>     <I<list>>        <lgl>    <I<list>>  <I<list>>       
1 x     Duration  <named list [1]> TRUE     <NULL>     <named list [2]>

$custom_metadata
# A data frame: 1 × 2
  key   value                                                                                                                     
  <chr> <chr>                                                                                                                     
1 r     "A\n3\n263168\n197888\n5\nUTF-8\n531\n1\n531\n1\n254\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n1\nx\n254\n1026\n511\n16\…

$endianness
[1] "Little"

$features
character(0)
❯ parse_arrow_schema(kv[[1]]$value[2])$columns$type
[[1]]
[[1]]$unit
[1] "SECOND"

Seems like we also need to write difftime as INT64, because if it is DOUBLE, then Arrow will not read it back as difftime.

Since the INTERVAL type is a better fit for difftime anyway, maybe we should just support that instead of losing information with the INT64 conversion.

Or maybe we can use INT64 with a different unit, e.g. millis, micros or nanos, and convert between that and secs when reading and writing? That only works if the arrow R package is OK with reading it back into difftime.
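
A rough sketch of the millisecond variant, in base R only. The helper names are hypothetical, the values are kept as rounded doubles because base R has no 64-bit integer type, and whether arrow would read such a column back as difftime is not checked here.

```r
# Hypothetical helpers: encode a difftime as whole milliseconds (the values
# an INT64 column would hold) and decode those values back to seconds.
difftime_to_millis <- function(x) {
  # as.numeric() with `units` converts from whatever unit `x` uses.
  round(as.numeric(x, units = "secs") * 1000)
}

millis_to_difftime <- function(ms) {
  as.difftime(ms / 1000, units = "secs")
}

x <- as.difftime(10 + 1/9, units = "mins")
ms <- difftime_to_millis(x)
y <- millis_to_difftime(ms)
ms
y
```

With whole seconds the value above comes back as 606 secs, as in the Arrow output; with milliseconds it round-trips as 606.667 secs.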

(We could probably also just skip difftime support for now?)

When reading TIME and TIMESTAMP logical types, we're just going to ignore the UTC field. It is completely implementation-dependent anyway, so it is not at all clear what we should do with it. (In fact, we are already doing this at the time of writing.)

EDIT: Arrow uses the UTC field to decide whether to set the tzone attribute, so we do the same for POSIXct. We don't use it for TIME, though.
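
A minimal sketch of that rule for POSIXct, assuming the timestamps arrive as microseconds since the Unix epoch. The helper `micros_to_posixct()` and its `is_utc` argument are hypothetical names for illustration, not nanoparquet's actual reader code.

```r
# Hypothetical helper: turn TIMESTAMP values (microseconds since the Unix
# epoch) into POSIXct, setting the tzone attribute only when the UTC flag
# of the logical type is TRUE.
micros_to_posixct <- function(micros, is_utc) {
  out <- structure(micros / 1e6, class = c("POSIXct", "POSIXt"))
  if (isTRUE(is_utc)) {
    attr(out, "tzone") <- "UTC"
  }
  out
}

a <- micros_to_posixct(1704067200000000, is_utc = TRUE)
b <- micros_to_posixct(1704067200000000, is_utc = FALSE)
attr(a, "tzone")  # "UTC"
attr(b, "tzone")  # NULL, so printing uses the session time zone
```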

We don't need to support INTERVAL now. Maybe we can support it later, either by:

  • converting to hms,
  • reading it into a list column of three integers, or
  • simply reading the byte array.

None of these are great, though.
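
For reference, the second option could look roughly like the sketch below. In the Parquet format an INTERVAL is a 12-byte fixed-length byte array holding three little-endian unsigned 32-bit integers: months, days and milliseconds. The helper `decode_interval()` is a hypothetical name, and it takes one value as a raw vector.

```r
# Hypothetical helper: split one 12-byte INTERVAL value into its three
# little-endian 32-bit fields: months, days and milliseconds.
decode_interval <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) == 12L)
  # readBin() reads these as signed integers, so fields of 2^31 or more
  # would be misread; the format defines them as unsigned.
  v <- readBin(bytes, what = "integer", n = 3L, size = 4L, endian = "little")
  c(months = v[1], days = v[2], millis = v[3])
}

# 1 month, 2 days, 1000 milliseconds (0x000003e8)
raw_interval <- as.raw(c(1, 0, 0, 0,  2, 0, 0, 0,  0xe8, 0x03, 0, 0))
decode_interval(raw_interval)
```

The signed-integer limitation is one reason this option is not great either: large field values would need to be read as doubles instead.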