Read & write dates, times, related types
Write:
- Date
- POSIXct
- POSIXlt
- hms
- difftime
Read:
- Date
- POSIXct
- POSIXlt
- hms
- difftime
Read Parquet logical types:
- DATE
- TIME
  - UTC?
  - MILLIS
  - MICROS
  - NANOS
- TIMESTAMP
  - UTC?
  - MILLIS
  - MICROS
  - NANOS
- INTERVAL
Read Parquet converted types:
- DATE
- TIME_MILLIS
- TIME_MICROS
- TIMESTAMP_MILLIS
- TIMESTAMP_MICROS
The old convention for time stamps:
- INT96
There should be no POSIXlt in the data frame; it should be converted to POSIXct, I think.
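A minimal sketch of that conversion, using base R only. The helper name `posixlt_to_posixct` is hypothetical and is not part of any package; it just shows one way a writer could normalize columns before serializing them.

```r
# Hypothetical helper: replace every POSIXlt column with its POSIXct equivalent.
posixlt_to_posixct <- function(df) {
  for (nm in names(df)) {
    if (inherits(df[[nm]], "POSIXlt")) {
      df[[nm]] <- as.POSIXct(df[[nm]])
    }
  }
  df
}

# data.frame() itself converts POSIXlt, so assign the column afterwards
# to really get a POSIXlt column into the data frame.
d <- data.frame(id = 1:2)
d$t <- as.POSIXlt(c("2024-01-01 12:00:00", "2024-06-01 00:00:00"), tz = "UTC")

d2 <- posixlt_to_posixct(d)
class(d$t)   # includes "POSIXlt"
class(d2$t)  # "POSIXct" "POSIXt"
```

The time zone of the POSIXlt column carries over to the `tzone` attribute of the POSIXct result, so no information that a Parquet timestamp could hold is lost in this step.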
Arrow seems to handle difftime specially:
❯ d <- data.frame(x = as.difftime(10 + 1/9, units = "mins"))
❯ arrow::write_parquet(d, "/tmp/difftime.parquet")
❯ parquet_schema("/tmp/difftime.parquet")
# A data frame: 2 × 11
file_name name type type_length repetition_type converted_type logical_type num_children scale precision field_id
<chr> <chr> <chr> <int> <chr> <chr> <I<list>> <int> <int> <int> <int>
1 /tmp/difftime.parquet schema NA NA REQUIRED NA <NULL> 1 NA NA NA
2 /tmp/difftime.parquet x INT64 NA OPTIONAL NA <NULL> NA NA NA NA
❯ d2 <- arrow::read_parquet("/tmp/difftime.parquet")
❯ d2
# A tibble: 1 × 1
x
<drtn>
1 606 secs
❯ attributes(d2$x)
$class
[1] "difftime"
$units
[1] "secs"
The duration type and its unit are stored in the Arrow schema:
kv <- parquet_metadata("/tmp/difftime.parquet")$file_meta_data$key_value_metadata
parse_arrow_schema(kv[[1]]$value[2])
$columns
# A data frame: 1 × 6
name type_type type nullable dictionary custom_metadata
<chr> <chr> <I<list>> <lgl> <I<list>> <I<list>>
1 x Duration <named list [1]> TRUE <NULL> <named list [2]>
$custom_metadata
# A data frame: 1 × 2
key value
<chr> <chr>
1 r "A\n3\n263168\n197888\n5\nUTF-8\n531\n1\n531\n1\n254\n1026\n1\n262153\n5\nnames\n16\n1\n262153\n1\nx\n254\n1026\n511\n16\…
$endianness
[1] "Little"
$features
character(0)
parse_arrow_schema(kv[[1]]$value[2])$columns$type
[[1]]
[[1]]$unit
[1] "SECOND"
Seems like we also need to write difftime as INT64, because if it is DOUBLE, then Arrow will not read it back as difftime.
Considering that the INTERVAL type is better for difftime anyway, maybe we should just support that, instead of losing information with the INT64 conversion.
Or maybe we can use INT64, but use a different unit, e.g. millis, micros or nanos? And convert between that and secs when reading and writing? If the arrow R package is OK with reading it back into difftime?
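To make the trade-off concrete, here is a small base R sketch of the two INT64 options for the example value used above. It only shows the arithmetic. Whether the arrow R package would read a non-second unit back as difftime is not verified here, and because base R has no 64-bit integer type the counts are held as doubles.

```r
x <- as.difftime(10 + 1/9, units = "mins")
secs <- as.numeric(x, units = "secs")   # 606.6667

# Option 1: whole seconds, as in the Arrow round trip above.
# The fractional part is dropped.
whole_secs <- trunc(secs)               # 606

# Option 2 (hypothetical): store whole milliseconds instead and
# rescale to seconds when reading.
millis <- round(secs * 1000)            # 606667
back <- as.difftime(millis / 1000, units = "secs")

secs - whole_secs                       # about 0.667 secs lost
abs(as.numeric(back, units = "secs") - secs)  # less than 1 ms lost
```

A finer unit only shrinks the rounding error. It does not remove it, and the original units ("mins" here) are still not preserved.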
(We could probably also just skip difftime support for now?)
When reading TIME and TIMESTAMP logical types, we're just going to ignore the UTC field. It is completely implementation-dependent anyway, so it is not clear at all what we should do with it. (In fact, we are already doing this at the time of writing.)
EDIT: Arrow uses the UTC field to decide whether to set the tzone attribute, so we do the same for POSIXct. We don't use it for TIME, though.
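The rule in the EDIT can be sketched in base R as follows. The function `timestamp_to_posixct` and its arguments are hypothetical names chosen for illustration, not the package's actual code. The raw value is assumed to be a count of units since the Unix epoch, held as a double.

```r
# Hypothetical helper: turn a raw TIMESTAMP value into POSIXct and set
# the tzone attribute only when the UTC field of the logical type is TRUE.
timestamp_to_posixct <- function(value,
                                 unit = c("MILLIS", "MICROS", "NANOS"),
                                 is_adjusted_to_utc = FALSE) {
  unit <- match.arg(unit)
  per_second <- c(MILLIS = 1e3, MICROS = 1e6, NANOS = 1e9)[[unit]]
  x <- structure(value / per_second, class = c("POSIXct", "POSIXt"))
  if (is_adjusted_to_utc) attr(x, "tzone") <- "UTC"
  x
}

utc_ts   <- timestamp_to_posixct(1704110400000, "MILLIS", is_adjusted_to_utc = TRUE)
local_ts <- timestamp_to_posixct(1704110400000, "MILLIS", is_adjusted_to_utc = FALSE)

attr(utc_ts, "tzone")    # "UTC"
attr(local_ts, "tzone")  # NULL
```

Both results hold the same number of seconds. Only the printed representation differs, because a POSIXct without a `tzone` attribute prints in the session's local time zone.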
We don't need to support INTERVAL now. Maybe we can support it later, either by:
- converting to hms,
- reading it into a list column of three integers, or
- simply reading the byte array.
None of these are great, though.
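For reference, the Parquet format stores an INTERVAL as a 12-byte fixed-length byte array holding three little-endian unsigned 32-bit integers: months, days, and milliseconds. A rough sketch of the "three integers" option in base R follows. The helper name `decode_interval` is hypothetical.

```r
# Hypothetical helper: split one 12-byte INTERVAL value into its three fields.
decode_interval <- function(bytes) {
  stopifnot(is.raw(bytes), length(bytes) == 12L)
  # readBin() reads 4-byte integers as signed, so fields of 2^31 or more
  # would not be decoded correctly by this sketch.
  fields <- readBin(bytes, what = "integer", n = 3L, size = 4L,
                    endian = "little")
  names(fields) <- c("months", "days", "millis")
  fields
}

# 1 month, 2 days, 1000 milliseconds
raw_value <- as.raw(c(0x01, 0x00, 0x00, 0x00,
                      0x02, 0x00, 0x00, 0x00,
                      0xe8, 0x03, 0x00, 0x00))
res <- decode_interval(raw_value)
res
```

This also shows why converting to hms is awkward: months and days have no fixed length in seconds, so collapsing the three fields into a single time value needs an arbitrary convention.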