sfirke/janitor

excel_numeric_to_date gives NA output for times in daylight savings time gap

Geomorph2 opened this issue · 7 comments

The function excel_numeric_to_date seems to work fine with most input, but fails on others when include_time is specified.

Example:

excel_numeric_to_date(43170.08, include_time=TRUE)
[1] "2018-03-11 01:55:12 PST"
excel_numeric_to_date(43170.09, include_time=TRUE)
[1] NA
excel_numeric_to_date(43170.10, include_time=TRUE)
[1] NA
excel_numeric_to_date(43170.11, include_time=TRUE)
[1] NA
excel_numeric_to_date(43170.12, include_time=TRUE)
[1] NA
excel_numeric_to_date(43170.13, include_time=TRUE)
[1] "2018-03-11 03:07:12 PDT"

Most of the values where the function fails seem to end in .09 through 0.11 range.

The problem here is that 2am-3am does not exist in the PDT time zone due to daylight savings time.

> excel_numeric_to_date(43170.09, include_time=TRUE)
[1] NA
> excel_numeric_to_date(43170.09, include_time=TRUE, tz="UTC")
[1] "2018-03-11 02:09:36 UTC"

That said, it is a bug assuming that the time does exist in Excel. I'll have to look into Excel daylight savings time handling.

As I've looked into it a bit more, I'm not sure what a good solution is, and I would welcome input. Excel seems to treat all date/time values as though they are like UTC (without daylight savings time), but I don't think that forcing the excel_numeric_to_date() results to be UTC is the best solution.

My first thought is to give a warning if an NA is produced, and warn the user that the time may be during the daylight savings time break, so they should consider another time zone (such as UTC).

Thoughts?

Thanks Bill! Changing the time zone to UTC worked.

My guess is that if people are converting Excel dates to real dates, they will expect the function to give the same output as it would in Excel. I was handed this dataset and my colleagues expected me to examine it in Excel like they were doing. Thus, consistency with Excel is probably an overriding constraint.

Thank you for the feedback. This is complicated because some users will expect the dates to work as in Excel and some would expect the dates to work as in R. I don't know what is more common. I try to be conservative in changing behavior of functions, and this would alter the existing behavior if we changed the output.

To avoid changing the behavior of the function, I will add the warning if values are changed to NA that you may want to consider UTC. Excel dates and times are notoriously problematic (no time zone consideration as is an issue here; 1900 is considered a leap year [https://en.wikipedia.org/wiki/Leap_year_problem]), and while I want to try to work around as many of those as possible, the subtlety of this one makes it hard.

@almartin82, I like the idea of the ambiguous argument (documented here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tz_localize.html). It looks like ambiguous handles the other side of the issue-- when two dates exist at a time due to daylight savings time (which should also be handled) rather than this where no time exists which is handled by nonexistent.

For reference, lubridate seems to assume UTC in its dates without time zones.

Closed by #422, see more discussion there