tidyverse/readxl

read_excel() converts date to number

xinyongtian opened this issue · 3 comments

test1.xlsx
image

I try to covert test1.xlsx to .csv file by using read_excel(), then write.csv()
The excel file test1.xlsx contains date column (DOB) with some typo. After read, the correct DOB date become number (in character type). The read_excel() should at least keep DOB in original text format. If I use MS excel to save as .csv the date format is preserved.

library(readxl)
#> Warning: package 'readxl' was built under R version 4.1.3
df=read_excel('c:/test1.xlsx')
print(df)
#> # A tibble: 2 x 2
#>   NAME  DOB     
#>   <chr> <chr>   
#> 1 Jack  17995   
#> 2 Tom   12/37/38
write.csv(df,'test1.csv',row.names = F)
print(read.csv('test1.csv'))
#>   NAME      DOB
#> 1 Jack    17995
#> 2  Tom 12/37/38

I will attempt to answer this question, having done my best to recreate your circumstances. This is not ideal because I don't have a copy of your Excel file, and most importantly, I don't know the Excel data types of cells B2 and B3. Whether they are General, Date, or Text affects what data type they are imported into R as by read_excel(). Any of the above that you can provide will help me give a more accurate explanation of your output.

I created an imitation of your file by typing in the values that I see above by hand. I observed the following, and am making the assumption that these things are true for your file as well:

  • Cell B2 was interpreted by Excel as a valid date (April 7, 1949) and was automatically assigned the Date type
  • Cell B3 could not be interpreted by Excel as anything other than just text, since it does not represent a valid date or numeric value, and so was kept as the default General type

Moving now to what happens in R: Because column B contains at least one cell that cannot be interpreted as anything other than text (cell B3), read_excel() guessed that the entire column should be treated as character (text) data. This behavior avoids data loss; if it had guessed differently and applied a date type, cell B3 would have to be NA (a blank/missing value), and information that you had before would not have made it into your data.frame.

Cell B2 is imported as "17995" because that is what is actually stored in the Excel spreadsheet. When Excel interpreted what you typed in as a date, it implicitly converted it to the number 17995, which is the number of days between December 30, 1899 and the date entered (this is Microsoft's convention for storing date data). With the Date type applied, Excel knows to display the value as a date, while still storing the number "under the hood".

This article from the readxl site is highly relevant and is my go-to reference for understanding why my Excel data gets imported the way it does: https://readxl.tidyverse.org/articles/cell-and-column-types.html

Here is my personal approach for ensuring that date data from Excel spreadsheets ends up as the intended date values in R:

library(readxl)
library(dplyr)
library(stringr)
library(lubridate)

# a function that takes a character vector that may contain dates in various formats, and attempts to convert each format to a date value appropriately
convert_excel_dates <-
    function(x){
        case_when(
            str_detect(x, "^[0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}$") ~ mdy(x),   # handles values imported as text values in the format "MM/DD/YYYY"
            str_detect(x, "^[0-9]{5}$")                         ~ x |> as.integer() |> as.Date(origin = as.Date("1899-12-30")),  # handles values imported as numbers expressed as days since 1899-12-30 (Microsoft's convention)
            TRUE                                                ~ NA_Date_  # default case, no applicable date format, returns a missing date value
        )
    }


read_excel("test1.xlsx", col_types = "text") |>  # instruct read_excel() to simply import all columns as text, so that I can make tricky type conversions myself
    mutate(DOB_Converted_to_Date = convert_excel_dates(DOB))
#> # A tibble: 3 × 3
#>   NAME   DOB      DOB_Converted_to_Date  
#>   <chr>  <chr>    <date>    
#> 1 Jack   17995    1949-04-07
#> 2 Tom    12/37/38 NA        
#> 3 George 02/17/80 1980-02-17

Handling of additional date formats can be added to case_when() as needed. I added one more row for this example to show that a value that is imported as text but still follows the "MM/DD/YY" format representing a valid date still gets converted correctly. This can happen if someone types in a valid date like "02/17/80" when the cell's Excel data type is set to Text, and Excel makes no attempt to interpret the value as a Date or number.