tidyverse/googlesheets4

Consider a readr-style "problems" report on parsing failures

akgold opened this issue · 5 comments

On read, a column is guessed to be logical after some number (maybe 1,000?) of rows of missing data. If a character value appears below that point, the read silently turns it into NA, with no error or warning.

my_str <- "My Str"
empty_sheet_url <- "<url>"
tbl <- tibble::tibble(i = 1:2000, str = c(rep("", 1999), my_str))
googlesheets4::write_sheet(tbl, empty_sheet_url)

df <- googlesheets4::read_sheet(empty_sheet_url, "tbl")
testthat::expect_type(df$str, "character") # fails: df$str is logical
testthat::expect_equal(df$str, c(rep("", 1999), my_str)) # fails: df$str is all NA

I'd say this is all working as documented. This is how googlesheets4, readxl, and readr have always worked: there's a guess_max parameter that controls how many rows we look at when guessing column type.

(It's possible that round tripping empty strings should be easier? But that's a separate matter.)

library(googlesheets4)
library(googledrive)
library(tidyverse)

# hidden auth chunk here
tbl <- tibble(i = 1:2000, str = c(rep("", 1999), "blah"))
ss <- gs4_create(sheets = tbl)
#> ✓ Creating new Sheet: 'compulsory-fanworms'

# saw some weird errors, so let's slow things down
Sys.sleep(2)

dat <- read_sheet(ss, "tbl")
#> ✓ Reading from 'compulsory-fanworms'
#> ✓ Range ''tbl''
dat
#> # A tibble: 2,000 x 2
#>        i str  
#>    <dbl> <lgl>
#>  1     1 NA   
#>  2     2 NA   
#>  3     3 NA   
#>  4     4 NA   
#>  5     5 NA   
#>  6     6 NA   
#>  7     7 NA   
#>  8     8 NA   
#>  9     9 NA   
#> 10    10 NA   
#> # … with 1,990 more rows
tail(dat)
#> # A tibble: 6 x 2
#>       i str  
#>   <dbl> <lgl>
#> 1  1995 NA   
#> 2  1996 NA   
#> 3  1997 NA   
#> 4  1998 NA   
#> 5  1999 NA   
#> 6  2000 NA

The guess_max argument is important:

guess_max: Maximum number of data rows to use for guessing column types.

Its default is 1000 (or n_max, if that’s smaller):

read_sheet(ss, ..., guess_max = min(1000, n_max))

This is also how readxl and readr work.
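
For reference, the same escape hatch exists in all three packages. A minimal sketch (the file paths here are hypothetical; readxl may warn and cap a very large guess_max, but the effect is the same):

# Use all rows when guessing column types (paths are hypothetical)
readxl::read_excel("my_data.xlsx", guess_max = Inf)
readr::read_csv("my_data.csv", guess_max = Inf)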

dat <- read_sheet(ss, "tbl", guess_max = Inf)
#> ✓ Reading from 'compulsory-fanworms'
#> ✓ Range ''tbl''
dat
#> # A tibble: 2,000 x 2
#>        i str  
#>    <dbl> <chr>
#>  1     1 <NA> 
#>  2     2 <NA> 
#>  3     3 <NA> 
#>  4     4 <NA> 
#>  5     5 <NA> 
#>  6     6 <NA> 
#>  7     7 <NA> 
#>  8     8 <NA> 
#>  9     9 <NA> 
#> 10    10 <NA> 
#> # … with 1,990 more rows
tail(dat)
#> # A tibble: 6 x 2
#>       i str  
#>   <dbl> <chr>
#> 1  1995 <NA> 
#> 2  1996 <NA> 
#> 3  1997 <NA> 
#> 4  1998 <NA> 
#> 5  1999 <NA> 
#> 6  2000 blah

is.character(dat$str)
#> [1] TRUE

An absence of cell data is brought in as NA, but we can use
tidyr::replace_na() to bring back explicit empty strings, if desired.

dat <- dat %>% 
  replace_na(list(str = ""))

identical(tbl$str, dat$str)
#> [1] TRUE

drive_rm(ss)
#> File deleted:
#> • 'compulsory-fanworms' <id: 1Mu0JrRh0ZASocbL0SosFeCYvaC-4_ECCoRmkkixJfTU>

Created on 2021-07-11 by the reprex package (v2.0.0.9000)

That totally makes sense. The thing that made this difficult to puzzle out was the silence of this potentially destructive behavior: I wasn't aware of guess_max, and no error or warning alerted me to what was going on. My "solution" was to put a junk character in the first row so the type would be guessed correctly.

Maybe there's no easy way to tell when this occurs, but something like "First non-blank value in col str found after guess_max, coercing to logical." would've been super helpful.
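
In the meantime, a user-land approximation of that check might look like this (a sketch; suspect_cols is a made-up helper, not part of googlesheets4):

# All-NA logical columns are the usual symptom of guess_max having
# only seen empty cells before the type guess was locked in.
suspect_cols <- function(x) {
  is_all_na_lgl <- vapply(x, function(col) is.logical(col) && all(is.na(col)), logical(1))
  names(x)[is_all_na_lgl]
}
suspect_cols(df)
#> [1] "str"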

In readr this would show up in the "problems" report. I'll reopen this.

readr::read_csv("x,y\na,\nc,d", guess_max = 1)
#> Warning: 1 parsing failure.
#> row col           expected actual         file
#>   2   y 1/0/T/F/TRUE/FALSE      d literal data
#> # A tibble: 2 x 2
#>   x     y    
#>   <chr> <lgl>
#> 1 a     NA   
#> 2 c     NA

Created on 2021-07-12 by the reprex package (v2.0.0.9000)
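
readr also attaches this report to the returned tibble, so it can be pulled back out programmatically after the fact:

# Retrieve the parsing problems recorded during the read
df <- readr::read_csv("x,y\na,\nc,d", guess_max = 1)
readr::problems(df)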

My "solution" was to put a junk character in the first row so it would guess correctly.

Just as an FYI for all of these packages: I think it's more common to use guess_max = Inf, either as a policy (if one works with datasets whose size doesn't make that prohibitive) or during the initial "getting to know you" intake of a dataset. Either is definitely better than editing the data itself. The real gold standard is to provide column types; I would specify that a column is character long before I added junk data to influence guessing.
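
For this example, declaring the types up front could look like this (a minimal sketch; "d" and "c" are the readr-style shortcodes for double and character that col_types accepts):

# No guessing: column i is declared double, column str character
dat <- read_sheet(ss, "tbl", col_types = "dc")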

This will presumably get bundled up with the Great Col Spec Project that is coming soon for me (googlesheets4 & readxl) #51.