mgirlich/tibblify

tib_df and empty array


krlmlr commented

I'm seeing weird references to "colmajor" when an empty JSON array [] is parsed by a tib_df(). What am I doing wrong?

CC @TSchiefer.

library(tibblify)

json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [] }]'
nested_list <- jsonlite::fromJSON(json)

spec <- tibblify::guess_tspec(nested_list)
spec
#> tspec_df(
#>   tib_int("a"),
#>   tib_df(
#>     "b",
#>     tib_int("c", required = FALSE),
#>     tib_int("d", required = FALSE),
#>   ),
#> )
tibblify::tibblify(nested_list, spec)
#> Error in `tibblify::tibblify()`:
#> ! Problem while tibblifying `x$b[[2]]$c`
#> Caused by error in `withCallingHandlers()`:
#> ! Field is absent in colmajor.
#> ℹ In file 'add-value.c' at line 395.
#> ℹ This is an internal error that was detected in the base package.
#> Backtrace:
#>     ▆
#>  1. ├─tibblify::tibblify(nested_list, spec)
#>  2. │ └─rlang::try_fetch(...)
#>  3. │   ├─base::tryCatch(...)
#>  4. │   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>  5. │   │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>  6. │   │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>  7. │   └─base::withCallingHandlers(...)
#>  8. └─rlang:::stop_internal_c_lib(...)
#>  9.   └─rlang::abort(message, call = call, .internal = TRUE, .frame = frame)

json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [{ "c": 1 }] }]'
nested_list <- jsonlite::fromJSON(json)

spec <- tibblify::guess_tspec(nested_list)
spec
#> tspec_df(
#>   tib_int("a"),
#>   tib_df(
#>     "b",
#>     tib_int("c"),
#>     tib_int("d", required = FALSE),
#>   ),
#> )
tibblify::tibblify(nested_list, spec)
#> Error in `tibblify::tibblify()`:
#> ! Field d is required but does not exist in `x$b[[2]]`.
#> ℹ For `.input_form = "colmajor"` every field is required.
#> Backtrace:
#>      ▆
#>   1. ├─tibblify::tibblify(nested_list, spec)
#>   2. │ └─rlang::try_fetch(...)
#>   3. │   ├─base::tryCatch(...)
#>   4. │   │ └─base (local) tryCatchList(expr, classes, parentenv, handlers)
#>   5. │   │   └─base (local) tryCatchOne(expr, names, parentenv, handlers[[1L]])
#>   6. │   │     └─base (local) doTryCatch(return(expr), name, parentenv, handler)
#>   7. │   └─base::withCallingHandlers(...)
#>   8. └─tibblify:::stop_required_colmajor(`<named list>`)
#>   9.   └─tibblify:::tibblify_abort(msg)
#>  10.     └─cli::cli_abort(..., class = "tibblify_error", .envir = .envir)
#>  11.       └─rlang::abort(...)

json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": null }]'
nested_list <- jsonlite::fromJSON(json)

spec <- tibblify::guess_tspec(nested_list)
spec
#> tspec_df(
#>   tib_int("a"),
#>   tib_df(
#>     "b",
#>     tib_int("c", required = FALSE),
#>     tib_int("d", required = FALSE),
#>   ),
#> )
tibblify::tibblify(nested_list, spec)
#> # A tibble: 2 × 2
#>       a                  b
#>   <int> <list<tibble[,2]>>
#> 1     1            [2 × 2]
#> 2     2

Created on 2023-04-17 with reprex v2.0.2

This happens because tibblify uses the colmajor code path when the input is a data frame, and jsonlite::fromJSON() simplifies arrays of objects to data frames by default. The error message is indeed quite confusing in that case. Regarding the errors themselves:

  1. Empty tibble
json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [] }]'
nested_list <- tibble::as_tibble(jsonlite::fromJSON(json))
nested_list
#> # A tibble: 2 × 2
#>       a b           
#>   <int> <list>      
#> 1     1 <df [2 × 2]>
#> 2     2 <df [0 × 0]>

Created on 2023-07-07 with reprex v2.0.2

In the colmajor format (and therefore in data frames) all columns are required, and the empty data frame has neither c nor d. So to me it kind of makes sense to error here, but the message is quite confusing.

  2. No column d
json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [{ "c": 1 }] }]'
nested_list <- tibble::as_tibble(jsonlite::fromJSON(json))
nested_list
#> # A tibble: 2 × 2
#>       a b           
#>   <int> <list>      
#> 1     1 <df [2 × 2]>
#> 2     2 <df [1 × 1]>


Basically the same case as before: the data frame in the second row has no column d, and every column is required in colmajor input.

  3. NULL
json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": null }]'
nested_list <- tibble::as_tibble(jsonlite::fromJSON(json))
nested_list
#> # A tibble: 2 × 2
#>       a b           
#>   <int> <list>      
#> 1     1 <df [2 × 2]>
#> 2     2 <NULL>


This works because NULL gets special treatment as the missing value of a list.

It is still a bit annoying, though, that all three examples work with the same spec when the JSON is parsed with simplifyDataFrame = FALSE.
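To make the difference concrete, here is a minimal sketch that only inspects what jsonlite returns for the first example; it does not call tibblify itself. The variable names simplified and by_row are made up for illustration. With the default settings the result is a data frame, which is what sends tibblify down the colmajor path; with simplifyDataFrame = FALSE the objects stay as plain lists.

```r
json <- '[{ "a": 1, "b": [{ "c": 1, "d": 2 }, {}] }, { "a": 2, "b": [] }]'

# Default: arrays of objects are simplified to data frames
simplified <- jsonlite::fromJSON(json)
is.data.frame(simplified)

# Keep objects as named lists and arrays of objects as unnamed lists
by_row <- jsonlite::fromJSON(json, simplifyDataFrame = FALSE)
is.data.frame(by_row)

# Inspect the nested structure: the empty array is just an empty list here
str(by_row)
```

The by_row list can then be passed to guess_tspec() and tibblify() in place of nested_list, as in the reprexes above.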