SymbolixAU/jsonify

to_ndjson


Implement to_ndjson(), where each row / column of the input becomes its own JSON object on its own line.


  • data.frame by-row & by-column
  • list elements
  • matrix by-row & by-column
  • remove trailing '\n'

I found this issue stub while researching jsonify to create ndjson for Google BigQuery inserts. If it's helpful to have another use-case from the wild in mind, here's the BigQuery edition. Most of this is from the documentation here.

(I have some example data but not a full reprex below. If followup seems useful to you I'll happily generate more detail.)

BigQuery efficiently ingests nested/repeated data when it's structured as a single JSON object per newline-delimited row, without wrapping the row objects in an array. Each row in an R dataframe (with arbitrary list columns) becomes a single line wrapped in {...} rather than [{...}]. The example data from the BigQuery docs looks like this:

{"id":"1","first_name":"John","last_name":"Doe","dob":"1968-01-22","addresses":[{"status":"current","address":"123 First Avenue","city":"Seattle","state":"WA","zip":"11111","numberOfYears":"1"},{"status":"previous","address":"456 Main Street","city":"Portland","state":"OR","zip":"22222","numberOfYears":"5"}]}
{"id":"2","first_name":"Jane","last_name":"Doe","dob":"1980-10-16","addresses":[{"status":"current","address":"789 Any Avenue","city":"New York","state":"NY","zip":"33333","numberOfYears":"2"},{"status":"previous","address":"321 Main Street","city":"Hoboken","state":"NJ","zip":"44444","numberOfYears":"3"}]}

Right now I'm making this work in an ugly way with data.table, jsonify, and substr, like this:

library(data.table)

df_split <- split(setDT(df_raw, key = "unique_row_index"), by = "unique_row_index")
json_vec <- lapply(df_split, FUN = function(x) {
  y <- jsonify::to_json(x, unbox = TRUE, by = "row")
  substr(y, 2L, nchar(y) - 1L)   # strip the surrounding [ ]
})
fwrite(json_vec, quote = FALSE, sep = "\n")

IMHO the best outcome for the BigQuery problem would be an option in to_ndjson that takes a dataframe and spits back a list or unescaped character vector of JSON elements.
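
Something along these lines captures the shape I'm after (a hypothetical helper, not jsonify API -- to_ndjson_rows is made up; df_raw is the same dataframe as above):

# hypothetical helper sketching the desired output: one unboxed JSON object
# per row, with no surrounding array brackets
to_ndjson_rows <- function(df) {
  json <- vapply(
    seq_len(nrow(df)),
    function(i) as.character(jsonify::to_json(df[i, , drop = FALSE], unbox = TRUE, by = "row")),
    character(1)
  )
  substr(json, 2L, nchar(json) - 1L)  # strip the outer [ ] from each row
}

writeLines(to_ndjson_rows(df_raw), "rows.ndjson")  # one object per line, ready to load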

FWIW jsonify has already saved me a ton of time on this application, so thank you! All yet-another-json-thingy self-deprecation aside, it's a great package.

Thanks for the use-case. It's useful to see how this would be used. I haven't used it myself, but are you aware of the bigrquery library?

Oh yeah, bigrquery is great for using BigQuery from an R client once there's a DB up and running. The issues I'm confronting now are mostly around ingestion pipelines. It seems like many of BigQuery's optimizations rely on tables with nested and repeated field schemas that amount to pre-joining data (which was pretty unfamiliar to me, coming from Postgres land). BQ can ingest nested (not just flat row-structured) data from ndjson, Parquet, and Avro files, but most of my data pipelines are in R and ndjson is the friendliest solution.

Beyond the BigQuery case I'm seeing a lot more ndjson applications in the wild on larger datasets. A few data providers I work with are pushing it heavily for large spatial cases. E.g. R hits the 2^31 character vector length limits pretty quickly on big geojson files, and I suspect even geojsonsf and other heavily-optimized packages couldn't really work around that. Newline-delimited also means you can use fast CLIs like split to batch-size files to taste without any headaches. I'm becoming a big fan of the nd-format vs regular (geo)json.

Funny you should mention it, but I recently updated geojsonsf to handle larger character vectors in this issue.

But I agree on the usefulness of ndjson and hope to implement this soon.

First proof-of-concept

df <- data.frame( x = 1:5, y = letters[1:5] )
cat( jsonify:::rcpp_to_ndjson( df ) )
{"x":1,"y":"a"}
{"x":2,"y":"b"}
{"x":3,"y":"c"}
{"x":4,"y":"d"}
{"x":5,"y":"e"}
N <- 1e6
data <- data.frame(
  longitude = runif(N, -180, 180)
  , latitude = runif(N, -180, 180)
  , tooltip = sample(LETTERS, N, replace = TRUE)
  , stringsAsFactors = FALSE
)

microbenchmark::microbenchmark(
  json = {
    res <- jsonify:::rcpp_to_json( data )
  },
  ndjson = {
    res <- jsonify:::rcpp_to_ndjson( data )
  },
  times = 5
)

# Unit: seconds
# expr      min       lq     mean   median       uq      max neval
# json 1.314854 1.318062 1.351673 1.336339 1.372355 1.416754     5
# ndjson 1.480732 1.481594 1.502879 1.483283 1.485288 1.583497     5

res <- jsonify:::rcpp_to_ndjson( data )
str( res )
# chr "{\"longitude\":18.95450374111533,\"latitude\":-76.69491718523205,\"tooltip\":\"X\"}\n{\"longitude\":138.2996058"| __truncated__

cat( substr( res, 1, 500 ))
# {"longitude":18.95450374111533,"latitude":-76.69491718523205,"tooltip":"X"}
# {"longitude":138.2996058370918,"latitude":96.73179904930294,"tooltip":"K"}
# {"longitude":150.90493150055409,"latitude":41.85694866813719,"tooltip":"B"}
# {"longitude":52.576808631420139,"latitude":120.26858397759497,"tooltip":"M"}
# {"longitude":72.25942776538432,"latitude":-58.31666300073266,"tooltip":"W"}
# {"longitude":-35.5781511683017,"latitude":-87.4365452863276,"tooltip":"N"}
# {"longitude":-138.20940038189293,"latitude":3

data.frame by column

df <- data.frame(
  x = 1:5
  , y = letters[1:5]
)

cat( jsonify:::rcpp_to_ndjson(df, FALSE, -1L, TRUE, TRUE, "column") )
# {"x":[1,2,3,4,5]}
# {"y":["a","b","c","d","e"]}

I'm testing this out now, with remotes::install_github("SymbolixAU/jsonify", ref = "issue29"). So far everything's working great! Even on some large deeply nested dataframes, dfs with geojson embedded as strings, mixed types, etc., speed and memory consumption are both really solid.

I'm using some live data from our BigQuery pipe to test right now -- I'll try to sanitize it or generate some comparable noise and post the results this weekend.

@sheffe what do you think should be the result of to_ndjson() for nested lists:

lst <- list(
  x = 1:5
  , y = list(
    a = letters[1:5]
    , b = data.frame(i = 10:15, j = 20:25)
  )
)

lst

# $x
# [1] 1 2 3 4 5
# 
# $y
# $y$a
# [1] "a" "b" "c" "d" "e"
# 
# $y$b
#    i  j
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25

should this return 2 lines for just x and y, or should the nested y be split into another two lines for a and b?

Interesting question. The short answer is I think this should throw an error or at least a warning -- reasoning below.

My use-cases are pretty narrow but I think the principle is pretty general. It's hard (and I've never seen it supported) for BigQuery loaders or other tools I've encountered to parallel-load a single big JSON object, even when that object is an array of many semantically identical elements (e.g. they all describe rows, columns, or some constant nested object structure). It's trivially possible to parallelize ndjson that way by chunking the file into N-line batches and pushing them through whatever pipe is next. It doesn't need to be a DB loading pipe; it could just be sed/awk or some validation logic or whatever. Beyond parallelization, converting differently-sized JSONs to ndjson and then splitting/merging files helps create predictable chunk sizes around memory bottlenecks. Again that boils down to 'some downstream pipe requires more flexible chunking of inputs.' In the past I've used things like geojsplit or jq for 're-batching' JSONs into similar pieces, but it's a pain compared to using ndjson in the first place.

With that in mind... I can pretty much only see a reason to use ndjson for a list type where the resulting JSON lines are similar pieces, and the motivation probably always boils down to caring about flexible chunking downstream. With dataframes, dfs containing list columns, matrices, etc., the by-row or by-column translation to ndjson neatly enforces that similarity. If an input list could be converted into a dataframe -- e.g. it was a dataframe with as.list called on it -- fine; if some form of tabularizing/as.data.frame-ing would fail, then I suspect the result wouldn't make sense as ndjson.
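
To make that distinction concrete (just base-R checks for illustration, nothing I'd expect jsonify itself to do):

# a list that round-trips from a data.frame tabularises cleanly...
ok <- as.list(data.frame(x = 1:3, y = letters[1:3], stringsAsFactors = FALSE))
as.data.frame(ok)        # works -> the rows are "similar pieces", ndjson makes sense

# ...while a ragged nested list does not
bad <- list(x = 1:3, y = list(a = "nested", b = 1:5))
try(as.data.frame(bad))  # errors -> the resulting lines wouldn't be comparable chunks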

The only use case I can come up with is exploratory analysis -- "I have no idea what's in my nested list, so I want to break it apart into pieces for easier grepping of components". But in that case, I usually try list flattening first at various max levels of nesting, so I get a long two-column dataframe of key/value pairs, and then write that by row.
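
A rough base-R sketch of that flattening fallback (again illustrative, not something I'd expect from jsonify):

# unlist() collapses arbitrary nesting into dotted names, giving a long
# key/value frame that can then be written out one pair per line
flatten_kv <- function(lst) {
  flat <- unlist(lst, recursive = TRUE, use.names = TRUE)
  data.frame(key = names(flat), value = unname(flat), stringsAsFactors = FALSE)
}

kv <- flatten_kv(list(x = 1:2, y = list(a = "p", b = "q")))
kv$key   # "x1"  "x2"  "y.a"  "y.b"
writeLines(jsonify::to_ndjson(kv, unbox = TRUE), "flat.ndjson")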

library(tidyverse)
library(jsonify)

set.seed(1234)

long_df <- tidyr::expand_grid(
  ABC = LETTERS,
  xyz = letters,
  num = 1:1000
) %>%
  dplyr::mutate(ftr1 = runif(n()),
                ftr2 = rpois(n(), 100))

nested_frame_level1 <- long_df %>%
  tidyr::nest(nest_level1 = c(num, ftr1, ftr2))

nested_frame_level2 <- nested_frame_level1 %>%
  tidyr::nest(nest_level2 = c(xyz, nest_level1))

writeLines(jsonify::to_ndjson(nested_frame_level1),
           "nestlvl1.jsonl")

writeLines(jsonify::to_ndjson(nested_frame_level2),
           "nestlvl2.jsonl")

I am starting to generate a lot of structures for BigQuery that look like nested_frame_level2 -- it amounts to embedding pre-joined data in nested/repeated fields, because data storage rounds to free and it speeds up queries by dramatically reducing total row counts. But even though it's a pretty nasty nested list, this sort of structure always comes back to 'data gets stuffed into a DB by row with arbitrary sub-pieces...'

On a slightly separate note -- I'm seeing equivalent speeds regardless of nesting complexity, and that's a wild improvement on where I've been so far in homebrewing my own!

microbenchmark::microbenchmark(
  to_ndjson(nested_frame_level1, unbox = TRUE),
  to_ndjson(nested_frame_level2, unbox = TRUE),
  times = 10
)

(screenshot: microbenchmark results showing equivalent timings for both calls)

Thanks for the input.
I've decided I'm going to allow lists, non-recursively. Each of the top-level elements will be converted to a line of JSON.

My reasoning is for cases where you legitimately have a list object of repeated elements, such as an sfc object, which can't easily be coerced into a data.frame:

df <- data.frame(
  ml_id = c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2)
  , l_id = c(1,1,1,2,2,2,3,3,3,1,1,1,2,2,2)
  , x = rnorm(15)
  , y = rnorm(15)
  , z = rnorm(15)
  , m = rnorm(15)
)


sfc <- sfheaders::sfc_polygon(
  obj = df
  , x = "x"
  , y = "y"
  , z = "z"
  , m = "m"
  , polygon_id = "ml_id"
  )

as.data.frame( sfc )
# Error in as.data.frame.default(sfc) : 
#  cannot coerce class ‘c("sfc_POLYGON", "sfc")’ to a data.frame

to_ndjson( sfc, digits = 2 )
# [[[-0.84,-0.91,-0.24,1.63],[0.15,1.72,-0.63,-1.44],[-1.35,-1.36,-0.54,-1.42],[-0.09,0.75,-1.45,0.23],[0.76,1.9,0.11,-0.18],[-0.22,-0.96,-0.63,-2.26],[0.68,-1.8,0.65,0.61],[-0.29,1.27,-0.7,0.39],[0.84,0.07,-1.72,-0.33],[-0.84,-0.91,-0.24,1.63]]]
# [[[-0.81,0.28,1.47,-1.04],[-1.93,-1.97,0.62,-1.3],[0.05,0.54,0.31,1.3],[0.75,-0.81,1.07,-1.15],[0.19,0.96,0.34,1.85],[0.29,0.45,1.11,-0.89],[-0.81,0.28,1.47,-1.04]]]

I've decided I'm going to allow lists, non-recursively. Each of the top-level elements will be converted to a line of JSON

Yep, I like this decision a lot. Forcing non-recursive will absolutely stop me from dumb mistakes some day.

I've now got #58 working over there if you want to give it a try?

Just ran into this on to_ndjson in a production pipe:
Error in rcpp_to_ndjson(x, unbox, digits, numeric_dates, factors_as_string, : R character strings are limited to 2^31-1 bytes
This was an extremely large write and (remarkably) the first issue I've had in ~6 TB worth of JSON writes since I installed this branch, across lots of different data types/sizes/nesting structures/etc. So far it's all just working.

As the error suggests, this is a limit on R strings, so writing a very large JSON to a single string doesn't work. We'll need to make some sort of stream output (like jsonlite has) to write the JSON line-by-line to a connection.
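
In the meantime, a chunked workaround along these lines (just a sketch, not the eventual streaming API) should keep each individual string under the limit:

# convert the data in row blocks so no single R string exceeds 2^31 - 1 bytes,
# appending each block of ndjson lines to a file connection
stream_ndjson <- function(df, path, chunk_size = 1e5L) {
  con <- file(path, open = "w")
  on.exit(close(con))
  for (s in seq(1L, nrow(df), by = chunk_size)) {
    e <- min(s + chunk_size - 1L, nrow(df))
    writeLines(as.character(jsonify::to_ndjson(df[s:e, , drop = FALSE])), con)
  }
  invisible(path)
}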

I made this comment on the state of these branches and where some of the confusion lies.

Basically, branch issue58 is ahead of issue29, where I've cleaned the final trailing \n.

List elements need to keep their names

lst <- list(x = 1)
to_json( lst )
# {"x":[1.0]}
to_ndjson( lst )
# [1.0]
## this should be {"x":[1.0]}

So this now gives

lst <- list(
  x = 1:5
  , y = list(
    a = letters[1:5]
    , b = data.frame(i = 10:15, j = 20:25)
  )
)

to_ndjson( lst )
{"x":[1,2,3,4,5]}
{"y":{"a":["a","b","c","d","e"],"b":[{"i":10,"j":20},{"i":11,"j":21},{"i":12,"j":22},{"i":13,"j":23},{"i":14,"j":24},{"i":15,"j":25}]}}

which is slightly different to before, but I think this is right as it maintains list element names.