eddelbuettel/rprotobuf

Read/write data frame (tibble) with a proto file (not rexp.proto)?

josiekre opened this issue · 8 comments

Reading "RProtoBuf: Efficient Cross-Language Data Serialization in R", I'm having a hard time understanding the specifics. I see that an arbitrary tibble can be serialized. It can then be read back in.

msg <- RProtoBuf::serialize_pb(dplyr::as_tibble(iris), NULL)
identical(dplyr::as_tibble(iris), RProtoBuf::unserialize_pb(msg))

But how do we get from a binary file with an associated proto to a tibble? For example,

message <- tutorial.Person$read(tf1)  # tf1: a file holding a serialized Person
...
"tbl_df" %in% is(message)             # FALSE -- we get a Message, not a tibble

What we want to do is operate in R on tabular data (with dplyr functions) using a predefined schema. We have a tech stack that uses R, Python, and Java. So far we've been passing around CSV or JSON files, but those give us no way to enforce specific data types, and that is causing a lot of problems for us (exactly the problem outlined in the paper referenced above). What I do not understand from the article is how to work with data frames based on a predefined .proto file.

For example, we might have a simple message like this that represents some altered US Census data we want to read into R or Python or Java...

package blahblah;

message household { 
  required string puma = 1; 
  required int32 np = 2; 
  required string serialno = 3; 
  required double hhinc = 4; 
  required double wgtp = 5; 
  required double income = 6; 
  required int32 veh = 7; 
  required int32 st = 8; 
} 
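
For reference, in R this schema would first be loaded with readProtoFiles() (the file name below is just a placeholder):

# Make blahblah.household available as a message descriptor in R
RProtoBuf::readProtoFiles("household.proto")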

How do we get a binary file holding this tabular data into a data frame in R?

There's the manual way of writing a method like this for each message:

as.tibble.household.Message <- function(x) {
  dplyr::data_frame(
    puma = x$puma,
    np = x$np,
    serialno = x$serialno,
    hhinc = x$hhinc,
    wgtp = x$wgtp,
    income = x$income,
    veh = x$veh,
    st = x$st
  )
}

But is there a way to generalize this?

Let's slow down and do this one message at a time.

The fundamental issue here (as well as with, say, database accessors) is that you have to think of a sequence of such messages. Those are (conceptually) many rows. So you are stacking rows.

But a data.frame (and derived types) is really a list of column vectors. So you have to loop over your rows, and for each row i loop over all its fields j and stick each value into cell (i, j) of the data.frame -- if your types are right.

So you just have to write this looping / unwinding / repopulating code. There is no obvious way to automate this. Whether the data comes from a file or the network does not matter. Each chunk is one 'blob' of household (say) and you have to put the i-th chunk in row i of your data.frame container.
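
In code, the loop might look something like this -- a minimal sketch, assuming the rows are wrapped in a container message such as message households { repeated household hh = 1; } (the wrapper and the file names are made up):

library(RProtoBuf)
readProtoFiles("household.proto")   # load the schema first

# One 'blob' per row: the repeated field yields a list of household messages
msgs <- blahblah.households$read("households.bin")$hh

# Stack rows: as.list() on a flat message gives one row's worth of fields
df <- do.call(rbind, lapply(msgs, function(m) as.data.frame(as.list(m))))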

Makes sense?

I've made some progress. Let's assume we create an address book like this:

> a <- tutorial.AddressBook$new(
    person = c(
      tutorial.Person$new(name = "Sue"), 
      tutorial.Person$new(name = "Bob")
    )
  )

> a$person
[[1]]
message of type 'tutorial.Person' with 1 field set

[[2]]
message of type 'tutorial.Person' with 1 field set

In theory, if we have a series of flat messages like the household example in my .proto above, we could create a data frame like this:

> dplyr::bind_rows(lapply(a$person, as.list))

However, with Person we have list columns (the repeated phone field) that would require some purrr::map (I think). To simplify for now, I'll drop those and then build the data frame; a list-column sketch follows the output below:

> x <- lapply(a$person, function(i) {
    y <- as.list(i)
    y$phone <- NULL
    y
  })

> dplyr::bind_rows(x)
# A tibble: 2 x 3
  name     id email
  <chr> <int> <chr>
1 Sue       0 ""   
2 Bob       0 ""  
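
For completeness, keeping phone as a nested list-column might work along these lines (an untested sketch, assuming the stock addressbook.proto where phone is a repeated PhoneNumber):

> x <- lapply(a$person, function(i) {
    y <- as.list(i)
    y$phone <- list(purrr::map(y$phone, as.list))  # nest the repeated messages
    tibble::as_tibble(y)
  })
> dplyr::bind_rows(x)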

How do we save back to the same protobuf schema after some munging? This mailing-list thread might be helpful: "Writing R data.frames out to serialized lists of protocol buffers".

To get back to a serialized list of protocol buffers:

# Load functions at http://lists.r-forge.r-project.org/pipermail/rprotobuf-yada/2011-June/000202.html

> df <- dplyr::bind_rows(x)
> pb <- dataFrameToProtoBufs(df, "/path/to/addressbook.proto", "tutorial.Person")

If you want that as an address book:

> new(tutorial.AddressBook, person = pb)
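
And to write that back to disk, the serialize() method on a Message takes a file name (as in the RProtoBuf quick-start):

> ab <- new(tutorial.AddressBook, person = pb)
> ab$serialize("addressbook.bin")   # or ab$serialize(NULL) for a raw vector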

@eddelbuettel Do you have any further tips?

I apologize but I may not have the time to write an ad-hoc and on-demand tutorial for you here.

"Serialization" is a big topic. You will find copious tutorials and write-ups online. I still recommend that you try to think a bit more in terms of 'atomic' operations here -- C++ knows nothing about dplyr, tibbles, .. so maybe try not to focus so much on those.

I will close this as there is no actual deficiency in the package.

Fair. Thanks for the thoughts.

I'm only trying to make sense of this after building lots of R analyses on tabular data, and I did not know the right language to Google.

This is actually both reasonably hard and reasonably powerful stuff. I do encourage you to experiment and read around. E.g., I just finished a paper with a coauthor on RcppMsgPack -- another efficient, albeit schema-less, serialization scheme.

At the very beginning of this, I had an example of turning a long vector into a PB message; it turned out not to be efficient. Protocol Buffers is still very widely used and powerful, but you may have to come to terms with how it is used. Much current use, at Google and elsewhere, goes through https://grpc.io, which adds the networking layer that PB itself lacks.