eddelbuettel/rprotobuf

RProtoBuf cannot deserialize certain invalid R objects

Closed this issue · 7 comments

R version 3.3.2 (2016-10-31)
RProtoBuf_0.4.9

When using this library to serialize and deserialize certain lm objects:

x <- factor(c("a","b","c","a","c"))
y <- 1:5
z<-lm(x ~ y)
serialized <- RProtoBuf::serialize_pb(object=z, connection=NULL)
RProtoBuf::unserialize_pb(serialized)

I get the following error:

Error in attributes(xobj) <- attrib : adding class "factor" to an invalid object

This is because lm is creating an invalid r object (a factor class with non integer values):

z$residuals
Error in as.character.factor(x) : malformed factor`

Although these calls to lm are invalid, It would be nice for the library to be able to support this since both read and readRDS both do. It would make sense for the library to be able to recreate the state of the object regardless of validity

I'm presuming this will most likely require part of the deserialization code to be written in C?

PRs are always welcome :)

Any suggestions on where this might fit into the C source code?

So the bug seems to be the lack of support for factor types as a plain lm() on numeric data serializes just fine:

> y <- rnorm(10)
> x <- runif(10)
> fit <- lm(y ~ x)
> serialized <- RProtoBuf::serialize_pb(object=fit, connection=NULL)
> 

Factors are somewhere between tricky (to get levels rights) and simple (really just an integer vector). We also don't have them in Rcpp itself which may be part of the problem.

I think it is more to do with the deserialization side. unrexp sets the attributes of the deserialized object within R itself which prevents the object from being recreated.

I was able to get the above example to deserialize properly by bypassing r's checking of factor types and setting the class attribute directly.

   unsafe_set_class <- cfunction(c(vec = "ANY",klass="ANY"), '
            PROTECT(vec);
            PROTECT(klass);
            SEXP t = R_NilValue;
            if(TYPEOF(vec) == CHARSXP)
            	error("cannot set attribute on a CHARSXP");
            /* this does no allocation */
            for (SEXP s = ATTRIB(vec); s != R_NilValue; s = CDR(s)) {
            	if (TAG(s) == R_ClassSymbol) {
            	    SETCAR(s, klass);
                    return R_NilValue;
                }
                t = s;
            }

            SEXP s = CONS(klass, R_NilValue);
            SET_TAG(s, R_ClassSymbol);
            if (ATTRIB(vec) == R_NilValue) SET_ATTRIB(vec, s); else SETCDR(t, s);
            SET_OBJECT(vec, 1);
            UNPROTECT(2);
            return R_NilValue;
        ')

   unsafe_set_class(xobj,as.character(attrib["class"]))

It seems lm must do something similar when setting the attributes of the object. Could this be a viable workaround?

Dunno.if.

I also thought about this a little during the day and factors are also of limited interest to Protocol Buffers when transferring between R and C++/Java/Python/... none of which have factor types.

Of course, as a workaround you could always serialize to raw using digest() and then pass the raw vector via ProtoBuf.

We still would like the deserializer to not crash in these circumstances as the object was created by native R functions.

Would you accept a pull request which specifically handles this case on the deserializer side which skips setting the class attribute if the values are invalid?

Thanks very much for your PR #29 which fixes this.

As it happens, CRAN just asked for another small fix I will make shortly so this will go out "shortly" in a fresh 0.4.10 release. Thanks again, this was very helpful.