RConsortium/marshalling-wg

Want to join this working group?

Opened this issue · 11 comments

Hi all,

let us know if you'd like to join this working group on 'Marshaling and Serialization in R'. To join, just add a comment below with a very brief introduction of yourself and your interest in this topic.

Thanks for the invitation to join. Happy to help in any way I can!

R's serialisation powers some things I've written, e.g. {xxhashlite} and rlang::hash(). I'm also interested in some low-level aspects of R, e.g. {rbytecode}.

Hey y'all! More than happy to tag along. Simon from Chicago, IL, USA; I work on packages for predictive modeling at Posit.

We end up thinking about marshaling/serialization a good bit in the context of model deployment and training in parallel. For model deployment, we put together a package last year for marshaling model objects that standardizes interfaces to the serialization methods of different modeling packages. Serialization also comes up in our support for parallel processing, where models may be fitted in several R processes and then handed back to a parent process for analysis.

cc @juliasilge and @topepo. I won't be able to make it to useR! in person, but Max will be there for the in-person meeting.

Not sure how much time I'll have to participate, but I'll try to follow what goes on and chip in from time to time. I developed the current serialization framework a number of years ago. The main goals of that redesign of what came before were to support parallel computation and separate loading of objects in a collection while maintaining identity of mutable objects, mainly environments (i.e. lazy loading).
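For readers unfamiliar with the refhook mechanism described above, here is a minimal sketch (the hook functions, tags, and side table are my own invention, not from the comment): serialize() consults the hook for each object; returning a character vector writes that tag into the stream in place of the object, returning NULL falls back to default serialization, and the matching unserialize() hook maps tags back to objects.

```r
# Sketch of base R's refhook mechanism for reference objects.
side_table <- list()  # out-of-band store keyed by tag

out_hook <- function(x) {
  if (is.environment(x) && !is.null(attr(x, "tag"))) {
    tag <- attr(x, "tag")
    side_table[[tag]] <<- x  # stash the live object out of band
    tag                      # character vector => written as a reference
  } else NULL                # NULL => serialize this object normally
}

in_hook <- function(tag) side_table[[tag]]  # restore from the side table

e <- new.env(); e$value <- 42; attr(e, "tag") <- "env1"
bytes    <- serialize(list(env = e), NULL, refhook = out_hook)
restored <- unserialize(bytes, refhook = in_hook)
identical(restored$env, e)  # TRUE: object identity is preserved
```

Because the hook hands back the very same environment, mutable-object identity survives the round trip, which is the property the lazy-loading design depends on.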

mirai provides an implementation of what is currently possible using the serialization framework @ltierney describes above. Specifically, it interfaces at the C level with the 'refhook' system for reference objects, supporting their use in parallel and distributed computing.

This feature was originally motivated by parallel computations involving torch tensors, as described in https://shikokuchuo.net/mirai/articles/torch.html, and following helpful discussions with @dfalbel.

Support was subsequently broadened to a much wider class of serialization functions, as described in https://shikokuchuo.net/mirai/articles/mirai.html#serialization-arrow-polars-and-beyond, which also benefited from input by @eitsupi.

Finally, it also allows hosting of ADBC database connections in parallel processes as described in https://shikokuchuo.net/mirai/articles/databases.html, where @krlmlr was instrumental in proposing and verifying this use case.
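For illustration, the torch registration described in the first linked vignette looks roughly like this (argument names follow that vignette; the API has evolved across mirai versions, so treat this as a sketch and check the current docs):

```r
# Register custom serialization for a class of reference objects,
# so torch tensors can cross process boundaries transparently.
library(mirai)
library(torch)  # assumed installed

cfg <- serial_config(
  class = "torch_tensor",         # objects of this class get special handling
  sfunc = torch::torch_serialize, # object -> raw vector
  ufunc = torch::torch_load       # raw vector -> object
)

daemons(1, serial = cfg)          # daemons apply the registered functions
m <- mirai(x * 2, x = torch_tensor(c(1, 2, 3)))
m[]                               # tensor results return as tensors
daemons(0)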

Thanks for the invite. I develop targets and crew, both of which rely on sending objects to concurrent R processes. targets lets you select or customize a "format", which is a storage type that covers serialization and marshaling. It works, but it is not implicit, and some users have struggled with the extra responsibility.
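For readers unfamiliar with targets, the explicit format choice mentioned above looks like this in a _targets.R file ("rds" and "qs" are documented format values; the pipeline itself is a made-up example):

```r
# _targets.R: each target declares how its return value is stored.
library(targets)
list(
  tar_target(data, data.frame(x = rnorm(100))),              # default "rds"
  tar_target(model, lm(x ~ 1, data = data), format = "qs")   # faster qs format
)
```

The format governs how the object is serialized to the data store and read back, which is exactly the responsibility users have found easy to get wrong.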

Thanks for the invite. I'm one of the developers of BiocParallel and SharedObject, which provide a parallelization framework for all Bioconductor packages. I have been thinking about serialization for a while. One interesting topic is how to serialize/unserialize only once per machine and make the object available to all workers on that machine. The current approach ignores the fact that multiple workers share a computer and sends the object to each worker separately, which is a clear waste of resources in distributed computing. I don't know what the best solution is, but I'd be happy to hear any ideas.

Great to meet you @wlandau. I am using your package targets to manage my data extraction pipeline, and it is incredibly helpful. Frankly speaking, I am one of the people who struggled with the extra responsibility you mentioned. I like it, but also hate it. I might open a thread in your repository to discuss automating the format selection :)

Hi all,

Tomasz here from mlverse at Posit. I'm more than happy to help out too. I am particularly interested in making sure S7, reticulate, TensorFlow/Jax/Keras, torch, and things like them (R external pointers, potentially complex environment requirements) work well with whatever the final solution is.

@HenrikBengtsson can you add @t-kalinowski to the team? I don't have the rights to do so.

Welcome Tomasz - great to have you on board.

@HenrikBengtsson can you add @t-kalinowski to the team,

Invite sent for the https://github.com/orgs/RConsortium/teams/marshalling-wg/ team.

... I don't have the rights to do so

Hmm... that's odd. You've got the "admin" role for this repository, which is the same as I have. It's also the highest role I can assign; there is no "owner" option. It could be that I have higher privileges through my formal membership of the R Consortium, which is why I can send the invite.

I think that if someone is not part of the org yet then an owner needs to invite them to the org first before a team admin can do any addition. Feel free to email operations@r-consortium.org whenever you need someone added.

Very keen for this. I've been scratching around for how to access the equivalent of the Python numcodecs package.

I'm an RSE at the Australian Antarctic Division working on getting better access to Zarr in R, which involves a clear set of compress/decompress steps applied to objects, files, or byte ranges of them.

There's some prior art in {archive}, Bioconductor's {Rarr}, keller-mark/{pizzarr} on GitHub, and obviously Arrow. I'm happy to see zstd added to mem(De)Compress in upcoming R, but I think we need a wider net.
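For context, the mem(De)Compress round trip looks like this (shown with "gzip", which is available today; "zstd" becomes a valid type in the upcoming R release mentioned above):

```r
# Compress a serialized object in memory and recover it losslessly.
payload <- serialize(mtcars, NULL)        # object -> raw bytes
z <- memCompress(payload, type = "gzip")  # or type = "zstd" in newer R
stopifnot(length(z) < length(payload))    # compression actually helped
back <- memDecompress(z, type = "gzip")
identical(unserialize(back), mtcars)      # TRUE: lossless round trip
```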