OpenMined/JavaDP

Proposal for Java implementation of Google DP library

simcof opened this issue · 9 comments

Proposal: Java implementation of Google DP library

Author(s): [Benjamin Szymkow, Jonathan Passerat-Palmbach, Lukasz Kozuchowski]

Last updated: 29th March, 2020

Ongoing discussion on the OpenMined Slack. DM @ben S for details.

Abstract

Enabling app developers using the Java family of languages (in particular Kotlin and Scala) to use Google's Differential Privacy library is critical to addressing privacy concerns. Sharing de-identified data while preserving the privacy of app users is critical to the global response to the COVID-19 pandemic we currently face.

Background

There are many approaches to achieving this goal. This proposal will detail potential solutions prior to this project commencing.

Proposal

[A precise statement of the proposed change.]

Rationale

A discussion of alternate approaches and the trade-offs, advantages, and disadvantages of the specified approach for giving Java/Kotlin/Scala users access to Google's differential privacy library.

Requirements

The solution needs

Implementation

[A description of the steps in the implementation, who will do them, and when.]

Open issues (if applicable)

[A discussion of issues relating to this proposal for which the author does not
know the solution. This section may be omitted if there are none.]

Here's what I have for now. Quotes are from an oldish, but I believe still relevant, Stack Overflow discussion.

JNA is much slower than JNI, but much easier. If performance is not an issue use JNA.

Some people claim they had slowdowns between 3x and 40x in different scenarios when moving from JNI to JNA.

JNA is for C and doesn't support C++ classes and methods. A workaround for JNA is to use C++ wrapper functions declared with extern "C". I suspect templates would be an issue if we tried JNA.
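To make the extern "C" workaround concrete, here is a minimal sketch. The RunningSum class and the running_sum_double_* functions are hypothetical names made up for illustration; they are not part of Google's DP library. The sketch also shows why templates are awkward here: C linkage has no notion of templates, so each concrete instantiation needs its own set of wrapper functions.

```cpp
#include <cassert>

// Hypothetical templated C++ class standing in for a library type.
// It is NOT taken from Google's DP library.
template <typename T>
class RunningSum {
 public:
  void Add(T value) { total_ += value; }
  T Total() const { return total_; }

 private:
  T total_ = T();
};

// C-linkage wrappers. The class is hidden behind an opaque pointer, and the
// template is pinned to one concrete type (double) because C functions
// cannot be templated. Another type would need another set of functions.
extern "C" {

// Creates an instance and returns it as an opaque handle.
void* running_sum_double_new() { return new RunningSum<double>(); }

// Adds a value to the instance behind the handle.
void running_sum_double_add(void* handle, double value) {
  assert(handle != nullptr);
  static_cast<RunningSum<double>*>(handle)->Add(value);
}

// Reads the current total from the instance behind the handle.
double running_sum_double_total(const void* handle) {
  assert(handle != nullptr);
  return static_cast<const RunningSum<double>*>(handle)->Total();
}

// Destroys the instance; the caller must not reuse the handle afterwards.
void running_sum_double_delete(void* handle) {
  delete static_cast<RunningSum<double>*>(handle);
}

}  // extern "C"
```

On the Java side, a binding layer that only speaks C would see four plain functions and an opaque pointer, so object lifetime has to be managed by hand through the new/delete pair.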

Someone claimed:

JavaCPP is as easy to use as JNA, but as fast as raw JNI.

Sounds nice!

My experiments from today -> as I was saying on Slack, I followed the SWIG route.

TLDR -> What I have:

  • Util and StatusOr modules from Google's DP wrapped in SWIG
  • tested code generation for Python and Java
  • code for Java target successfully reused from Scala
  • added a separate example file to the upstream C++ codebase, to compare outputs

🚀 You can find my demo here: https://github.com/jopasserat/differential-privacy 🎢

More details:

I embedded my demo in the C++ repo to avoid having to deal with the build artefacts via Bazel (not much experience).

  • New example file prints the default value of epsilon and calls another function.
  • So does the Java codebase
  • Scala script reuses the previous Java module
  • Python script calls a few more functions from the Util module
  • two build scripts are available to replay the demo:
    • pre-requisites: JDK and header files, Python headers, swig CLI, bash
    • Google's DP library successfully built in the repo's directory (to link to the built artefacts)

Challenges

  • Integrate SWIG with Bazel
  • do more tests to make sure everything can be covered

Benefits:

  • single wrapper for all languages

Integration of SWIG with Bazel won't be an issue. TensorFlow itself uses SWIG, but they did move their Python code to pybind11.
We should explore how TF does Java bindings.

According to their readme, they point to JNI in their installation requirements. SWIG for Java does also generate JNI so 🤷‍♂️

Yeah I've seen TF switched to pybind11 and I think it's worth taking their reasons into consideration here: (4 points copy-pasted from the TF PR)

  • code readability -> debatable: from what I've seen pybind's syntax is pretty verbose
  • build times -> makes sense for TF, but is that really much of an issue for a small lib?
  • binary size -> same as above, however I think the Kotlin/Android Swift/iOS folks are better positioned to tell us what's decent or not here
  • performance of the Python API -> premature optimisation is the root of all evil 😉

To be fair, it doesn't have to replace pybind11; I was just wondering if it would make more sense to focus our efforts on getting a comprehensive wrapper ready quicker. In that context, it seems that SWIG can help do it for most of the languages we're targeting (JS also covered).

I kind of like the idea to have a single source of truth for a wrapper to lower the maintenance effort when the upstream lib gets modified. However, I'd be happy to get further input on SWIG, I might be missing an obvious drawback and you seem to have had to make this choice before opting for pybind11 -> any tips would be valuable @chinmayshah99

@jopasserat
Hi,
Your demo looks great! Unfortunately I failed to run it. Both the Java and Python build scripts give me:

g++: error: carrot_wrap.cxx: No such file or directory

Am I doing something wrong or is there a file missing?

there must be an error earlier in the script 🤔
can you please try running the build script again with bash -x build_python.sh?

+ swig -python -c++ -outdir python -o python/carrot_wrap.cxx carrot.i
../differential_privacy/base/statusor.h:109: Warning 362: operator= ignored
../differential_privacy/base/statusor.h:114: Warning 362: operator= ignored
../differential_privacy/base/statusor.h:126: Warning 362: operator= ignored
../differential_privacy/base/statusor.h:128: Warning 362: operator= ignored
../differential_privacy/base/statusor.h:153: Warning 362: operator= ignored
../differential_privacy/base/statusor.h:162: Warning 362: operator= ignored
../differential_privacy/base/statusor.h:172: Error: Syntax error - possibly a missing semicolon.

I get the same Error for the JVM script but without the preceding warnings. I see nothing weird in statusor.h and I haven't changed it. The swig commands are:

swig -java -c++ -outdir jvm -o jvm/carrot_wrap.cxx carrot.i
swig -python -c++ -outdir python -o python/carrot_wrap.cxx carrot.i

I built the google library successfully.

I've just tried from a fresh clone and it works on my machine (Debian testing), we must have some discrepancy between our setups.

what version of swig do you have? mine is 4.0.1

I had swig 3.0.12 on my Ubuntu 18.04. I updated to 4.0.1 (compiled from Debian testing sources) and it's almost working now 🙃 I'm still somehow missing libhashtablez_sampler.so to run it but I can probably figure it out.