twosigma/beakerx

implement autotranslation with arrow

scottdraves opened this issue · 3 comments

followup to #5039

use arrow and shared memory.

Looking forward to it.

Hello, my name is Murray Davis. I am a developer with AdTheorent (http://adtheorent.com/). I have just made a POC that creates a Pandas DataFrame in a Jupyter Notebook, converts it to an ArrowRecordBatch with PyArrow, and stores the ArrowRecordBatch in a Plasma Object Store, i.e. shared memory. Then, it invokes through Py4J a Scala/Java application to fetch the ArrowRecordBatch from the Plasma Object Store with the Java Plasma Client, convert it into our own DataFrame format, and learn the data with our own real-time Java machine learner.

We successfully tested it with multiple Python threads after synchronizing the code segment that uses the Python Plasma Client, since it is not thread-safe. (We are using Arrow 0.9.0.)

Our motivation was to improve the performance of a Python Jupyter Notebook when integrated with our Java machine learner. We had found that transmitting serialized Pandas DataFrames through Py4J is a significant bottleneck.

Our POC has been successful in significantly improving the speed of our integration.

We have very recently become aware of BeakerX, and I have only started to try it out this week. We can see that BeakerX may have the potential to host our Python/Java integration with Arrow Plasma.

Does it sound like our work could contribute to the implementation of autotranslation with Arrow?

"found that transmitting serialized Pandas DataFrames through Py4J is a significant bottleneck."

(context outside beakerx) I used jep https://github.com/ninia/jep NDArray to share zero copy java NIO buffers between JVM (off-heap memory) getting data from spark and calling Python which is doing no IO.

So, again, used in-memory integration and no remoting or copying large data as with with py4j.