/*
- Copyright 2018 Christian Arnault
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
http://www.apache.org/licenses/LICENSE-2.0
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License. */
Multi Language Interfaces
This product proposes some way to interface different coding languages implied in the process of scientific programming in the context of distributed programming such as Apache Spark, and in particular, this concerns the development around the AstroLab Sofware organization, or developments using it.
This describes the issues, problems, tools implied in interfacing various languages such as:
- C
- C++
- Fortran
- Java
- Scala
- Python
Generally speaking, since we are mainly concerned with the use of Apache Spark, which is mainly coded in Scala, we mainly try to interface all scientific oriented languages to Scala.
Contents:
- Multi Language Interfaces
Using JNA
JNA is a new generation of tools meant to interface JVM with native (compiled) code. It uses JNI to dynamically (using the introspection features of Java) connect to compiled and linked entry points.
Basic data types (scalars and arrays) are directly usable, including ByReference access.
Complexed types (structures) need the construction of a Java equivalent Structure which requires an ordering operation for the structure fields only available in Java, and based upon the introspection from Java.
References for JNA
- https://github.com/java-native-access/jna
- https://maven.java.net/content/repositories/releases/net/java/dev/jna/jna/
- https://java-native-access.github.io/jna/4.2.0/overview-summary.html
- http://jnaexamples.blogspot.com/2012/03/java-native-access-is-easy-way-to.html
- https://www.sderosiaux.com/2016/08/03/jna-java-native-access-enjoy-the-native-functions
How to link C/C++/Fortran -> Scala
To explain we start from a set of examples.
-
we consider a library with offering entry points (C/C++/Fortran)
-
so far the following types haves been tested:
C Scala
-------------------------
int Int
double Double
double[] Array[Double]
-
How to declare those entry points using JNA:
-
import the JNA stuff:
import com.sun.jna.{Library, Native, Platform}
- create a Scala trait to declare the signature of all external entry points, using a Scala syntax
// declaring to JNA
trait EntryPoints extends Library {
def mysum(x: Int, y: Int): Int
def mymultiply(x: Double, y: Double): Double
def myarray(x: Array[Double], arrayln: Int): Unit
}
- install, within a Scala object, the loading actions to load the shared libraries implementing the external functions
object Libraries {
def sum = Native.loadLibrary("sum", classOf[EntryPoints]).asInstanceOf[EntryPoints]
def mul = Native.loadLibrary("mul", classOf[EntryPoints]).asInstanceOf[EntryPoints]
}
- Calling the external functions:
val r1 = Libraries.sum.mysum(1, 2)
val r2 = Libraries.mul.mymultiply(1.111, 2.222)
- Exchanging arrays
We may pass an array to the external code. The array can be modified in place.
suppose yu create a C function that changes an array as follows:
void myarray(double array[], int arraylen) {
int i = 0;
for (i=0; i < arraylen; i++) { array[i] *= 2; }
}
val a = Array[Double](...)
Libraries.mul.myarray(a, a.length)
- Overhead in calling a C function:
Of course, using JNA has a cost. Comparing calling the math "cos" function from straight Scala and using the C cos through JNA:
-
scala cos> 0.34 µs
-
C cos> 5.3 µs
Exchanging structures
At a first approach we consider the exchange through pointers.
A C structure will be declared as a derived Java/Scala class of the jna's abstract "Structure" class. This class only has to implement one abstract method: "getFieldOrder" that returns a List giving the ordered list of field names of the structure (to help JNA to perform introspection to the C objects).
Once this is done, referenced objects can be used in Java/Scala from/to C/C++.
Using values by reference (ie: using pointers)
A value (in the Scala/Java world) can be viewed/transmitted by reference using the com.sun.jna.ptr.IntByReference (and XxxByReference for other Scala/Java).
(cf. https://java-native-access.github.io/jna/4.2.1/com/sun/jna/ptr/ByReference.html)
====================== C ===============================
void modify(int* ptr);
void modify(int* ptr) {
*ptr = 12;
}
=========================================================
====================== Scala ===========================
import com.sun.jna.ptr.{IntByReference}
trait EntryPoints extends Library {
def modify(ptr: IntByReference)
}
...
val ptr = new IntByReference(10)
Libraries.native.modify(ptr)
println(s"ptr = ${ptr.getValue}")
...
==========================================================
How use external functions in a Spark pipeline
The principle is to dynamically load the shared libraries right when it's needed, ie. within the lambda, executed in the Spark operation (map/reduce/...) right when it's needed, ie. before calling the external functions.
In addition, all shared libraries has to be declared using the --files <path/xxx.so>
option of the spark-submit
command line.
val l = sc.parallelize((1 to 10)).map(x => {LibraryLoader.loadsum; Libraries.sum.mysum(x, 12)})
It should be noted that the loader operation will ensure that the shared library(ies) will be serialized, then transparently deployed to all workers
Issues related with C++
The Jna's API is only able to understand C types. Then when dealing with C++ coding, a mangling is applied to function names (to support mutiple function signatures !!). The declaration of native functions in the Scala/Java world has to worry about the exact external naming scheme, thus the mangling has to be considered.
For both reasons it's asked to construct one C flavour of all C++ external functions. Or at least declare them in a extern "C" block when possible (when the signature is compatible with a C syntax).
An other important aspect of the C/C++ interfacing to Scala/Java world, is the fact that generally non C/C++ scalar types have to be manually allocated (on the heap) and when these objects have to be returned back to the Java world, it will be required to take care of their deallocation.
The next example shows both the mangling aspect and the manual allocation of the returned value.
We consider une function that concatenates two std::string objects and return the result.
The primary C++ function that operates this operation can be written as follows (we first consider a version that returns nothing):
#include <iostream>
#include <string>
using namespace std;
void _myconcat (const string a, const string b)
{
const string r = a + b;
cout << "_myconcat> " << " a=" << a << " b=" << b << " r=" << r << endl;
}
Of course this function will be mangled and contains C++ types not directly usable in Java. Thefore a C interface flavour has to be introduced:
#include <iostream>
#include <string>
using namespace std;
extern "C" {
void myconcat (const char* a, const char* b);
}
void _myconcat (const string a, const string b):
void myconcat (const char* a, const char* b)
{
cout << "myconcat> " << " a=" << a << " b=" << b << endl;
_myconcat(string(a), string(b)).c_str();
}
void _myconcat (const string a, const string b)
{
const string r = a + b;
cout << "_myconcat> " << " a=" << a << " b=" << b << " r=" << r << endl;
}
If we look at the public entry points, we understand why the C++ entry point cannot declared as it is:
> nm libmy_udf.so | egrep " T "
...
00000000000010b2 T myconcat
...
000000000000122e T _Z9_myconcatNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES4_
...
Then we change the function so as it returns the result of the concatenation:
#include <iostream>
#include <string>
using namespace std;
extern "C" {
const char* myconcat (const char* a, const char* b);
}
const string _myconcat (const string a, const string b);
const char* myconcat (const char* a, const char* b)
{
static const char* r = _myconcat(string(a), string(b)).c_str();
cout << "myconcat> " << " a=" << a << " b=" << b << " r=" << r << endl;
return r;
}
const string _myconcat (const string a, const string b)
{
const string r = a + b;
cout << "_myconcat> " << " a=" << a << " b=" << b << " r=" << r << endl;
return r;
}
if we run this implementation:
int main()
{
const string result = _myconcat("aa", "bb");
cout << "Result = " << result << endl;
const char* result2 = myconcat("aaa", "bbb");
cout << "Result = " << result2 << endl;
return 0;
}
> ./mytest
_myconcat> a=aa b=bb r=aabb
Result = aabb
_myconcat> a=aaa b=bbb r=aaabbb
myconcat> a=aaa b=bbb r=a
Result = @@@
The result is lost because the local result computed from the C function is lost after the scope is closed (this is not true for the C++ function since the string object is returned by value)
Thus we have to mimic the return by value mechanism by creating a string object (that eventually be destroyed).
#include <iostream>
#include <string>
#include <string.h>
#include <stdlib.h>
using namespace std;
extern "C" {
const char* myconcat (const char* a, const char* b);
void myfree(const void* str);
}
const string _myconcat (const string a, const string b);
const char* myconcat (const char* a, const char* b)
{
static const char* r = _myconcat(string(a), string(b)).c_str();
void* rr = malloc(strlen(r) + 1);
cout << "myconcat> " << " a=" << a << " b=" << b << " r=" << r << " pointer" << rr << endl;
strcpy((char*) rr, r);
return (char*) rr;
}
void myfree(const void* str) {
cout << "myfree> " << " pointer=" << str << endl;
free((void*) str);
}
const string _myconcat (const string a, const string b)
{
const string r = a + b;
cout << "_myconcat> " << " a=" << a << " b=" << b << " r=" << r << endl;
return r;
}
Now the test program will behave properly:
> mytest
_myconcat> a=aa b=bb r=aabb
Result = aabb
_myconcat> a=aaa b=bbb r=aaabbb
myconcat> a=aaa b=bbb r=aaabbb pointer0x1ad2c30
Result = aaabbb
myfree> pointer=0x1ad2c30
Various tutos to explicit use cases
This tuto directory includes:
- One C module: src/native_udf.c + src/native_udf.h built into the linux directory
- One Java file to declare the entry points: ca/CUDF.java
- One Java application : ca/App.java
- One Makefile implementing some targets:
- make clean
- make lib
- make classes
- make all = lib + classes
- make run triggers all
Using the repository
This development tries to apply the explanations written in this document. This is a SBT based structure, ie. sources are located in the "src" directory, with the following structure:
src/C C or C++ sources
src/main/java Java sources
src/main/scala Scala sources
src/main/python Python sources
src/test/scala Unit test scala sources
Java & Scala sources are organized in the "com.astrolab.Interfaces...." packages.
At the top level, are the management tools:
- Makefile to build the C/C++ module as a shared library ("libmy_udf.so"). All C/C++ modules are gathered within one single shared library.
- build.sbt (together with project/* configuration files for SBT) to build and test the Scala elements.
- run.sh, shell script to run the Spark based application.
Compiling & building
Building the shared library grouping all C/C++ modules:
> make clean all test
This will create the ./libmy_udf.so shared library containing all C/C++ entry points callable from Java/Scala/Spark
Building the Scala applications:
> export LD_LIBRARY_PATH=`pwd`
> sbt package
> sbt test
Running the test program:
> export LD_LIBRARY_PATH=`pwd`
> ./run.sh
Utilisation du package Jep pour interfacer Scala et Python.
Références:
One example
val jep = new Jep(new JepConfig().addSharedModules("numpy"))
jep.eval("import numpy as np")
val arraySize = 1000000
jep.set("x", 10)
jep.getValue("x")
jep.eval("y = np.random.rand(2, 3)")
jep.getValue("y.shape")
jep.eval("z = np.random.rand(arraySize)")
jep.getValue("z.shape")
{
val f = Array.fill(arraySize)(Random.nextFloat)
val nd = new NDArray[Array[Float]](f, arraySize)
jep.set("t", nd)
}
Result of the bench:
x=10> Elapsed time: 0.276785568 µs
getValue(x)> Elapsed time: 7.149102638 µs
y = np.random.rand(2, 3)> Elapsed time: 20.37042373 µs
getValue(y.shape)> Elapsed time: 11.65154456 µs
z = np.random.rand(1000000)> Elapsed time: 12.649986593 ms
getValue(z.shape)> Elapsed time: 11.224750006 µs
xfer array Elapsed time: 14.170212113 ms
Example with matplotlib
import jep._
object Tester {
def plot: Unit = {
println("plot")
val jep = new Jep(new JepConfig().addSharedModules("numpy", "matplotlib"))
jep.eval("import numpy as np")
jep.eval("import matplotlib")
jep.eval("matplotlib.use('Agg')")
jep.eval("import matplotlib.pyplot as plt")
jep.eval("t = np.arange(0.0, 2.0, 0.01)")
jep.eval("s = 1 + np.sin(2 * np.pi * t)")
jep.eval("fig, ax = plt.subplots()")
jep.eval("ax.plot(t, s)")
jep.eval("fig.savefig('test')")
}
def main(args: Array[String]): Unit = {
plot
}
}
With sbt
We suppose that we got jep from "pip install --user jep". Then build.sbt will look like:
name := "testjep"
version := "0.1"
scalaVersion := "2.11.8"
unmanagedBase := file("/home/christian.arnault/.local/lib/python3.5/site-packages/jep")
> sbt clean assembly "runMain ca.Tester"