This tool allows C# UDF to be registered and invoked from Scala or Python in Apache Spark.
- Open
/src/csharp/UdfSerializer.sln
. - Update UDF definition defined in
Main
inProgram.cs
. The example used througout the instruction is the following:
Udf<string, string>(str => "my udf: " + str);
- Build and run. You will see the following output:
[2019-08-09T18:43:40.6919509Z] [TERRYK-MSFT] [Debug] [ConfigurationService] Using the environment variable to construct .NET worker path: E:\source\repos\GitHub\imback82\spark\artifacts\bin\Microsoft.Spark.Worker\Debug\netcoreapp2.1\win-x64\publish\Microsoft.Spark.Worker.exe.
serializedUdf=AAAAA1JvdwAAAANSb3cAAAABTgAABnsAAQAAAP////8BAAAAAAAAAAwCAAAAUk1pY3Jvc29mdC5TcGFyaywgVmVyc2lvbj0wLjQuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPWNjN2IxM2ZmY2QyZGRkNTEFAQAAADFNaWNyb3NvZnQuU3BhcmsuVXRpbHMuQ29tbWFuZFNlckRlK1VkZldyYXBwZXJEYXRhAgAAACA8VWRmV3JhcHBlck5vZGVzPmtfX0JhY2tpbmdGaWVsZBU8VWRmcz5rX19CYWNraW5nRmllbGQEBDNNaWNyb3NvZnQuU3BhcmsuVXRpbHMuQ29tbWFuZFNlckRlK1VkZldyYXBwZXJOb2RlW10CAAAAKE1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStVZGZEYXRhW10CAAAAAgAAAAkDAAAACQQAAAAHAwAAAAABAAAAAQAAAAQxTWljcm9zb2Z0LlNwYXJrLlV0aWxzLkNvbW1hbmRTZXJEZStVZGZXcmFwcGVyTm9kZQIAAAAJBQAAAAcEAAAAAAEAAAABAAAABCZNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrVWRmRGF0YQIAAAAJBgAAAAUFAAAAMU1pY3Jvc29mdC5TcGFyay5VdGlscy5Db21tYW5kU2VyRGUrVWRmV3JhcHBlck5vZGUDAAAAGTxUeXBlTmFtZT5rX19CYWNraW5nRmllbGQcPE51bUNoaWxkcmVuPmtfX0JhY2tpbmdGaWVsZBc8SGFzVWRmPmtfX0JhY2tpbmdGaWVsZAEAAAgBAgAAAAYHAAAA/wFNaWNyb3NvZnQuU3BhcmsuU3FsLlBpY2tsaW5nVWRmV3JhcHBlcmAyW1tTeXN0ZW0uU3RyaW5nLCBTeXN0ZW0uUHJpdmF0ZS5Db3JlTGliLCBWZXJzaW9uPTQuMC4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49N2NlYzg1ZDdiZWE3Nzk4ZV0sW1N5c3RlbS5TdHJpbmcsIFN5c3RlbS5Qcml2YXRlLkNvcmVMaWIsIFZlcnNpb249NC4wLjAuMCwgQ3VsdHVyZT1uZXV0cmFsLCBQdWJsaWNLZXlUb2tlbj03Y2VjODVkN2JlYTc3OThlXV0BAAAAAQUGAAAAJk1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStVZGZEYXRhAwAAABk8VHlwZURhdGE+a19fQmFja2luZ0ZpZWxkGzxNZXRob2ROYW1lPmtfX0JhY2tpbmdGaWVsZBs8VGFyZ2V0RGF0YT5rX19CYWNraW5nRmllbGQEAQQnTWljcm9zb2Z0LlNwYXJrLlV0aWxzLlVkZlNlckRlK1R5cGVEYXRhAgAAAClNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrVGFyZ2V0RGF0YQIAAAACAAAACQgAAAAGCQAAAAw8TWFpbj5iX18wXzAJCgAAAAUIAAAAJ01pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStUeXBlRGF0YQMAAAAVPE5hbWU+a19fQmFja2luZ0ZpZWxkHTxBc3NlbWJseU5hbWU+a19fQmFja2luZ0ZpZWxkITxBc3NlbWJseUZpbGVOYW1lPmtfX0JhY2tpbmdGaWVsZAEBAQIAAAAGCwAAABlVZGZTZXJpYWxpemVyLlByb2dyYW0rPD5jBgwAAABEVWRmU2VyaWFsaXplciwgVmVyc2lvbj0xLjAuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPW51bGwGDQAAABFVZGZTZXJpYWxpemVyLmRsbAUKAAAAKU1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStUYXJnZXREYXRhAgAAABk8VHlwZURhdGE+a19fQmFja2luZ0ZpZWxkFzxGaWVsZHM+a19fQmFja2luZ0ZpZWxkBAQnTWljcm9zb2Z0LlNwYXJrLlV0aWxzLlVkZlNlckRlK1R5cGVEYXRhAgAAACpNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrRmllbGREYXRhW10CAAAAAgAAAAkOAAAACgEOAAAACAAAAAkLAAAACQwAAAAGEQAAABFVZGZTZXJpYWxpemVyLmRsbAs=
udfReturnType="string"
What we need is serializedUdf
and udfReturneType
.
- Check
/src/scala/src/main/scala/App.scala
to see how the C# UDF is injected. - Run
mvn package
from/src/scala
and it should produce/src/scala/target/sparkdotnetudf-1.0-SNAPSHOT.jar
. - Copy the
UdfSerializer.dll
built in 3) to the current working directory (run from the root of the repo). This step is important since this dll will contatin the IL bytes for the UDF defined.
cp src\csharp\UdfSerializer\bin\Debug\netcoreapp3.1\UdfSerializer.dll .
- Download and unzip the worker binaries from https://github.com/dotnet/spark/releases/tag/v0.12.1. I used .NET Core for Windows for this example and unzipped to c:\dotnet-spark-worker.
- Run the following
spark-submit
(run from the root of the repo):
%SPARK_HOME%\bin\spark-submit.cmd \
--class org.apache.spark.sql.App src\scala\target\sparkdotnetudf-1.0-SNAPSHOT.jar \
AAAAA1JvdwAAAANSb3cAAAABTgAABnsAAQAAAP////8BAAAAAAAAAAwCAAAAUk1pY3Jvc29mdC5TcGFyaywgVmVyc2lvbj0wLjQuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPWNjN2IxM2ZmY2QyZGRkNTEFAQAAADFNaWNyb3NvZnQuU3BhcmsuVXRpbHMuQ29tbWFuZFNlckRlK1VkZldyYXBwZXJEYXRhAgAAACA8VWRmV3JhcHBlck5vZGVzPmtfX0JhY2tpbmdGaWVsZBU8VWRmcz5rX19CYWNraW5nRmllbGQEBDNNaWNyb3NvZnQuU3BhcmsuVXRpbHMuQ29tbWFuZFNlckRlK1VkZldyYXBwZXJOb2RlW10CAAAAKE1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStVZGZEYXRhW10CAAAAAgAAAAkDAAAACQQAAAAHAwAAAAABAAAAAQAAAAQxTWljcm9zb2Z0LlNwYXJrLlV0aWxzLkNvbW1hbmRTZXJEZStVZGZXcmFwcGVyTm9kZQIAAAAJBQAAAAcEAAAAAAEAAAABAAAABCZNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrVWRmRGF0YQIAAAAJBgAAAAUFAAAAMU1pY3Jvc29mdC5TcGFyay5VdGlscy5Db21tYW5kU2VyRGUrVWRmV3JhcHBlck5vZGUDAAAAGTxUeXBlTmFtZT5rX19CYWNraW5nRmllbGQcPE51bUNoaWxkcmVuPmtfX0JhY2tpbmdGaWVsZBc8SGFzVWRmPmtfX0JhY2tpbmdGaWVsZAEAAAgBAgAAAAYHAAAA/wFNaWNyb3NvZnQuU3BhcmsuU3FsLlBpY2tsaW5nVWRmV3JhcHBlcmAyW1tTeXN0ZW0uU3RyaW5nLCBTeXN0ZW0uUHJpdmF0ZS5Db3JlTGliLCBWZXJzaW9uPTQuMC4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49N2NlYzg1ZDdiZWE3Nzk4ZV0sW1N5c3RlbS5TdHJpbmcsIFN5c3RlbS5Qcml2YXRlLkNvcmVMaWIsIFZlcnNpb249NC4wLjAuMCwgQ3VsdHVyZT1uZXV0cmFsLCBQdWJsaWNLZXlUb2tlbj03Y2VjODVkN2JlYTc3OThlXV0BAAAAAQUGAAAAJk1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStVZGZEYXRhAwAAABk8VHlwZURhdGE+a19fQmFja2luZ0ZpZWxkGzxNZXRob2ROYW1lPmtfX0JhY2tpbmdGaWVsZBs8VGFyZ2V0RGF0YT5rX19CYWNraW5nRmllbGQEAQQnTWljcm9zb2Z0LlNwYXJrLlV0aWxzLlVkZlNlckRlK1R5cGVEYXRhAgAAAClNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrVGFyZ2V0RGF0YQIAAAACAAAACQgAAAAGCQAAAAw8TWFpbj5iX18wXzAJCgAAAAUIAAAAJ01pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStUeXBlRGF0YQMAAAAVPE5hbWU+a19fQmFja2luZ0ZpZWxkHTxBc3NlbWJseU5hbWU+a19fQmFja2luZ0ZpZWxkITxBc3NlbWJseUZpbGVOYW1lPmtfX0JhY2tpbmdGaWVsZAEBAQIAAAAGCwAAABlVZGZTZXJpYWxpemVyLlByb2dyYW0rPD5jBgwAAABEVWRmU2VyaWFsaXplciwgVmVyc2lvbj0xLjAuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPW51bGwGDQAAABFVZGZTZXJpYWxpemVyLmRsbAUKAAAAKU1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStUYXJnZXREYXRhAgAAABk8VHlwZURhdGE+a19fQmFja2luZ0ZpZWxkFzxGaWVsZHM+a19fQmFja2luZ0ZpZWxkBAQnTWljcm9zb2Z0LlNwYXJrLlV0aWxzLlVkZlNlckRlK1R5cGVEYXRhAgAAACpNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrRmllbGREYXRhW10CAAAAAgAAAAkOAAAACgEOAAAACAAAAAkLAAAACQwAAAAGEQAAABFVZGZTZXJpYWxpemVyLmRsbAs= \
"\"string\"" \
c:\dotnet-spark-worker\Microsoft.Spark.Worker-0.12.1\Microsoft.Spark.Worker.exe \
%SPARK_HOME%\examples\src\main\resources\people.json
- This should output the following:
+---------------+
| udf_name(name)|
+---------------+
|my udf: Michael|
| my udf: Andy|
| my udf: Justin|
+---------------+
- Check
/src/python/main.py
to see how the C# UDF is injected. - Copy the
UdfSerializer.dll
built in 3) to the current working directory (run from the root of the repo). This step is important since this dll will contatin the IL bytes for the UDF defined.
cp src\csharp\UdfSerializer\bin\Debug\netcoreapp3.1\UdfSerializer.dll .
- Download and unzip the worker binaries from https://github.com/dotnet/spark/releases/tag/v0.12.1. I used .NET Core for Windows for this example and unzipped to c:\dotnet-spark-worker.
- Run the following
spark-submit
(run from the root of the repo):
%SPARK_HOME%\bin\spark-submit.cmd \
main.py \
AAAAA1JvdwAAAANSb3cAAAABTgAABnsAAQAAAP////8BAAAAAAAAAAwCAAAAUk1pY3Jvc29mdC5TcGFyaywgVmVyc2lvbj0wLjQuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPWNjN2IxM2ZmY2QyZGRkNTEFAQAAADFNaWNyb3NvZnQuU3BhcmsuVXRpbHMuQ29tbWFuZFNlckRlK1VkZldyYXBwZXJEYXRhAgAAACA8VWRmV3JhcHBlck5vZGVzPmtfX0JhY2tpbmdGaWVsZBU8VWRmcz5rX19CYWNraW5nRmllbGQEBDNNaWNyb3NvZnQuU3BhcmsuVXRpbHMuQ29tbWFuZFNlckRlK1VkZldyYXBwZXJOb2RlW10CAAAAKE1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStVZGZEYXRhW10CAAAAAgAAAAkDAAAACQQAAAAHAwAAAAABAAAAAQAAAAQxTWljcm9zb2Z0LlNwYXJrLlV0aWxzLkNvbW1hbmRTZXJEZStVZGZXcmFwcGVyTm9kZQIAAAAJBQAAAAcEAAAAAAEAAAABAAAABCZNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrVWRmRGF0YQIAAAAJBgAAAAUFAAAAMU1pY3Jvc29mdC5TcGFyay5VdGlscy5Db21tYW5kU2VyRGUrVWRmV3JhcHBlck5vZGUDAAAAGTxUeXBlTmFtZT5rX19CYWNraW5nRmllbGQcPE51bUNoaWxkcmVuPmtfX0JhY2tpbmdGaWVsZBc8SGFzVWRmPmtfX0JhY2tpbmdGaWVsZAEAAAgBAgAAAAYHAAAA/wFNaWNyb3NvZnQuU3BhcmsuU3FsLlBpY2tsaW5nVWRmV3JhcHBlcmAyW1tTeXN0ZW0uU3RyaW5nLCBTeXN0ZW0uUHJpdmF0ZS5Db3JlTGliLCBWZXJzaW9uPTQuMC4wLjAsIEN1bHR1cmU9bmV1dHJhbCwgUHVibGljS2V5VG9rZW49N2NlYzg1ZDdiZWE3Nzk4ZV0sW1N5c3RlbS5TdHJpbmcsIFN5c3RlbS5Qcml2YXRlLkNvcmVMaWIsIFZlcnNpb249NC4wLjAuMCwgQ3VsdHVyZT1uZXV0cmFsLCBQdWJsaWNLZXlUb2tlbj03Y2VjODVkN2JlYTc3OThlXV0BAAAAAQUGAAAAJk1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStVZGZEYXRhAwAAABk8VHlwZURhdGE+a19fQmFja2luZ0ZpZWxkGzxNZXRob2ROYW1lPmtfX0JhY2tpbmdGaWVsZBs8VGFyZ2V0RGF0YT5rX19CYWNraW5nRmllbGQEAQQnTWljcm9zb2Z0LlNwYXJrLlV0aWxzLlVkZlNlckRlK1R5cGVEYXRhAgAAAClNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrVGFyZ2V0RGF0YQIAAAACAAAACQgAAAAGCQAAAAw8TWFpbj5iX18wXzAJCgAAAAUIAAAAJ01pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStUeXBlRGF0YQMAAAAVPE5hbWU+a19fQmFja2luZ0ZpZWxkHTxBc3NlbWJseU5hbWU+a19fQmFja2luZ0ZpZWxkITxBc3NlbWJseUZpbGVOYW1lPmtfX0JhY2tpbmdGaWVsZAEBAQIAAAAGCwAAABlVZGZTZXJpYWxpemVyLlByb2dyYW0rPD5jBgwAAABEVWRmU2VyaWFsaXplciwgVmVyc2lvbj0xLjAuMC4wLCBDdWx0dXJlPW5ldXRyYWwsIFB1YmxpY0tleVRva2VuPW51bGwGDQAAABFVZGZTZXJpYWxpemVyLmRsbAUKAAAAKU1pY3Jvc29mdC5TcGFyay5VdGlscy5VZGZTZXJEZStUYXJnZXREYXRhAgAAABk8VHlwZURhdGE+a19fQmFja2luZ0ZpZWxkFzxGaWVsZHM+a19fQmFja2luZ0ZpZWxkBAQnTWljcm9zb2Z0LlNwYXJrLlV0aWxzLlVkZlNlckRlK1R5cGVEYXRhAgAAACpNaWNyb3NvZnQuU3BhcmsuVXRpbHMuVWRmU2VyRGUrRmllbGREYXRhW10CAAAAAgAAAAkOAAAACgEOAAAACAAAAAkLAAAACQwAAAAGEQAAABFVZGZTZXJpYWxpemVyLmRsbAs= \
"\"string\"" \
c:\dotnet-spark-worker\Microsoft.Spark.Worker-0.12.1\Microsoft.Spark.Worker.exe \
%SPARK_HOME%\examples\src\main\resources\people.json
- This should output the following:
+---------------+
| udf_name(name)|
+---------------+
|my udf: Michael|
| my udf: Andy|
| my udf: Justin|
+---------------+