
Druid gRPC data transfer extension


gRPC Druid extension PoC

For a sophisticated engine dealing with voluminous data, Druid lacks an efficient means of retrieving query results. Judging from a rejected issue, I thought it was a matter of policy. I wanted to see what it takes to actually implement it.

When I ran into transitive dependency conflicts and asked politely :), the kind folks on the Druid dev mailing list shared another, similar extension. I borrowed their workaround for the Guava version conflict, based on shading the Guava classes and manually setting a classloader during initialization. The rest of that extension is not exactly what I have in mind, so this PoC still makes sense to me.

There are two libraries in this project:

  • druid-grpc-rowbatch is a library supporting efficient row encoding with protobuf
  • druid-grpc is the actual Druid extension that can be plugged into Druid to provide a gRPC network endpoint and completely bypass JSON over HTTP
Key techniques

There's more than one way to use protobuf to represent data rows. I based the first iteration on a few ideas that are traditional in analytics query engines.

  • columnar formats - try a columnar layout for data structures first, falling back to row-oriented only if that fails
  • micro-batching - never send a single row over the wire; batching amortizes serialization and latency costs
  • dictionary encoding of dimension values - the actual strings matter only in the UI; storage, data transfer, and common operations such as comparison are much more efficient with integer types
  • collections library with support for primitive numeric types - there are Java collection libraries that avoid the autoboxing penalty
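As an illustration of the dictionary-encoding idea, here is a minimal, dependency-free sketch (my own, not the extension's actual API) of a dictionary-encoded string column: each distinct dimension value is assigned an int code, so a micro-batch can ship the dictionary once plus a compact array of codes instead of repeated strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a dictionary-encoded string column for one micro-batch.
// The dictionary is sent once per batch; row values become small integers.
public class DictionaryEncodedColumn {
    private final List<String> dictionary = new ArrayList<>();
    private final Map<String, Integer> codes = new HashMap<>();
    private final List<Integer> encoded = new ArrayList<>();

    public void add(String value) {
        // Assign the next code on first sight of a value, reuse it afterwards.
        Integer code = codes.get(value);
        if (code == null) {
            code = dictionary.size();
            dictionary.add(value);
            codes.put(value, code);
        }
        encoded.add(code);
    }

    public List<String> dictionary() {
        return dictionary;
    }

    public int[] encoded() {
        return encoded.stream().mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        DictionaryEncodedColumn col = new DictionaryEncodedColumn();
        for (String v : new String[]{"US", "DE", "US", "FR", "DE", "US"}) {
            col.add(v);
        }
        System.out.println(col.dictionary());               // [US, DE, FR]
        System.out.println(Arrays.toString(col.encoded())); // [0, 1, 0, 2, 1, 0]
    }
}
```

A real implementation would put the dictionary and the code array into a protobuf message, and comparisons on the receiving side can work directly on the int codes.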
Alternatives
  • Avro over gRPC - too generic for a first iteration; might happen later
  • Arrow / Flight - very new, the RPC part is under-documented, and it is not used by Druid core anyway (and so promises more transitive dependency whack-a-mole fun)
  • Avatica - it seems to have a binary, protobuf-based transport, but there's not enough documentation, and the Calcite integration is complicated enough that it will take time to grok
Realistic usage example

Please see the druid-forecast module README file for a PoC application using this approach as a client.

Running locally

Executing DruidGrpcQueryRunnerTest is the easiest way to see this extension working with little hassle.

Otherwise, build both modules locally with mvn clean install. Then follow the instructions from the next section.

Druid extension configuration

Assuming the official Druid tutorial setup is in place:

  • copy the locally built extension uber JAR into the Druid extensions directory
  • edit the Druid configuration to enable and configure the extension
  • start up Druid with the tutorial configuration
  • run DruidClientTest and enjoy the output of tail -f $DRUID_HOME/var/sv/broker.log
mkdir -p $DRUID_HOME/extensions/druid-grpc/
cp druid-grpc-0.20.0-SNAPSHOT.jar $DRUID_HOME/extensions/druid-grpc/

cd $DRUID_HOME
vi $DRUID_HOME/conf/druid/single-server/micro-quickstart/broker/runtime.properties

# add to runtime.properties:
druid.extensions.loadList=["druid-grpc"]
druid.grpc.enable=true
druid.grpc.port=20001
druid.grpc.numHandlerThreads=8
druid.grpc.numServerThreads=4

./bin/start-micro-quickstart
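Before pointing a client at the broker, it can be handy to check that the configured gRPC port is actually accepting connections. A small helper of my own (not part of the extension) using only the JDK:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sanity check: attempt a plain TCP connect to the configured gRPC port.
public class PortCheck {
    public static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 20001 is druid.grpc.port from the broker runtime.properties above
        System.out.println(isListening("localhost", 20001, 1000));
    }
}
```

If this prints false, check the broker log for extension loading errors before debugging the client.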