G-Research/spark-dgraph-connector

Add performance data source


Add a data source that does not read the actual data but provides performance metrics instead. Each partition sends a query to the Dgraph cluster and, besides the data, also retrieves these metrics:

  "extensions": {
    "server_latency": {
      "parsing_ns": 78501,
      "processing_ns": 881611,
      "encoding_ns": 110785,
      "total_ns": 1145597
    },
    "txn": {
      "start_ts": 10007
    },
    "metrics": {
      "num_uids": {
        "dgraph.graphql.schema": 10,
        "dgraph.type": 10,
        "director": 10,
        "name": 10,
        "release_date": 10,
        "revenue": 10,
        "running_time": 10,
        "starring": 10,
        "uid": 16
      }
    }
  }

The performance data source can encode this information (together with information from the TaskContext and the individual partitions) into the DataFrame instead of the actual query result. This allows benchmarking tools to measure per-partition timings and cardinality information and to write them to disk via Spark.
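A minimal sketch of what such a per-partition performance row could contain, assuming the metrics above are combined with Spark's `TaskContext`; all class, field and method names below are illustrative assumptions, not the connector's actual API:

```scala
import org.apache.spark.TaskContext

// Hypothetical row produced by the performance data source: one row per
// partition query, mirroring the "extensions" object shown above plus the
// task/partition information available from Spark's TaskContext.
case class PerfRow(
  // partition / task information from TaskContext
  stageId: Int,
  partitionId: Int,
  taskAttemptId: Long,
  // Dgraph "server_latency" metrics in nanoseconds
  parsingNs: Long,
  processingNs: Long,
  encodingNs: Long,
  totalNs: Long,
  // "metrics.num_uids" cardinality information, keyed by predicate
  numUids: Map[String, Long]
)

// Hypothetical helper a partition reader could call after executing its
// query, once the extensions have been parsed into the individual values.
def perfRow(parsingNs: Long, processingNs: Long, encodingNs: Long,
            totalNs: Long, numUids: Map[String, Long]): PerfRow = {
  val tc = TaskContext.get()
  PerfRow(tc.stageId(), tc.partitionId(), tc.taskAttemptId(),
          parsingNs, processingNs, encodingNs, totalNs, numUids)
}
```

A benchmark run could then read the performance source like any other source of this connector and persist the resulting DataFrame with Spark; the `perf` format alias used here is only a placeholder, not an existing source name:

```scala
// Hypothetical usage: read per-partition performance metrics and write them
// to disk for later analysis.
val perf = spark.read
  .format("uk.co.gresearch.spark.dgraph.perf")
  .load("localhost:9080")
perf.write.parquet("dgraph-benchmark-metrics")
```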