beam-tpcds-benchmarks

This project provides tooling to benchmark Beam runners with Beam's TPC-DS benchmark suite under different configurations. The goal is to gather metrics beyond those published at http://metrics.beam.apache.org.

Metrics for individual queries are parsed from the benchmark output. Overall metrics for an entire benchmark run (involving multiple queries) are gathered using the time command.
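As a rough illustration of how overall timings could be extracted, here is a minimal sketch that assumes the benchmark was wrapped in `time -p`, which prints `real`, `user` and `sys` lines in the POSIX format. The helper name `parse_time_p` and the sample values are made up for this example; the notebooks in this repo may collect and parse the timings differently.

```python
import re


def parse_time_p(stderr_text: str) -> dict[str, float]:
    """Extract real/user/sys seconds from POSIX `time -p` style output.

    Lines that do not look like `<label> <seconds>` are ignored, so the
    text may contain other log output as well.
    """
    metrics = {}
    for line in stderr_text.splitlines():
        match = re.fullmatch(r"\s*(real|user|sys)\s+([0-9]+(?:\.[0-9]+)?)\s*", line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


# Invented sample text, only for illustration (not a measured result).
sample = "some benchmark log line\nreal 312.45\nuser 845.10\nsys 21.07\n"
overall = parse_time_p(sample)
print(overall)
```

The resulting dictionary can be turned into a pandas row per benchmark run if you want to compare configurations side by side.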

Instructions

Configure a Python environment to run Jupyter notebooks, e.g. using conda:

conda config --add channels conda-forge
conda create python=3.9 -n jupyter-py39
conda activate jupyter-py39
conda install jupyter pyspark ipykernel nb_conda seaborn vega pandas

Check out the code of Beam and of this repo. Note: you will also need a copy of Beam's tpcds input data.
Configure config.ipynb accordingly.

Additionally, add the following task to the build.gradle of Beam's tpcds module to export the classpath. The task is only registered if the target property is set, e.g. via -Ptarget=<directory>.

if(project.hasProperty("target")) {
    task cpcopy(type: Copy) {
        dependsOn classes
        from configurations.gradleRun.asFileTree.files
        into file(project.findProperty("target")) // destination directory, passed via -Ptarget=<directory>
        include '**/*.jar'
    }
}
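
To show how the exported jars might be used, here is a minimal sketch that builds the Gradle invocation for the cpcopy task and then assembles a `java -cp ...` command from the copied jars. The helper names, the example directory `/tmp/tpcds_classpath` and passing `-Ptpcds.runner` to cpcopy are assumptions for illustration; the main class is left as a parameter and has to be looked up in the tpcds module's build.gradle.

```python
import os
import shlex
from pathlib import Path

# Gradle path of Beam's tpcds module, as used by the run command in Troubleshooting.
TPCDS_PROJECT = ":sdks:java:testing:tpcds"


def cpcopy_command(target_dir, runner_project):
    """Build the Gradle invocation that exports the jars via the cpcopy task.

    Assumption: the runner is selected with -Ptpcds.runner, as for the run task.
    """
    return [
        "gradle",
        f"{TPCDS_PROJECT}:cpcopy",
        f"-Ptpcds.runner={runner_project}",
        f"-Ptarget={target_dir}",
    ]


def java_command(jar_dir, main_class, args):
    """Build a java command whose classpath lists every jar below jar_dir."""
    jars = sorted(str(path) for path in Path(jar_dir).rglob("*.jar"))
    if not jars:
        raise FileNotFoundError(f"no jars found in {jar_dir}; run the cpcopy task first")
    # os.pathsep is ':' on Linux/macOS and ';' on Windows.
    return ["java", "-cp", os.pathsep.join(jars), main_class, *args]


# Hypothetical target directory; only prints the command, nothing is executed.
print(shlex.join(cpcopy_command("/tmp/tpcds_classpath", ":runners:flink:1.16")))
```

Keeping both helpers as plain lists of arguments makes it easy to pass them to `subprocess.run` from a notebook cell without extra shell quoting.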

Run your benchmarks using the provided notebook.

Troubleshooting

Exporting the classpath using the task above doesn't always work reliably. If you see a ClassNotFoundException, try running tpcds from gradle first.

gradle :sdks:java:testing:tpcds:run -Ptpcds.runner=":runners:flink:1.16" -Ptpcds.args=" \
  --runner=FlinkRunner \
  --queries=3,7,10 \
  --tpcParallel=1 \
  --sourceType=PARQUET \
  --dataDirectory=/tmp/tpcds/nonpartitioned \
  --dataSize=1GB \
  --resultsDirectory=/tmp/tpcds_results/"
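
If you want to repeat the Gradle run above for several configurations, the command can also be generated. The following is a minimal sketch; the helper `gradle_run_command` is hypothetical and only mirrors the flags of the command shown above, and running one query per invocation is just an example of varying a setting.

```python
import shlex


def gradle_run_command(runner_project, pipeline_args):
    """Build the Gradle run invocation from a runner project and pipeline options."""
    tpcds_args = " ".join(f"--{key}={value}" for key, value in pipeline_args.items())
    return [
        "gradle",
        ":sdks:java:testing:tpcds:run",
        f"-Ptpcds.runner={runner_project}",
        f"-Ptpcds.args={tpcds_args}",
    ]


# Options copied from the command above.
base_args = {
    "runner": "FlinkRunner",
    "queries": "3,7,10",
    "tpcParallel": 1,
    "sourceType": "PARQUET",
    "dataDirectory": "/tmp/tpcds/nonpartitioned",
    "dataSize": "1GB",
    "resultsDirectory": "/tmp/tpcds_results/",
}

# Print one command per query instead of a single run over all three.
for queries in ("3", "7", "10"):
    cmd = gradle_run_command(":runners:flink:1.16", {**base_args, "queries": queries})
    print(shlex.join(cmd))
```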
