This repository contains implementations of the High-Energy Physics (HEP) analysis queries from the IRIS HEP benchmark, written in SQL to be run on Google BigQuery.
The purpose of this repository is to study the suitability of SQL for HEP analyses and to serve as a use case for improving database technologies. While SQL has often been considered ill-suited for HEP analyses, we believe that the support for arrays and structured types introduced in SQL:1999 actually makes SQL a rather good fit. As a high-level declarative language, it has the potential to bring many of its benefits to HEP, including transparent query optimization, transparent caching, and a separation of the logical level (in the form of the data model and queries) from the physical level (in the form of storage formats and performance optimizations).
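To make this concrete, here is a hedged sketch of what such a query can look like once the 1000-event view from the instructions below exists. The exact column and field names of the data model are defined by the views in this repository; the `Muon` array-of-structs column used here is an assumption for illustration.

```bash
# Hypothetical sketch: count the events that contain at least two muons,
# assuming a "native objects" view where each row is one event and `Muon`
# is an ARRAY<STRUCT<...>> column (names as defined by this repository's views).
bq query --use_legacy_sql=false '
SELECT COUNT(*) AS n_events
FROM iris_hep_benchmark_data.Run2012B_SingleMu_1000_view
WHERE ARRAY_LENGTH(Muon) >= 2'
```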
- Install Python 3 with pip.
- Install the Python requirements:
```bash
pip3 install -r requirements.txt
```
- Create an account for Google Cloud (you may need to set up payment information, but the free tier should be more than enough for the queries in this repository).
- Set up a project.
- Install Google Cloud SDK and configure the default project (or be prepared to set up BigQuery through the web interface).
The benchmark defines a data set in the ROOT format, which is not supported by BigQuery. However, the Rumble implementation of the benchmark provides scripts to convert the data to Parquet, which BigQuery can either query in place or load into its internal representation.
Two versions are supported: "shredded objects" (where lists of particles within an event are stored as several arrays of numeric types, one for each attribute of the particles) and "native objects" (where such lists are stored as a single array of structs). Note that this only affects the data model exposed to queries; the physical representation of the two versions in Parquet is (almost) identical. Also, there is only a single set of queries, which assumes the "native objects" representation; a view (described below) can expose data in the "shredded objects" form as "native objects".
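The following hedged sketch shows the core idea behind such a view (the actual mapping lives in `view-shredded.sql`, which is authoritative): the parallel per-attribute arrays of the "shredded objects" form are zipped into a single array of structs. The column names are illustrative, and `dataset_id.table_name` is the same placeholder used by the commands below.

```bash
# Sketch only: expose "shredded objects" (parallel arrays such as Muon_pt,
# Muon_eta) as "native objects" (one ARRAY<STRUCT<...>> column per particle
# collection). See view-shredded.sql for the real mapping.
bq query --use_legacy_sql=false '
SELECT
  ARRAY(
    SELECT AS STRUCT
      Muon_pt[OFFSET(i)]  AS pt,
      Muon_eta[OFFSET(i)] AS eta
    FROM UNNEST(GENERATE_ARRAY(0, ARRAY_LENGTH(Muon_pt) - 1)) AS i
  ) AS Muon
FROM dataset_id.table_name
LIMIT 10'
```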
For convenience, this repository contains a sample of the first 1000 events of the full data set in both representations in the `data/` folder.
The following steps load the data into tables stored in BigQuery's internal storage layer.
- Set up a bucket and upload the Parquet file(s):
```bash
gsutil mb -l europe-west6 gs://my-iris-hep-bucket   # Adapt the bucket name
gsutil cp data/*.parquet gs://my-iris-hep-bucket/
```
- Create a "data set" (roughly speaking a namespace for tables):
```bash
bq --location=europe-west6 mk \
    --dataset \
    --description "IRIS HEP Benchmark Data" \
    iris_hep_benchmark_data
```
- Load a Parquet file into a new table:
```bash
bq load \
    --source_format=PARQUET \
    iris_hep_benchmark_data.Run2012B_SingleMu_1000 \
    "gs://my-iris-hep-bucket/Run2012B_SingleMu-1000-restructured.parquet"
```
- Modify the following command to reflect your data set ID and the table you just loaded, then create a view:
```bash
bq mk \
    --use_legacy_sql=false \
    --view "$(sed "s/dataset_id.table_name/iris_hep_benchmark_data.Run2012B_SingleMu_1000/" view-native.sql)" \
    iris_hep_benchmark_data.Run2012B_SingleMu_1000_view
```
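As an optional sanity check (table and view names as above), the view over the 1000-event sample should return one row per event:

```bash
# Expect n_events = 1000 for the sample loaded above.
bq query --use_legacy_sql=false \
    'SELECT COUNT(*) AS n_events
     FROM iris_hep_benchmark_data.Run2012B_SingleMu_1000_view'
```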
The following steps create external tables that process the Parquet files directly off cloud storage.
- Set up the bucket, data set, and files as described above if you haven't done so yet.
- Create a data definition file called `external_1000.json` with the following content:

```json
{
  "sourceFormat": "PARQUET",
  "sourceUris": [
    "gs://my-iris-hep-bucket/Run2012B_SingleMu-1000-restructured.parquet"
  ]
}
```
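Instead of writing this file by hand, you can let `bq mkdef` generate an equivalent definition (a sketch, with the bucket and file names used above):

```bash
# Generate the external table definition from the Parquet file on GCS.
bq mkdef --source_format=PARQUET \
    "gs://my-iris-hep-bucket/Run2012B_SingleMu-1000-restructured.parquet" \
    > external_1000.json
```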
- Create an external table:
```bash
bq mk \
    --external_table_definition=external_1000.json \
    iris_hep_benchmark_data.Run2012B_SingleMu_restructured_external_1000
```
- Create a corresponding view as described above.
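For example, mirroring the earlier view command (the view name here is just a suggestion):

```bash
# Same sed substitution as above, pointing the view at the external table.
bq mk \
    --use_legacy_sql=false \
    --view "$(sed "s/dataset_id.table_name/iris_hep_benchmark_data.Run2012B_SingleMu_restructured_external_1000/" view-native.sql)" \
    iris_hep_benchmark_data.Run2012B_SingleMu_restructured_external_1000_view
```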
The following creates an internal table; creating an external table works analogously to the process described above.
- Set up the bucket and data set as described above if you haven't done so yet.
- Load a Parquet file into a new table:
```bash
bq load \
    --source_format=PARQUET \
    iris_hep_benchmark_data.Run2012B_SingleMu_shredded_1000 \
    "gs://my-iris-hep-bucket/Run2012B_SingleMu-1000.parquet"
```
- Modify the following command to reflect your data set ID and the table you just loaded, then create a view:
```bash
bq mk \
    --use_legacy_sql=false \
    --view "$(sed "s/dataset_id.table_name/iris_hep_benchmark_data.Run2012B_SingleMu_shredded_1000/" view-shredded.sql)" \
    iris_hep_benchmark_data.Run2012B_SingleMu_shredded_1000_view
```
By default, `test_queries.py` looks for tables (or views) of the form `Run2012B_SingleMu{suffix1}`, where `{suffix1}` is empty for the full data set and `_{num_events}` for a sample of `{num_events}` events. A command-line option allows overriding the default name (see below). The script also looks for reference results in `queries/{query_name}/ref{suffix2}.csv`, where `{suffix2}` is empty for the full data set and `-{num_events}` for samples. For the 1000-event sample above, for instance, it expects a table (or view) called `Run2012B_SingleMu_1000` and reference files `queries/{query_name}/ref-1000.csv`.
Queries are run through `test_queries.py`. Run the following command to see its options:
```
$ ./test_queries.py --help
usage: test_queries.py [options] [file_or_dir] [file_or_dir] [...]
...
custom options:
  --query-id=QUERY_ID   run all combinations
  -N NUM_EVENTS, --num-events=NUM_EVENTS
                        Number of events taken from the input table. This
                        influences which reference file should be taken.
  -P BIGQUERY_DATASET, --bigquery-dataset=BIGQUERY_DATASET
                        Name of dataset in BigQuery.
  -I INPUT_TABLE, --input-table=INPUT_TABLE
                        Name of input table.
  --freeze-result       Overwrite reference result.
  --plot-histogram      Plot resulting histogram as PNG file.
```
For example, the following command runs queries `6-1` and `6-2` against the view for the sample of 1000 events created above:

```bash
./test_queries.py -vs --bigquery-dataset iris_hep_benchmark_data \
    --num-events 1000 --input-table Run2012B_SingleMu_1000_view \
    -k query-6
```

The script is built on pytest, so standard pytest options work as usual; here, `-k query-6` selects all queries whose names contain `query-6`, and `-vs` enables verbose output without capturing stdout.