PyROOT Parallelization with Spark

This project explores the Spark parallelization of ROOT analysis, in particular through the ROOT Python interface (PyROOT).

The parallelization strategy applies the map-reduce pattern to the processing of a ROOT TTree. In the map phase, each mapper reads and processes a sub-range of TTree entries and produces a partial result; the reduce phase then combines all the partial outputs into a final result (e.g. a set of filled histograms).
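To make the pattern concrete, the following sketch emulates it sequentially on a single file, without Spark. The file name myFile1.root, the tree name myTree and the branch px are illustrative assumptions, and entryRanges is a hypothetical helper (DistTree performs this partitioning internally).

import ROOT
from functools import reduce

def entryRanges(nentries, npartitions):
    # Hypothetical helper: split [0, nentries) into npartitions
    # contiguous sub-ranges (DistTree does this internally)
    bounds = [i * nentries // npartitions for i in range(npartitions + 1)]
    return list(zip(bounds[:-1], bounds[1:]))

def mapper(entryRange):
    # Map phase: process one sub-range of entries and return a partial
    # histogram. File, tree and branch names are illustrative assumptions.
    begin, end = entryRange
    f = ROOT.TFile.Open("myFile1.root")
    tree = f.Get("myTree")
    h = ROOT.TH1F("hPx", "px distribution", 100, -5., 5.)
    h.SetDirectory(0)  # detach from the file so it survives the file closing
    for i in range(begin, end):
        tree.GetEntry(i)
        h.Fill(tree.px)
    f.Close()
    return h

def reducer(h1, h2):
    # Reduce phase: combine two partial results into one
    h1.Add(h2)
    return h1

# Sequential emulation; with Spark, the mappers run in parallel
partials = [mapper(r) for r in entryRanges(1000, 4)]
finalHisto = reduce(reducer, partials)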

The programming model is based on the Python language. The user creates a DistTree object from a list of files containing a TTree and the name of that TTree; the number of partitions (entry sub-ranges) of the TTree can also be specified. To start the parallel processing, the user invokes the ProcessAndMerge function on the DistTree, passing the mapper and reducer functions as parameters. The mapper receives a TTreeReader, a ROOT object that represents a sub-range of entries and can be iterated over.

This code snippet gives an example of how the DistTree class can be used:

# ROOT imports
import ROOT
from DistROOT import DistTree

# Build the DistTree
dTree = DistTree(filelist = ["myFile1", "myFile2"],
                 treename = "myTree",
                 npartitions = 8)

# Trigger the parallel processing
myHistos = dTree.ProcessAndMerge(fillHistos, mergeHistos)
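The mapper and reducer passed to ProcessAndMerge are regular Python functions. A possible shape for them is sketched below; the branch name px and the histogram binning are assumptions, and the exact idiom for dereferencing a TTreeReaderValue may vary across ROOT versions.

import ROOT

def fillHistos(reader):
    # Mapper: the TTreeReader iterates only over this partition's entries.
    # Branch name "px" and the binning are illustrative assumptions.
    px = ROOT.TTreeReaderValue['float'](reader, 'px')
    hPx = ROOT.TH1F('hPx', 'px distribution', 100, -5., 5.)
    while reader.Next():
        hPx.Fill(px.Get()[0])  # read the current value of the px branch
    return [hPx]

def mergeHistos(histos1, histos2):
    # Reducer: merge two lists of partial histograms pairwise
    for h1, h2 in zip(histos1, histos2):
        h1.Add(h2)
    return histos1

Note that the mapper returns a list of histograms and the reducer merges two such lists into one, which matches the final result collected in myHistos above.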

Authors: