Coffea is a columnar analysis framework with an interface similar to common Python packages like numpy and pandas. This repository is an implementation of the single stop analysis written in Coffea.
Begin by cloning and entering the repository.
git clone git@github.com:UMN-CMS/SingleStopCoffea.git
cd SingleStopCoffea
We are using a container-based system, with different containers for different analysis stages (in the future we may build a custom container with everything we need).
For basic analysis, use the coffea mode, which includes dask and coffea to produce histograms.
source setup.sh coffea
If performing post-processing, use the torch mode, which includes tools for performing Gaussian process regression.
source setup.sh torch
If you are not using a CMS machine, you can still try to install the analyzer locally by creating a virtual environment, then running
python3 -m pip install .
The entrypoint for the program is the command
python3 -m analyzer
For more information on using the program, run
python3 -m analyzer --help
The command below runs a basic analysis on the signal_312_2000_1900 mass point, writing its output to myoutput.pkl. This analysis will run locally.
python3 -m analyzer run -o "myoutput.pkl" -m objects baseline_selection dataset_category event_level jets -s signal_312_2000_1900
To list the available samples and modules, run
python3 -m analyzer samples
python3 -m analyzer modules
To analyze more substantial datasets, we need to take advantage of distributed computing resources.
The cluster-start subcommand provides the ability to start a cluster. The command below starts a cluster of 20 HTCondor worker nodes on the LPC.
python3 -m analyzer cluster-start -t lpcondor -s :10005 -d localhost:8787 -m "2.0GB" -w 20
The analyzer is composed of modules, which may be chained together to perform an analysis. In general, an analyzer module does one of three things: filter events, generate new columns, or produce histograms.
All modules are located in analyzer/modules/. A module is a function with the following signature:
def myAnalysisModule(events, analyzer):
    ...
    return events, analyzer
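For instance, one of the three kinds of modules above, an event filter, might look like the following sketch. The Jet collection name and the four-jet cut are illustrative (NanoAOD-style), and a real module may instead register its selection through the analyzer object so that weights and cutflows stay in sync:
import awkward as ak

def requireFourJets(events, analyzer):
    # Illustrative filter: keep only events with at least four jets.
    # "Jet" is a NanoAOD-style collection name; the cut value is made up.
    events = events[ak.num(events.Jet, axis=1) >= 4]
    return events, analyzer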
In this function, events is the current set of events being analyzed, possibly after being filtered and manipulated by previous modules. The analyzer object keeps track of the current state of the analysis between modules, and provides functions for making histograms, filtering, etc.
To make a module known to the framework, simply decorate it with the @analyzerModule decorator. For example:
@analyzerModule("mymodulename", categories="main", depends_on=["objects"])
def myAnalysisModule(events, analyzer):
    ...
    return events, analyzer
The first argument to the decorator is the identifier of the module, and is the name used to select it from the command line. The categories and depends_on arguments enforce an ordering between modules. The depends_on keyword specifies that this module must run after the modules whose names are given as its argument.
Many modules depend on the objects module to provide them with higher-level objects, like events.good_jets, events.loose_b, etc.
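As a sketch, a module using these collections to add a new column might look like this (the module and column names are made up for illustration; analyzerModule is the framework decorator described above):
import awkward as ak

@analyzerModule("loosebcount", categories="main", depends_on=["objects"])
def addLooseBCount(events, analyzer):
    # Count the loose b-tagged jets provided by the objects module and
    # store the per-event count as a new column.
    events["n_loose_b"] = ak.num(events.loose_b, axis=1)
    return events, analyzer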
The category is a higher-level ordering between modules, rather than a dependency. In general, all new modules should be of the main category, which ensures correct ordering for categorization, etc.
You can see a complete list of the available modules by calling
python3 -m analyzer modules
The second argument to each analyzer module is a special object of type DatasetProcessor, used to manage the state of the analysis between modules. For most users, the only use of this object will be through the DatasetProcessor.H(histogram_name, axis_names, data) function, which is used to create histograms.
For example, the module below creates a histogram of the pt of the second jet in each event:
@analyzerModule("simplejet", categories="main", depends_on=["objects", "event_level"])
def makeSimpleJetHistogram(events, analyzer):
gj = events.good_jets
analyzer.H("pt_2",
makeAxis(60, 0, 3000, f"$p_{{T, 2}}$", unit="GeV"),
events.good_jets[:, 1].pt)
return events,analyzer
This also showcases the makeAxis function, which can be used to quickly create a labeled axis. The unit argument can be used by plotting scripts to ensure correct labeling of the count axis.
You can also pass in a list of axes and a list of data to create a multidimensional histogram. Note: if your data object is masked, you must pass this mask using the mask keyword argument to ensure that the weights are properly adjusted.
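For instance, a two-dimensional version of the jet histogram above might look like the following sketch. The module and histogram names are made up, and the list-based call and the mask keyword follow the description above:
import awkward as ak

@analyzerModule("jets2d", categories="main", depends_on=["objects", "event_level"])
def makeJet2DHistogram(events, analyzer):
    gj = events.good_jets
    # Events with fewer than two good jets are masked out; the mask is also
    # passed to H so the per-event weights are adjusted to match.
    mask = ak.num(gj, axis=1) >= 2
    gj = gj[mask]
    analyzer.H(
        "pt_1_vs_pt_2",
        [
            makeAxis(60, 0, 3000, "$p_{T, 1}$", unit="GeV"),
            makeAxis(60, 0, 3000, "$p_{T, 2}$", unit="GeV"),
        ],
        [gj[:, 0].pt, gj[:, 1].pt],
        mask=mask,
    )
    return events, analyzer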
Coffea works very nicely with Jupyter notebooks. Chances are you want to run the notebook on a remote node but view it in your local browser. This is accomplished by forwarding a port from your local machine to the remote machine using the ssh command below, which forwards local port 5000 to port 8999 on the remote host.
ssh -N -L 5000:localhost:8999 username@remotehost
On the remote host, start the notebook by running the following from within the Python environment:
start_jupyter
You can pick any port for the remote machine, and you may need to if someone else is using that port. Once the notebook is running, open localhost:5000 (or whichever local port you chose) in your local browser.
The notebook notebooks/example.ipynb shows an example of doing basic analysis within a notebook.