SingleStopCoffea

Introduction

Coffea is a columnar analysis framework with an interface similar to common Python packages like numpy and pandas. This repository is an implementation of the single stop analysis written in Coffea.
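
To give a sense of the columnar style, a selection that would be an event loop in a traditional framework becomes a whole-array operation. The snippet below is a minimal illustrative sketch, assuming a NanoEvents-style events array has already been loaded; it is not taken from this repository.

import awkward as ak

# Select jets with pt > 30 GeV across all events at once,
# rather than looping event by event.
good_jets = events.Jet[events.Jet.pt > 30]
n_good = ak.num(good_jets, axis=1)  # number of selected jets per event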

Getting started

Begin by cloning and entering the repository.

git clone git@github.com:UMN-CMS/SingleStopCoffea.git
cd SingleStopCoffea

Running on a CMS Machine

We are using a container-based system, with different containers for different stages of the analysis (in the future we may build a custom container with everything we need).

For basic analysis, use the coffea mode, which includes dask and coffea for producing histograms.

source setup.sh coffea

If performing post-processing, use the torch mode, which includes tools for performing the Gaussian process regression.

source setup.sh torch

Running on a non-CMS Machine

If you are not using a CMS machine, you can still try to install the analyzer locally by creating a virtual environment and then running

python3 -m pip install .
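
If you are unfamiliar with virtual environments, the standard commands look something like the following (assuming a POSIX shell; the environment name venv is arbitrary):

python3 -m venv venv          # create the environment
source venv/bin/activate      # activate it
python3 -m pip install .      # install the analyzer into it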

Running the Program

The entry point for the program is the command

python3 -m analyzer

For more information on the use of the program, call

python3 -m analyzer --help

Examples

Running a Basic Analysis

The command below runs a basic analysis on the signal_312_2000_1900 mass point. This analysis will run locally.

python3 -m analyzer run -o "myoutput.pkl" -m objects baseline_selection dataset_category event_level jets -s signal_312_2000_1900

Viewing Information on Samples

python3 -m analyzer samples

Viewing Information on Modules

python3 -m analyzer modules

Scaling Out

To analyze more substantial datasets, we need to be able to take advantage of distributed computing resources.

Starting a Cluster

The cluster-start subcommand provides the ability to start a cluster. The command below shows how to start a cluster consisting of 20 HTCondor worker nodes on the LPC.

python3 -m analyzer cluster-start -t lpcondor -s :10005 -d localhost:8787 -m "2.0GB" -w 20

Extending the Analyzer

The analyzer is composed of modules, which may be chained together to perform an analysis. In general, an analyzer module does one of three things: filter events, generate new columns, or produce histograms.

All modules are located in analyzer/modules/. A module is a function with the following signature:

def myAnalysisModule(events, analyzer):
    ...  # filter events, add columns, or fill histograms here
    return events, analyzer

In this function, events is the current set of events being analyzed, possibly after being filtered and manipulated by previous modules. The analyzer object keeps track of the current state of the analysis between modules, and provides functions for making histograms, filtering, etc.

To make a module known to the framework, simply decorate it with the @analyzerModule decorator. For example:

@analyzerModule("mymodulename", categories="main", depends_on=["objects"])
def myAnalysisModule(events, analyzer):
    ...
    return events,analyzer

The first argument to the decorator is the identifier of the module, and is the name used to select it from the command line. The categories and depends_on arguments enforce an ordering between modules. The depends_on keyword specifies that this module must run after the modules with the given names. Many modules depend on the objects module to provide them with higher-level objects, like events.good_jets, events.loose_b, etc.

A category imposes a higher-level ordering between modules, rather than a dependency. In general, all new modules should be of the main category, which ensures correct ordering for categorization, etc.

You can see a complete list of the available modules by calling

python3 -m analyzer modules

The Analyzer Object

The second argument to each analyzer module is a special object of type DatasetProcessor, used to manage the state of the analysis between modules.

For most users, the only use of this object will be through the DatasetProcessor.H(histogram_name, axis_names, data) function, which is used to create histograms.

For example, the module below creates a histogram of the pt of the second jet in each event:

@analyzerModule("simplejet", categories="main", depends_on=["objects", "event_level"])
def makeSimpleJetHistogram(events, analyzer):
    gj = events.good_jets
    analyzer.H("pt_2", 
            makeAxis(60, 0, 3000, f"$p_{{T, 2}}$", unit="GeV"),
            events.good_jets[:, 1].pt)
    return events,analyzer

This also showcases the makeAxis function, which can be used to quickly create a labeled axis. The unit argument can be used by plotting scripts to ensure correct labeling of the count axis.

You can also pass in a list of axes and data to create a multidimensional histogram.
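
For example, a two-dimensional histogram of the leading and subleading jet pt might look like the following sketch (the histogram name and binning here are illustrative):

analyzer.H("pt_1_vs_pt_2",
        [makeAxis(60, 0, 3000, "$p_{T, 1}$", unit="GeV"),
         makeAxis(60, 0, 3000, "$p_{T, 2}$", unit="GeV")],
        [events.good_jets[:, 0].pt, events.good_jets[:, 1].pt])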

Note: if your data object is masked, then you must pass in this mask using the mask keyword argument to ensure that the weights are properly adjusted.
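
As a sketch, suppose only events with at least two good jets should enter the histogram (the selection here is illustrative; ak refers to the awkward package):

import awkward as ak

mask = ak.num(events.good_jets, axis=1) >= 2  # keep only events with at least two good jets
analyzer.H("pt_2",
        makeAxis(60, 0, 3000, "$p_{T, 2}$", unit="GeV"),
        events.good_jets[mask][:, 1].pt,  # data computed only for masked events
        mask=mask)                        # pass the mask so the weights stay aligned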

Using Notebooks

Coffea works very nicely with Jupyter notebooks. Chances are you want to run the notebook on a remote node but view it in your local browser. This is accomplished by forwarding a port from your local machine to the remote machine using an ssh command like the one below:

ssh -N -L 5000:localhost:8999 username@remotehost

On the remote host, start the notebook by running the following from within the Python environment:

start_jupyter

You can pick any port for the remote machine, and you may need to if someone else is already using that port. With the forwarding above, a notebook served on remote port 8999 will be available in your local browser at localhost:5000.

The notebook notebooks/example.ipynb shows an example of doing a basic analysis within a notebook.