REANA example - ALICE LEGO train test run

Primary language: C. License: GNU General Public License v2.0 (GPL-2.0).

About

This REANA reproducible analysis example performs an ALICE LEGO train test run and validation. The procedure is used in ALICE collaboration particle physics analyses. Please see arXiv:1502.06381 for a more detailed description of the ALICE analysis train system.

Analysis structure

Making a research data analysis reproducible means providing "runnable recipes" that address (1) where the input datasets are, (2) what software was used to analyse the data, (3) which computing environment was used to run the software, and (4) which workflow steps were taken to run the analysis.

1. Input data

This example uses ALICE Pb-Pb collision data files as input. We can take a sample Pb-Pb ESD open data file that the ALICE collaboration released on the CERN Open Data portal, for example the following sample taken at 3.5 TeV from run number 139038 in RunH 2010 (beware, the file is 360 MB):

$ mkdir -p __alice__data__2010__LHC10h_2__000139038
$ cd __alice__data__2010__LHC10h_2__000139038
$ wget http://opendata.cern.ch/record/1102/files/assets/alice/2010/LHC10h/000139038/ESD/0003/AliESDs.root
$ cd ..

Note that the data.txt file should contain the path to the downloaded sample data file.
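
Before launching the test run, it can help to confirm that every path recorded in data.txt really exists. The following is a minimal sketch, not part of the example repository: the function name check_data_list is hypothetical, and the format (one path per line, with blank lines and lines starting with '#' skipped) is an assumption about data.txt rather than something stated here.

```shell
# Sketch: check that every path listed in a data list file exists.
# ASSUMPTION: one path per line; blank lines and '#' comment lines are skipped.
check_data_list() {
    list_file="$1"
    if [ ! -r "$list_file" ]; then
        echo "cannot read list file: $list_file" >&2
        return 2
    fi
    missing=0
    # The '|| [ -n "$path" ]' part also handles a last line without a newline.
    while IFS= read -r path || [ -n "$path" ]; do
        case "$path" in
            ''|'#'*) continue ;;
        esac
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=1
        fi
    done < "$list_file"
    return "$missing"
}

# Example use (run from the directory that contains data.txt):
#   check_data_list data.txt && echo "all input paths found"
```

The function returns 0 only when all listed paths exist, so it can guard the ./runTest.sh call in a script.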

2. Analysis code

This example uses the AliPhysics analysis framework with the source code files that are listed among the inputs of the reana.yaml file below.

Notably, the user provides the MLTrainDefinition.cfg file, which defines the set of train wagons that compose the analysis train run. In this example, the following wagons are defined:

$ grep Begin MLTrainDefinition.cfg
#Module.Begin        Centrality_CF_AP
#Module.Begin        PIDresponse_CF_AP
#Module.Begin        Run1NetPiBASE_CF_AP

The first wagons are usually related to centralised data selection tasks, while the main user analysis is executed in the last Run1NetPiBASE_CF_AP wagon.

The runTest.sh script will take care of creating the train test run, running it, and validating its outputs.

3. Compute environment

This example uses the AliPhysics analysis framework, which has been containerised as the reana-env-aliphysics environment. You can fetch the desired AliPhysics version from Docker Hub:

$ docker pull reanahub/reana-env-aliphysics:vAN-20180614-1

We shall use the vAN-20180614-1 version for the present example.

Note that if you would like to build a different AliPhysics version on your own, you can follow the reana-env-aliphysics procedures and set the ALIPHYSICS_VERSION environment variable appropriately:

$ cd src/reana-env-aliphysics
$ export ALIPHYSICS_VERSION=vAN-20180521-1
$ make build

4. Analysis workflow

The researcher typically uses a single test run command:

$ ./runTest.sh

which performs all the tasks related to the analysis train generation, running and validation. Underneath, the following sequence of commands is called:

# generate the LEGO train run and validation files:
aliroot -b -q generate.C > generation.log

# perform the LEGO train test run:
source ./lego_train.sh > stdout 2> stderr

# verify that the expected result files are present:
source ./lego_train_validation.sh > validation.log

The produced log files indicate whether the train test run was successful and whether the output is validated.

The computational workflow is therefore essentially sequential in nature. We can use the REANA serial workflow engine and represent the analysis workflow as follows:

           START
            |
            |
            V
+----------------------------------------+
| (1) download ESD input data file       |
|                                        |
|    $ wget http://opendata.cern.ch/...  |
+----------------------------------------+
            |
            | AliESDs.root
            V
+----------------------------------------+       +-------------------------+
| (2) generate LEGO train files          |       |       input code        |
|                                        |  <----|   MLTrainDefinition.cfg |
|    $ aliroot -b -q generate.C          |       |   env.sh handlers.C ... |
+----------------------------------------+       +-------------------------+
            |
            | lego_train.sh
            | lego_train_validation.sh
            | ...
            V
+----------------------------------------+
| (3) perform LEGO train test run        |
|                                        |
|    $ source ./lego_train.sh            |
+----------------------------------------+
            |
            | stdout
            | AnalysisResults.root
            | ...
            V
+----------------------------------------+
| (4) validate test run outputs          |
|                                        |
|    $ source ./lego_train_validation.sh |
+----------------------------------------+
            |
            | validation.log
            | AnalysisResults.root
            V
+----------------------------------------+
| (5) plot sample histogram              |
|                                        |
|    $ root -b -q plot.C                 |
+----------------------------------------+
            |
            | plot.pdf
            V
           STOP

We shall see below how this sequence of commands is represented for the REANA serial workflow engine.

5. Output results

The output of the ALICE LEGO analysis train test run and validation is available in the stdout file. The success or failure is reported at the end:

$ tail -4 stdout
* ----------------------------------------------------*
* ----------------   Job Validated  ------------------*
* ----------------------------------------------------*
*******************************************************
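
The manual check with tail shown above can be scripted, for instance in a continuous integration job. This is a minimal sketch under the assumption that a successful run always prints the "Job Validated" banner within the last four lines of the log, as in the sample output; the function name job_validated is hypothetical.

```shell
# Sketch: succeed only if the validation banner appears at the end of the log.
# ASSUMPTION: the banner text "Job Validated" is within the last 4 lines.
job_validated() {
    log_file="${1:-stdout}"
    tail -n 4 "$log_file" 2>/dev/null | grep -q 'Job Validated'
}

# Example use:
#   job_validated stdout && echo "train test run validated"
```

The exit status is that of grep, so a missing log file or a missing banner both count as a failure.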

The test run will also create ROOT output files that usually contain histograms.

$ ls -l AnalysisResults.root EventStat_temp.root
-rw-r--r-- 1 root root 393111 May 30 17:35 EventStat_temp.root
-rw-r--r-- 1 root root  31187 May 30 17:35 AnalysisResults.root

The user typically uses the output files to produce final plots. For example, running the plot.C macro on the AnalysisResults.root output file visualises the centrality of accepted events:

[Figure: plot.png, centrality of accepted events]

Low centralities mean that the Pb particles hit each other a lot and many nucleons collided. High centralities mean that the Pb particles barely interacted and only very few nucleons collided.

Running the example on REANA cloud

We start by creating a reana.yaml file describing the above analysis structure with its inputs, code, runtime environment, computational workflow steps and expected outputs:

version: 0.3.0
inputs:
  files:
  - MLTrainDefinition.cfg
  - data.txt
  - env.sh
  - generate.C
  - generator_customization.C
  - globalvariables.C
  - handlers.C
  - plot.C
  - runTest.sh
  - fix-env.sh
workflow:
  type: serial
  specification:
    steps:
      - environment: 'reanahub/reana-env-aliphysics:vAN-20180614-1'
        commands:
        - 'mkdir __alice__data__2010__LHC10h_2__000139038/'
        - 'wget http://opendata.cern.ch/record/1102/files/assets/alice/2010/LHC10h/000139038/ESD/0003/AliESDs.root'
        - 'mv AliESDs.root __alice__data__2010__LHC10h_2__000139038/'
        - 'source fix-env.sh && source env.sh && aliroot -b -q generate.C 2> generation.err | tee generation.log'
        - 'source fix-env.sh && source env.sh && export ALIEN_PROC_ID=12345678 && source ./lego_train.sh 2> stderr | tee stdout'
        - 'source fix-env.sh && source env.sh && source ./lego_train_validation.sh 2> validation.err | tee validation.log'
        - 'source fix-env.sh && source env.sh && root -b -q ./plot.C'
outputs:
  files:
  - plot.pdf
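
Since the same input files are named both in reana.yaml and on the upload command line, it can be useful to list them straight from the specification. The sketch below is a naive line-based extraction, not a real YAML parser: the function name list_reana_inputs is hypothetical, and it assumes the layout used above, where a top-level `inputs:` key is followed by indented `- name` entries.

```shell
# Sketch: print the entries listed under the top-level "inputs:" section.
# ASSUMPTION: simple block-style YAML as in the reana.yaml shown above;
# this does not handle quoting, flow style, or nested directories keys.
list_reana_inputs() {
    awk '
        /^inputs:/           { in_inputs = 1; next }   # section starts
        /^[^ ]/              { in_inputs = 0 }         # next top-level key ends it
        in_inputs && /^ *- / { sub(/^ *- */, ""); print }
    ' "${1:-reana.yaml}"
}

# Example use: report input files that are missing locally.
#   list_reana_inputs reana.yaml | while IFS= read -r f; do
#       [ -e "$f" ] || echo "missing input: $f"
#   done
```

For anything beyond this simple layout, a proper YAML parser would be the safer choice.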

We can now install the REANA command-line client, run the analysis and download the resulting plots:

$ # create new virtual environment
$ virtualenv ~/.virtualenvs/myreana
$ source ~/.virtualenvs/myreana/bin/activate
$ # install reana-client utility
$ pip install reana-client
$ # connect to some REANA cloud instance
$ export REANA_SERVER_URL=https://reana.cern.ch/
$ export REANA_ACCESS_TOKEN=XXXXXXX
$ # create new workflow
$ reana-client create -n my-analysis
$ export REANA_WORKON=my-analysis
$ # upload input code and data to the workspace
$ reana-client upload MLTrainDefinition.cfg data.txt \
  env.sh generate.C generator_customization.C globalvariables.C \
  handlers.C plot.C runTest.sh fix-env.sh
$ # start computational workflow
$ reana-client start
$ # ... should be finished in about a minute
$ reana-client status
$ # list workspace files
$ reana-client list
$ # download output results
$ reana-client download stdout
$ reana-client download plot.pdf

Please see the REANA-Client documentation for a more detailed explanation of typical reana-client usage scenarios.

Contributors

The list of contributors in alphabetical order: