Notebooks to reproduce the Higgs discovery plots from ATLAS and CMS from public data.
You can find the notebooks to reproduce the Higgs discovery plots in the `talks` directory. The `notebooks` directory contains practice notebooks used to develop concepts for the analysis; they are not necessarily well documented. The `talks` directory contains information on which notebooks are available.
If you were to ask what big thing is missing, the answer would be the determination of systematic errors. The collaborations, of course, paid extensive attention to this; however, it requires a lot more data, studies, and tests, and so does not appear in these Open Data demos.
This repository was originally used for a talk at PyHEP 2021.
Anyone should be able to reproduce the ATLAS Higgs plot without hesitation. Reproducing the CMS plot, however, requires real resources: it accesses over 70 TB of data, and definitely puts stress on international infrastructure!
You'll need a `servicex.yaml` file in your home directory that contains something like the following:

```yaml
api_endpoints:
  - endpoint: http://xxx.org
    type: open_uproot
  - endpoint: http://yyy.org
    type: cms_run1_aod

backend_types:
  - type: open_uproot
    return_data: parquet
  - type: cms_run1_aod
    return_data: root
```

Please get in touch with us to get the address of the open instances running ServiceX.
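As a quick sanity check (a sketch, not part of the repository; it assumes PyYAML is installed, and `xxx.org`/`yyy.org` are the placeholder endpoints from the example above), you can confirm a config body parses and that every endpoint type has a matching backend entry:

```python
import yaml  # PyYAML

# The sample configuration from above; the endpoints are placeholders,
# not real ServiceX instances.
SAMPLE = """
api_endpoints:
  - endpoint: http://xxx.org
    type: open_uproot
  - endpoint: http://yyy.org
    type: cms_run1_aod

backend_types:
  - type: open_uproot
    return_data: parquet
  - type: cms_run1_aod
    return_data: root
"""

def validate_servicex_config(text):
    """Return True if both sections are present and their types line up."""
    cfg = yaml.safe_load(text)
    endpoint_types = {e["type"] for e in cfg["api_endpoints"]}
    backend_types = {b["type"] for b in cfg["backend_types"]}
    return endpoint_types == backend_types

print(validate_servicex_config(SAMPLE))
```

This only checks the shape of the file, not that the endpoints are reachable.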
Set up your environment:

- This has been run under Python 3.9.6. It should work with anything that is 3.7 or greater.
- Check out this repository locally, and check out the patched `coffea` repository locally.
- For the `coffea` repository, check out the branch `pr_servicex_flat_root_files`. For this package, use the head.
- Run `python -m venv .venv`, and activate the new environment.
- Run `pip install -r requirements.txt`.
- In the root directory of the checked-out `coffea` package, run `pip install -e .[servicex]`.
- Repeat the `pip install -r requirements.txt` command, as `coffea` will overwrite one of the packages needed.

From there you can start `jupyter-lab`.
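Before launching `jupyter-lab`, a quick interpreter check can catch the most common setup mistake (a small sketch; `meets_minimum_python` is a hypothetical helper, and the 3.7 minimum comes from the steps above):

```python
import sys

def meets_minimum_python(version=sys.version_info, minimum=(3, 7)):
    """Return True if the interpreter satisfies the README's 3.7+ minimum."""
    # Compare only (major, minor); micro releases don't matter here.
    return tuple(version[:2]) >= minimum

if __name__ == "__main__":
    print("OK" if meets_minimum_python() else "Python 3.7 or greater required")
```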
If you are on Windows, you'll need to make sure long path names (`LongPathsEnabled`) are turned on, as some of the CMS pathnames are longer than Windows' default 260-character path limit.
It is not currently possible to run on binder, as ServiceX uses a non-standard port to download data.
At the talk, about 45 TB of the full 70 TB of data was used for the CMS plot. Along the way, a number of problems were discovered when running with datasets that large; the repository's issues list what was encountered. As they are worked on, this repository will be updated to indicate the improvements in running on the full 70 TB dataset.
Contributions are welcome!! Please submit as pull requests!