jetclass-qcd-top

This repo contains the code to filter the JetClass dataset for QCD/top jets and add the ParT predictions.

The code is not refined yet, but gets the job done for now.

The resulting datasets are stored on DESY Sync&Share. Each dataset contains approximately equal numbers of hadronic top jets and QCD jets.

Filename                      Number of jets
filtered_jetclass_train.h5    4M (2M/class)
filtered_jetclass_val.h5      1M (500k/class)
filtered_jetclass_test.h5     2M (1M/class)

Downloading and reading the files

You can download the files via the DESY Sync&Share web interface.

After downloading, you can load the files with e.g. pandas or h5py. The code snippet below assumes that the files are in your current working directory.

import pandas as pd

df_train = pd.read_hdf("filtered_jetclass_train.h5", key="df")
df_val = pd.read_hdf("filtered_jetclass_val.h5", key="df")
df_test = pd.read_hdf("filtered_jetclass_test.h5", key="df")
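Once a file is loaded, the class balance quoted in the table above can be checked with a few lines. This is a minimal sketch, not part of the repo: the helper name class_counts is hypothetical, and a tiny stand-in DataFrame is used so the snippet runs without the real files. It only assumes the label_top column described in this README.

```python
import pandas as pd


def class_counts(df: pd.DataFrame) -> dict:
    """Count QCD (label_top=0) and top (label_top=1) jets in a DataFrame."""
    # Cast to int so the lookup works regardless of how the label is stored
    counts = df["label_top"].astype(int).value_counts()
    return {"qcd": int(counts.get(0, 0)), "top": int(counts.get(1, 0))}


# Stand-in for e.g. df_train; replace with the DataFrame loaded via pd.read_hdf
demo = pd.DataFrame({"label_top": [0, 1, 1, 0, 1, 0]})
demo_counts = class_counts(demo)
print(demo_counts)
```

With the real files, calling class_counts(df_train) should give roughly equal numbers for both classes.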

Content of the output files

The files contain the jet-level features as well as the rescaled ParT top quark predictions.

Variable name                                      Description
jet_p_top_ParT_kin = p_Tbqq / (p_Tbqq + p_QCD)     Rescaled top quark probability of ParT-kin
jet_p_top_ParT_full = p_Tbqq / (p_Tbqq + p_QCD)    Rescaled top quark probability of ParT-full
label_top                                          label_top=1 for top jets and label_top=0 for QCD jets
jet_pt
jet_eta
jet_phi
jet_energy
jet_nparticles
jet_sdmass
jet_tau1
jet_tau2
jet_tau3
jet_tau4
aux_genpart_eta
aux_genpart_phi
aux_genpart_pid
aux_genpart_pt
aux_truth_match
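The rescaling p_Tbqq / (p_Tbqq + p_QCD) used for the two ParT columns can be illustrated as follows. This is only a sketch of the formula given above, not the repo's implementation: the function name rescale_top_probability and the example probabilities are made up for illustration.

```python
import numpy as np


def rescale_top_probability(p_tbqq, p_qcd):
    """Rescale multi-class outputs to a two-class top-vs-QCD probability.

    p_tbqq and p_qcd are the classifier probabilities for the Tbqq and QCD
    classes; all other classes are ignored, so the result lies in [0, 1].
    """
    p_tbqq = np.asarray(p_tbqq, dtype=float)
    p_qcd = np.asarray(p_qcd, dtype=float)
    return p_tbqq / (p_tbqq + p_qcd)


# Hypothetical example values: one top-like jet and one QCD-like jet
p_top = rescale_top_probability([0.6, 0.1], [0.2, 0.3])
print(p_top)
```

After rescaling, the top and QCD probabilities sum to one, so a single number per jet is enough to describe the top-vs-QCD decision.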

The full comparison of variables between the original JetClass dataset and the filtered dataset is listed below.

Overview / variable comparison to original JetClass dataset
Variable name    Status (✅ Included / ❌ Removed / 🆕 Added)
label_top 🆕 Added
jet_p_top_ParT_kin 🆕 Added
jet_p_top_ParT_full 🆕 Added
part_px ❌ Removed
part_py ❌ Removed
part_pz ❌ Removed
part_energy ❌ Removed
part_deta ❌ Removed
part_dphi ❌ Removed
part_d0val ❌ Removed
part_d0err ❌ Removed
part_dzval ❌ Removed
part_dzerr ❌ Removed
part_charge ❌ Removed
part_isChargedHadron ❌ Removed
part_isNeutralHadron ❌ Removed
part_isPhoton ❌ Removed
part_isElectron ❌ Removed
part_isMuon ❌ Removed
label_QCD ❌ Removed
label_Hbb ❌ Removed
label_Hcc ❌ Removed
label_Hgg ❌ Removed
label_H4q ❌ Removed
label_Hqql ❌ Removed
label_Zqq ❌ Removed
label_Wqq ❌ Removed
label_Tbqq ❌ Removed
label_Tbl ❌ Removed
jet_pt ✅ Included
jet_eta ✅ Included
jet_phi ✅ Included
jet_energy ✅ Included
jet_nparticles ✅ Included
jet_sdmass ✅ Included
jet_tau1 ✅ Included
jet_tau2 ✅ Included
jet_tau3 ✅ Included
jet_tau4 ✅ Included
aux_genpart_eta ✅ Included
aux_genpart_phi ✅ Included
aux_genpart_pid ✅ Included
aux_genpart_pt ✅ Included
aux_truth_match ✅ Included
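The column changes listed above can be sketched in pandas as shown below. This is a simplified illustration under stated assumptions, not the code in prepare_dataset.py: the function name filter_qcd_top is hypothetical, the input is assumed to be a flat DataFrame with the original JetClass label columns, top jets are assumed to be the label_Tbqq class, and the ParT predictions are not computed here.

```python
import pandas as pd


def filter_qcd_top(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only QCD and hadronic-top jets and replace the labels by label_top."""
    # Select jets that belong to the QCD or the Tbqq class
    keep = (df["label_QCD"] == 1) | (df["label_Tbqq"] == 1)
    out = df.loc[keep].copy()

    # New binary label: 1 for top jets, 0 for QCD jets
    out["label_top"] = out["label_Tbqq"].astype(int)

    # Drop particle-level features and the original one-hot labels
    drop_cols = [
        c for c in out.columns
        if c.startswith("part_") or (c.startswith("label_") and c != "label_top")
    ]
    return out.drop(columns=drop_cols).reset_index(drop=True)


# Tiny stand-in table with one QCD jet, one Hbb jet and one top jet
demo = pd.DataFrame({
    "jet_pt": [500.0, 620.0, 710.0],
    "part_px": [1.0, 2.0, 3.0],
    "label_QCD": [1, 0, 0],
    "label_Hbb": [0, 1, 0],
    "label_Tbqq": [0, 0, 1],
})
filtered = filter_qcd_top(demo)
print(filtered)
```

In this toy example the Hbb jet is discarded, and only jet_pt and label_top remain as columns.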

Run the code (on the DESY Maxwell cluster)

You'll have to make sure that you have the JetClass dataset stored on your machine and adapt the paths in prepare_dataset.py accordingly.

The code can then be executed from within this repo by running the following Singularity command:

singularity exec --nv -B /home -B /beegfs /beegfs/desy/user/birkjosc/singularity_images/pytorch-image-v0.0.8.img \
    bash -c "source /opt/conda/bin/activate && python prepare_dataset.py"

If you don't have access to the DESY Maxwell cluster, you can run the code elsewhere, but you'll have to build the Singularity image yourself. The image is available on DockerHub as jobirk/pytorch-image:v0.0.8.