ML compression of HEP event data using deep autoencoders, built with the PyTorch and fastai Python libraries. The scripts in this repository were used to perform compression and evaluate the performance of autoencoders on three different datasets:
- Internal data generated from the ATLAS event generator
- PhenoML dataset
- Dataset from a hackathon related to the DarkMachines unsupervised challenge project
The repository is developed by Honey Gupta as part of the Google Summer of Code project. Before that, it was built by Eric Wallin and Eric Wulff as part of their bachelor's and master's projects at Lund University. Technical explanations can be found in Eric Wulff's thesis.
A summary of the experiments performed and the results obtained as part of the Google Summer of Code 2020 project can be found in this report.
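For orientation, the sketch below illustrates the basic idea behind the compression: a small, fully connected autoencoder squeezes each event (here assumed to be 4 variables) through a lower-dimensional latent space and learns to reconstruct it. This is a minimal, self-contained example in plain PyTorch, not the exact architecture or training setup used in this repository; the layer widths, activations and latent dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AE_4D(nn.Module):
    """Minimal fully connected autoencoder: 4 inputs -> 3-dim latent -> 4 outputs.
    Layer widths, activations and latent size are illustrative assumptions."""
    def __init__(self, n_features=4, latent_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 200), nn.Tanh(),
            nn.Linear(200, 100), nn.Tanh(),
            nn.Linear(100, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 100), nn.Tanh(),
            nn.Linear(100, 200), nn.Tanh(),
            nn.Linear(200, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy training loop on random data, just to show the objective: minimize the
# mean-squared reconstruction error between input and output.
model = AE_4D()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
batch = torch.randn(256, 4)              # stand-in for a batch of normalized 4D events
for _ in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(batch), batch)  # reconstruction loss
    loss.backward()
    optimizer.step()
```

In the actual scripts, training is handled with fastai on the processed datasets described in the sections below.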
Using a virtual environment (recommended for server machines, such as LXPLUS)
- Fetch the latest version of the project:

    git clone https://github.com/Autoencoders-compression-anomaly/AE-Compression-pytorch.git
    cd AE-Compression-pytorch
- Make the installation script executable and run it:

    chmod +x install_libs.sh
    ./install_libs.sh

  Note that all the packages/libraries required for running the scripts in this repo have been added to the bash script. You can add others if needed.
OR
- Create a directory for your virtualenv and activate the environment:

    mkdir venv
    cd venv
    python -m virtualenv -p python3 venv
    source bin/activate
    cd ..
Now install the dependencies:

    pip install -r requirements.txt
- Pull the Docker container containing useful libraries:

    docker pull atlasml/ml-base
Run an interactive bash shell in the container, allowing the host machine to open Jupyter notebooks running in the container. The port 8899 can be changed if it is already taken:

    docker run -it -p 8899:8888 atlasml/ml-base
Check the container's name and attach to it:

    docker ps
    docker attach <name>
- Install the package

Lastly, the AE-Compression-pytorch package can be installed (run from the directory that holds setup.py):

    pip install .
Alternatively, if you want to easily edit and run the contents of the package without manually re-installing it, instead run:

    pip install -e .
With a Jupyter notebook running inside the container, one can access it on the host machine at the URL localhost:8899.
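Once installed, a quick way to confirm that the package is visible to Python is to import it. The module name used below (`HEPAutoencoders`) is an assumption; check `setup.py` for the actual package name if the import fails.

```python
# Sanity check of the installation; the package name is an assumption, see setup.py.
import HEPAutoencoders
print(HEPAutoencoders.__file__)  # with `pip install -e .` this should point into the source tree
```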
- This folder contains utility Python scripts needed by the main Python scripts.
  - `pre-processing.py`: extracts data from the ATLAS (D)xAOD file format (ROOT files) using the functions named `prep_processing.process_*()`. The experiments for this dataset were done with two types of data: 4-dimensional and 27-dimensional. (Although the original events hold 29 values, only 27 of them are of constant size.) The extracted data are converted into pandas DataFrames, which in turn may be pickled for further use (see the loading sketch after this file list).
  - `nn_utils.py`: holds various helpful methods for building the networks. It also contains some methods for training.
  - `utils.py`: holds functions for normalization and event filtering, amongst others.
  - `postprocessing.py`: holds various functions for saving data back into the ROOT file format.
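Since the pre-processing step produces pickled pandas DataFrames, downstream training and testing scripts typically start by reading them back. A minimal sketch is shown below; the file name and column names are placeholders, as the actual ones depend on the processing script that was run.

```python
import pandas as pd

# Placeholder path; use the pickle produced by your own processing run.
df = pd.read_pickle("processed_4D_train.pkl")

print(df.shape)       # e.g. (n_objects, 4) for the 4-dim data
print(df.columns)     # kinematic variables; names depend on the processing script
print(df.describe())  # quick sanity check of the value ranges before normalization
```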
- This folder contains Python scripts that can be used to create the training and testing datasets from the ATLAS, PhenoML and DarkMachines datasets. All the Python scripts have very similar code and functions, except for some small variations depending on the experiments for which they were used. The names of the Python files should be self-explanatory about the task they perform and the kind of dataset they create.
  This folder also contains two Jupyter notebooks:
  - `plot_particle_distribution.ipynb`: contains the functions to plot the particle distribution for a particular process (from the PhenoML dataset). It also contains the scripts to create data distribution plots for different process (.csv) files (a minimal plotting sketch follows below).
  - `process_data_as_4D.ipynb`: gives a visual intuition about the different parts of the processing scripts and their functions.
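The notebooks above create the distribution plots interactively; the sketch below shows the same kind of single-variable distribution plot with matplotlib, assuming a processed DataFrame with a transverse-momentum-like column. The file and column names are placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder file and column names; substitute the processed output you want to inspect.
df = pd.read_pickle("processed_4D_train.pkl")

plt.figure(figsize=(6, 4))
plt.hist(df["pt"], bins=100, histtype="step")  # 'pt' is an assumed column name
plt.xlabel("pt")
plt.ylabel("Number of objects")
plt.yscale("log")
plt.title("Example particle-level distribution")
plt.savefig("pt_distribution.png")
```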
- This folder contains a script that scales (or normalizes) the data generated by the processing scripts. The script uses a `FunctionScaler` to normalize the data. This was used in the experiments described in Eric Wulff's thesis. However, during our experiments, we shifted to standard normalization and, mainly, to custom normalization, which is implemented as part of the training and testing scripts. The script is kept here for the sake of completeness.
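For reference, "standard normalization" here means z-score scaling each variable; a minimal stand-alone version is sketched below. The custom normalization used in the later experiments is variable-specific and lives directly in the training and testing scripts, so this snippet is only an illustrative stand-in, not the repository's implementation.

```python
import pandas as pd

def standard_normalize(df_train: pd.DataFrame, df_test: pd.DataFrame):
    """Z-score each column using statistics computed on the training set only."""
    mean, std = df_train.mean(), df_train.std()
    return (df_train - mean) / std, (df_test - mean) / std, (mean, std)

def undo_standard_normalize(df_norm: pd.DataFrame, stats):
    """Invert the scaling, e.g. before computing residuals in physical units."""
    mean, std = stats
    return df_norm * std + mean
```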
- Throughout the project, the experiments were run using a batch service at CERN called HTCondor. This folder contains the scripts that were used for submitting the different training jobs during the experiments. These are included to ensure reproducibility and to make knowledge transfer easier.
- This folder contains the training, testing and analysis scripts for all the experiments on the three datasets mentioned above, with standard and custom normalization.
  - `phenoML`: This folder contains the backbone training scripts and testing notebooks.
    a. `train_eventsAs4D*` can be used to train an autoencoder model on the 4D data extracted from the event-level data present in the PhenoML dataset.
    b. `test_4D(*)` can be used to test the trained models and create residual and correlation plots.
    c. `test_4D_customNorm_differentParticles_stackedPlots.ipynb` contains the script to test a model trained on 4D custom-normalized data, along with methods to create stacked or overlapped residual/error plots for the analysis. The model for this experiment was trained on jet (j and b) particles and tested on other particles such as electrons (e-), positrons (e+), muons (μ-), antimuons (μ+) and photons (γ).
    d. `4D_customNorm` contains the scripts and the models used for training and analysing the 4D data from the PhenoML dataset. The model in this folder is the one used to create the analysis plots for the related experiment.
    e. `4D_stdNorm`: similar to the custom-norm folder, this contains the training script, trained model and testing Jupyter notebook for the experiments performed with standard normalization.
    f. `half_data`: this folder has the same structure as above, except that the models in it were trained with half of the training data used in the above experiments.
  - `darkmachines`: `4D_customNorm/` contains the training script to train an autoencoder model on data belonging to the `chan2a` type of the DarkMachines challenge dataset, which mostly contains other particles (e-, e+, μ-, μ+, γ) mixed with jets, but with a smaller percentage of jets. The analysis was done on `chan3` data, which mostly contains jets (j and b).
  - ATLAS data
    a. `4D`: An example of training a 4D network can be found in `examples/4D/4D_training.ipynb`. fastai saves trained models in the folder `models/` relative to the training script, with the `.pth` file extension. An example of running or testing an already-trained 4-dimensional network can be found in `TLA_analysis.ipynb` (see the model-loading sketch after this list).
    b. `27D`: Most of the code here was taken from the repository by Eric Wulff. `27D_train.py` contains the script to train the network on the 27D data. For an example of analysing a 27D network, you can refer to `27D_analysis.py`.
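As noted in the ATLAS 4D example, fastai stores trained weights as `.pth` files under `models/`. The sketch below shows one way to reload such a checkpoint in plain PyTorch and run the network on a batch; the network class, checkpoint path and checkpoint layout are assumptions and must match whatever was actually trained.

```python
import torch
import torch.nn as nn

class AE_4D(nn.Module):
    """Illustrative architecture; it must match the one the checkpoint was trained with."""
    def __init__(self, n_features=4, latent_dim=3):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_features, 200), nn.Tanh(),
                                     nn.Linear(200, 100), nn.Tanh(),
                                     nn.Linear(100, latent_dim))
        self.decoder = nn.Sequential(nn.Linear(latent_dim, 100), nn.Tanh(),
                                     nn.Linear(100, 200), nn.Tanh(),
                                     nn.Linear(200, n_features))
    def forward(self, x):
        return self.decoder(self.encoder(x))

model = AE_4D()
# Placeholder checkpoint path; fastai writes .pth files under models/ next to the training script.
state = torch.load("models/AE_4D_example.pth", map_location="cpu")
# fastai checkpoints often wrap the weights in a dict under the 'model' key.
model.load_state_dict(state["model"] if "model" in state else state)
model.eval()

with torch.no_grad():
    batch = torch.randn(32, 4)      # stand-in for a batch of normalized 4D events
    reconstructed = model(batch)    # decompressed events
    latent = model.encoder(batch)   # the compressed (latent) representation
```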
The scripts to process data can be found in the `process_data` folder. The `process.sub` and `process.sh` files can be used to submit a batch job on HTCondor. A description of the scripts can be found in the previous section.
The training scripts for the different experiments can be found in the `examples` folder, organized according to their dataset and normalization type. Again, a description of the folders and scripts can be found in the previous section.
Each experiment's folder inside `examples` contains a Jupyter notebook to load the testing dataset, load the pre-trained model, run the model on the testing dataset, and create residual and correlation plots. Instructions for understanding the methods in the notebooks can be found inside the notebooks themselves.
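The core of what those notebooks do can be reduced to a few lines: compare the original and reconstructed variables, histogram the residuals, and inspect how well input and output are correlated. The sketch below assumes the data is already available as numpy arrays in physical units (after undoing the normalization); the variable names are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_residuals_and_correlation(original: np.ndarray, reconstructed: np.ndarray,
                                   names=("pt", "eta", "phi", "E")):
    """Relative-residual histogram and input-vs-output scatter plot per variable.
    The variable names are placeholders for whatever the dataset actually contains."""
    residual = (reconstructed - original) / original  # relative residual per variable
    for i, name in enumerate(names):
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
        ax1.hist(residual[:, i], bins=100, histtype="step")
        ax1.set_xlabel(f"({name}_out - {name}_in) / {name}_in")
        ax1.set_ylabel("Entries")
        ax2.scatter(original[:, i], reconstructed[:, i], s=1)
        ax2.set_xlabel(f"{name} (input)")
        ax2.set_ylabel(f"{name} (reconstructed)")
        fig.tight_layout()
        fig.savefig(f"residual_{name}.png")
```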