ctstrkx

Track reconstruction in the CTS (modulemap version): an extension of the STT pipeline (stttrkx) to the CTS (ctstrkx).


1. Running the Pipeline

This code uses the Exa.TrkX-HSF pipeline as a baseline. It uses the traintrack library to run the different stages (Processing, DNN/GNN, and Segmenting) of the STT pipeline. The pipeline is intended for the Straw Tube Tracker (STT) of the PANDA experiment, part of the Central Tracking System (CTS) located in the Target Spectrometer.

1.1 Running the Pipeline on CPU

Once a conda environment has been created (see envs/README.md for instructions), one can run the pipeline from the root directory as follows:

# running pipeline
conda activate exatrkx-cpu
export EXATRKX_DATA=path/to/dataset
traintrack configs/pipeline_quickstart.yaml
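
Here, pipeline_quickstart.yaml tells traintrack which stages to run and in what order. As a rough, hypothetical illustration (the stage and file names below are placeholders; see the actual files in configs/), such a pipeline file lists the stages like this:

# hypothetical pipeline config; the real stage names live in configs/
stage_list:
    - {set: Processing, name: FeatureStore, config: processing_config.yaml}
    - {set: GNN, name: EdgeClassifier, config: gnn_train_config.yaml}
    - {set: Segmenting, name: TrackBuilder, config: segmenting_config.yaml}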

1.2 Running Pipeline on Cluster

Follow the instructions in the NERSC Documentation, or see the concise version in NERSC.md, to run the pipeline on the Cori cluster at NERSC.
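
A minimal sketch of a batch script for such a run (the constraint, queue, time limit, and environment name are assumptions; NERSC.md has the authoritative steps):

#!/bin/bash
# hypothetical SLURM script for Cori; adjust constraint, queue, and time
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -N 1
#SBATCH -t 04:00:00

module load python                  # makes conda available on Cori
conda activate exatrkx-cpu          # or a GPU environment, if available
export EXATRKX_DATA=$SCRATCH/dataset
srun traintrack configs/pipeline_quickstart.yaml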

2. Understanding the Pipeline

The deep learning pipeline consists of several stages: Processing, Graph Construction, Edge Labelling, and Graph Segmentation. The pipeline assumes that the input data are CSV files similar to the TrackML data format (see https://www.kaggle.com/c/trackml-particle-identification).
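
For instance, a TrackML-style event is split across a few per-event CSV files; a minimal sketch of loading one event (the file names follow the TrackML convention and are illustrative here):

import pandas as pd

# one event split across TrackML-style CSV files (names are illustrative)
hits = pd.read_csv("event000000001-hits.csv")            # hit positions
particles = pd.read_csv("event000000001-particles.csv")  # particle truth
truth = pd.read_csv("event000000001-truth.csv")          # hit-particle mapping
print(hits.head())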

  1. The Data Processing stage reads the comma-separated values (CSV) files containing raw events from the PandaRoot simulation and stores the processed data as PyTorch Geometric Data objects. In this stage, new quantities are derived, e.g. $r, \phi, p_t, d_0$ (see the first sketch after this list). At the moment, this stage can't run in a CUDA-enabled environment due to the Python multiprocessing library; one needs to run it in a CPU-only environment.

  2. The Graph Construction stage will construct graphs using either a heuristic method or metric learning (embedding). At the moment, this stage is not supported; instead, graph construction using the heuristic method is merged into the Processing stage. Consequently, one needs to distribute the data into train, val, and test folders by hand, since the Edge Labelling (DNN/GNN) stage assumes the data are distributed into these folders [this may change in the future].

  3. The Edge Labelling stage finishes with the GNNBuilder callback, which stores the edge_score for all events. One can re-run this step using e.g. traintrack --inference configs/pipeline_quickstart.yaml, but one needs to set resume_id in pipeline_quickstart.yaml.

  4. The Graph Segmentation stage is meant for track building using DBSCAN or CCL (see the second sketch after this list). However, one may skip this stage altogether and move to the eval/ folder, where one can perform segmenting as well as track evaluation. This separation serves post-analysis needs, as one may need to run segmenting together with evaluation under different settings. At the moment, it is recommended to skip this stage and move directly to the eval/ directory (see eval/README.md for more details).
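
As referenced in item 1 above, here is a minimal sketch of deriving cylindrical quantities and wrapping them in a PyTorch Geometric Data object. The inputs are toy values; the real stage reads the PandaRoot CSVs and derives more quantities, e.g. $p_t$ and $d_0$:

import numpy as np
import torch
from torch_geometric.data import Data

# toy hit coordinates standing in for one event (illustrative only)
x, y, z = np.random.randn(3, 100).astype(np.float32)

r = np.sqrt(x**2 + y**2)    # cylindrical radius
phi = np.arctan2(y, x)      # azimuthal angle

features = torch.from_numpy(np.stack([r, phi, z], axis=1))
event = Data(x=features)    # processed event as a PyG Data object
torch.save(event, "event000000001.pt")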
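
And for item 4, a minimal sketch of track building with DBSCAN. The inputs are toy points; in the pipeline, the clustering would run on hits in an embedded space or on graphs filtered by edge scores:

import numpy as np
from sklearn.cluster import DBSCAN

# toy 3D points standing in for embedded hits (illustrative only)
points = np.random.rand(100, 3)

labels = DBSCAN(eps=0.25, min_samples=2).fit_predict(points)
# each non-negative label is a track candidate; -1 marks noise hits
print(set(labels))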

3. Understanding the Code

The ctstrkx repo contains several subdirectories, each holding code for a specific task:

  • configs/ contains top-level pipeline configuration files for traintrack
  • eda/ contains notebooks for exploratory data analysis to understand raw data.
  • envs/ contains files for building a conda environment
  • eval/ contains code for track evaluation; it also contains code for running the segmenting stage independently of traintrack
  • LightningModules/ contains code for each stage of the pipeline
  • src/ contains helper code for utility functions, plotting, event building, etc.
  • RayTune/ contains helper code for hyperparameter tuning with the Ray Tune library (see the sketch after this list)
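
As a minimal sketch of what the RayTune/ helpers build on (the objective and search space below are placeholders, using the classic tune.run API):

from ray import tune

def objective(config):
    # stand-in for training a model and reporting a validation metric
    score = -(config["lr"] - 1e-3) ** 2
    tune.report(score=score)

analysis = tune.run(
    objective,
    config={"lr": tune.loguniform(1e-4, 1e-1)},
    num_samples=10,
)
print(analysis.get_best_config(metric="score", mode="max"))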

Several notebooks are available to inspect the output of each stage as well as for post-analysis; they are not necessarily intended to run the stages interactively. For example,

  • stt1_proc.ipynb inspects the output of the Processing stage
  • stt2_gnn_train.ipynb and stt3_gnn_infer.ipynb inspect the output of the GNN stage
  • etc.