This repo reproduces the GlueX tracking algorithm with PyTorch; it was originally implemented with TensorFlow Keras here. It aims at a future integration with phasm.
To the best of my knowledge, it mimics everything in the original Keras notebook, including the same:
- Shuffled training dataset;
- Batch size, epochs, NN size, loss function, optimizer, clip value;
- Learning rate scheduler.
- utils.py: define the NN structure and wrap the datasets into batched PyTorch dataloaders.
- LSTM_training.py: train the NN on the whole training dataset for 100 epochs and save the trained model as TorchScript.
- validation_processing.py: load the model from TorchScript and validate it with a validation dataset of shape (661644, 6).
- submit-training-job.slurm: a one-step Slurm script to run jobs on the farm.
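For orientation, here is a minimal sketch of the pieces utils.py and LSTM_training.py describe: three stacked LSTM layers, a linear head, and a TorchScript export. The layer sizes follow the network table below, but the class name `TrackLSTM` and file name `lstm_model.pt` are illustrative, not the repo's actual identifiers.

```python
import torch
import torch.nn as nn

class TrackLSTM(nn.Module):
    """Three stacked LSTMs (128 -> 64 -> 32) followed by a linear head.

    The first two LSTMs return the full sequence; only the last time step
    of the third is kept, matching the (batch_size, 32) shape in the table.
    """
    def __init__(self, n_features: int = 6):
        super().__init__()
        self.lstm1 = nn.LSTM(n_features, 128, batch_first=True)
        self.lstm2 = nn.LSTM(128, 64, batch_first=True)
        self.lstm3 = nn.LSTM(64, 32, batch_first=True)
        self.linear = nn.Linear(32, n_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x, _ = self.lstm1(x)             # (B, 7, 128)
        x, _ = self.lstm2(x)             # (B, 7, 64)
        x, _ = self.lstm3(x)             # (B, 7, 32)
        return self.linear(x[:, -1, :])  # (B, 6): last time step only

model = TrackLSTM()
x = torch.randn(1256, 7, 6)              # one batch: batch_size=1256, seq_len=7
print(model(x).shape)                    # torch.Size([1256, 6])

# Export / reload as TorchScript, as LSTM_training.py / validation_processing.py do.
scripted = torch.jit.script(model)
scripted.save("lstm_model.pt")
reloaded = torch.jit.load("lstm_model.pt")
```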
Based on my own experience, a bare-metal python3.9+pip3+cudnn8.6 installation always fails on the JLab ifarm A100 GPU nodes because of mismatched cuDNN/PyTorch versions. This is solved by installing the latest PyTorch (as of Nov-28-2022) via a conda virtual environment as guided here. A conda environment file is provided to document my environment configuration.
```shell
# install pytorch via conda
conda install pytorch torchvision torchaudio pytorch-cuda=11.6 -c pytorch -c nvidia

# create conda env from yml file
conda env create -f environment.yml
```
Table: the LSTM network definition, where batch_size=1256 and seq_len=7.
Layer | Input size | Output size | Param # |
---|---|---|---|
LSTM_1 | (batch_size, seq_len, 6) | (batch_size, seq_len, 128) | 69120 |
LSTM_2 | (batch_size, seq_len, 128) | (batch_size, seq_len, 64) | 49408 |
LSTM_3 | (batch_size, seq_len, 64) | (batch_size, 32) | 12416 |
Linear | (batch_size, 32) | (batch_size, 6) | 198 |
The parameter counts of the layers are taken from the original Keras model.summary().
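As a sanity check, a Keras LSTM layer has 4*((n_input + n_hidden)*n_hidden + n_hidden) parameters (four gates, each with input weights, recurrent weights, and one bias vector), and a dense layer has n_input*n_output + n_output. A few lines reproduce the table's counts:

```python
def keras_lstm_params(n_input: int, n_hidden: int) -> int:
    # 4 gates, each with input weights, recurrent weights and one bias vector
    return 4 * ((n_input + n_hidden) * n_hidden + n_hidden)

def dense_params(n_input: int, n_output: int) -> int:
    return n_input * n_output + n_output

print(keras_lstm_params(6, 128))   # 69120  (LSTM_1)
print(keras_lstm_params(128, 64))  # 49408  (LSTM_2)
print(keras_lstm_params(64, 32))   # 12416  (LSTM_3)
print(dense_params(32, 6))         # 198    (Linear)
```

Note that PyTorch's nn.LSTM keeps two bias vectors per gate set (bias_ih and bias_hh), so its counts are slightly higher, e.g. 69632 instead of 69120 for the first layer.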
Download the training dataset as below.
```shell
wget https://halldweb.jlab.org/talks/ML_lunch/Sep2019/MLchallenge2_training.csv
mv MLchallenge2_training.csv train_data.csv
```
Compared to the dataset available when the Keras notebook was executed, the new dataset is about 38.5% larger (2646573 vs. 1910698 rows).
After sequencing, the dimension of the whole training dataset (as of 10/20/2022) is (2646573, 7, 6), with each epoch containing ~2108 batches. We train 100 epochs in total.
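The sequencing step can be sketched with NumPy: slide a length-7 window over the (N, 6) feature rows to obtain (N - 6, 7, 6) sequences. This is a simplified sketch; the repo's actual preprocessing must also handle details such as not letting windows cross track boundaries.

```python
import numpy as np

def make_sequences(rows: np.ndarray, seq_len: int = 7) -> np.ndarray:
    """Turn an (N, 6) feature array into (N - seq_len + 1, seq_len, 6) windows."""
    # sliding_window_view returns (N - seq_len + 1, 6, seq_len); move the
    # time axis into the middle to get (batch, seq_len, features).
    windows = np.lib.stride_tricks.sliding_window_view(rows, seq_len, axis=0)
    return windows.transpose(0, 2, 1)

rows = np.arange(60, dtype=np.float32).reshape(10, 6)  # 10 dummy hits
seqs = make_sequences(rows)
print(seqs.shape)  # (4, 7, 6)
```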
Table: results after 100 training epochs
Exp | loss | mse | val_loss | val_mse | lr | Time | Training X size |
---|---|---|---|---|---|---|---|
Keras+TitanRTX*2 | 0.0015 | 6.8281e-06 | 0.0018 | 7.2508e-06 | 3.7715e-05 | ~20 mins | (1910698, 7, 6) |
PyTorch+TitanRTX | 0.0015 | 2.0858e-05 | 0.0015 | 2.0803e-05 | 3.7715e-05 | ~55 mins | (2646573, 7, 6) |
PyTorch+T4 | 0.0012 | 8.0547e-06 | 0.0012 | 7.6509e-06 | 4.4371e-05 | ~65 mins | (2646573, 7, 6) |
PyTorch+A100 | 0.0010 | 2.6062e-06 | 0.0010 | 2.5379e-06 | 5.2201e-05 | ~45 mins | (2646573, 7, 6) |
The code is tested on a single ifarm TitanRTX/T4/A100 GPU. Results are available at:
- ./res/training-loss: images of the losses along the training process.
- ./res/job-logs: the detailed job logs. An example of how the losses change over epochs, batches and time is here.
- ./res/evaluation: images of the evaluation results, comparing the evaluation errors at Epochs=1 and Epochs=100.
- Keras APIs: https://keras.io/api/
- PyTorch documentation: https://pytorch.org/docs/stable/index.html
- PyTorch tutorials: https://pytorch.org/tutorials/
Last updated on 02/01/2023 by xmei@jlab.org