9th place Solution for the NFL 1st and Future - Impact Detection Kaggle Competition
Authors : Uijin Choi, Theo Viel, Chris Deotte, Reza Vaghefi
This repository contains the code for the 3D classification & post-processing parts of the pipeline.
In this competition, you’ll develop a computer vision model that automatically detects helmet impacts that occur on the field. Kick off with a dataset of more than one thousand definitive head impacts from thousands of game images, labeled video from the sidelines and end zones, and player tracking data.
The aim of the competition is to detect helmet impacts in American football plays. For each play, we have access to two views, Endzone & Sideline, as well as the player tracking data.
From https://www.kaggle.com/c/nfl-impact-detection/discussion/209012
Our solution consists of a two-step pipeline (detection and classification) followed by some post-processing. We use 2D detection to find possible impact boxes, and then we use 3D classification to determine which boxes are impacts.
Our detection model finds the yellow box as a possible helmet impact.
From only 2D it is impossible to know whether these helmets are about to impact. It appears probable but maybe the silver helmet is about to pass behind the white helmet.
Our classification model uses frames before and after this frame. With this information, our classification model correctly identifies that these two helmets do not impact but rather pass by each other.
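The two-stage logic can be sketched as follows. This is only an illustration of the idea: `detector` and `classifier_3d` stand in for the trained DetectoRS model and the 3D classifier, and the 0.5 cutoff is a placeholder, not the tuned threshold.

```python
def predict_impacts(video_frames, detector, classifier_3d):
    """Two-stage pipeline sketch: a 2D detector proposes candidate helmet
    boxes, then a 3D classifier looks at the surrounding frames to decide
    which candidates are real impacts."""
    impacts = []
    for i, frame in enumerate(video_frames):
        for box, score in detector(frame):
            # Temporal context: up to 8 frames on each side of the candidate
            clip = video_frames[max(0, i - 8):i + 9]
            if classifier_3d(clip, box) > 0.5:  # placeholder threshold
                impacts.append((i, box, score))
    return impacts
```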
written by Uijin Choi
We used a DetectoRS (ResNeXt-101-32x4d) model with MMDetection for detecting impacts. This model was built in two steps:
- Helmet Detection Model for warm-up: we trained a helmet detection model for 12 epochs on the train helmet dataset (in the image folder). This model reaches a validation score of bbox_mAP_50 = 0.877.
- 2-Class Detection Model: we used the final weights of the Helmet Detection Model as pretrained weights to build a Helmet/Impact detection model, using +/- 4 frames around each impact as the positive class.

Using the Helmet Detection Model as pretrained weights makes the 2-Class Model converge much faster and detect impacts more easily.
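The "+/- 4 frames as the positive class" labeling can be sketched as below. This is an illustration of the idea, not the actual training code:

```python
def extend_impacts(impact_frames, window=4):
    """Given the frames where a helmet has a labeled impact, return every
    frame within +/- window of one of them, used as the positive class."""
    positives = set()
    for f in impact_frames:
        positives.update(range(f - window, f + window + 1))
    return positives
```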
This approach performed well for detecting impacts. We initially set the confidence threshold to 0.9, which gave a 0.39 public LB score after simple post-processing (without any classifier). We then lowered the threshold to catch more true positives once the classifier was plugged in. DetectoRS detected impacts well, but it took a long time to test our ideas with it.
written by Theo Viel
As I kept struggling to improve my EfficientDet models (I had no experience with object detection), I figured it would be better to go back to what I can do: classification models. The main idea was that a crop around a helmet carries the same information about whether it is an impact as the whole image. Therefore, I extracted a 64x64 crop around every box in the training data and started building models to predict whether a crop contained an impact. To tackle the class imbalance, I used the [-4, +4] extended label, as a lot of people did. After some tweaking, I had a bunch of models with 0.9+ AUC (5 folds stratified by gameplay): ResNet-18 & 34, EfficientNet-b0 -> b3.
Tricks used for the 2D models include:
- Limiting the number of boxes sampled per player at each epoch, in order to have more control over convergence
- Linear learning rate scheduling with warmup
- Removing the stride of the first layer: 64x64 is a small image size, and it's better not to decrease the network resolution too fast
- Stochastic Weight Averaging
- Classical augmentations
Then, I used @tjungblut’s detector to benchmark how bad my models were on the public leaderboard. Turns out that after some post-processing, a resnet-18 + efficientnet-b1 + efficientnet-b3 blend achieved 0.33+, which at that time was in the gold zone.
Shortly after, I merged with the rest of the team with the goal of plugging my classification models on top of a strong detection model to leverage their potential. There were about two weeks left before the end of the competition, so we first focused on plugging @Uijin's detection models into my classifiers. For a while, we couldn't beat the detector's LB of 0.39+, but after setting up a good CV scheme and improving the detection model training strategy, we reached LB 0.41+.
Around this time, I upgraded my 2D classifiers to 3D. Models now take as input frames [-8, -6, -4, -2, 0, 2, 4, 6, 8] instead of just frame 0. The pipeline was easy to adapt: the data loader was slightly modified, architectures were changed, and augmentations were removed because I was too lazy to dig into that.
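Picking the nine input frames can be sketched as follows; clamping at the video boundaries is one plausible choice, not necessarily the one used:

```python
def temporal_indices(center, n_frames, span=8, step=2):
    """Frame indices [-8, -6, ..., +8] around the center frame,
    clamped to the valid [0, n_frames) range."""
    idx = range(center - span, center + span + 1, step)
    return [min(max(i, 0), n_frames - 1) for i in idx]
```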
The first batch of 3D models is based on 3D ResNets:
- resnet-18 & resnet-34, using as target the same as the one of the middle frame (that was extended)
- resnet-18, using an auxiliary classifier to predict the impact type
- resnet-18, with a target extended to frames [-6, +6]
The only additional trick I used was getting rid of all the strides on the temporal axis. These models are made for videos longer than 9 frames, so once again I adapted the networks to my small input size.
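Dropping the temporal strides can be done generically over a 3D CNN; a sketch of the idea, assuming PyTorch-style modules:

```python
import torch.nn as nn

def remove_temporal_strides(model):
    """Set the temporal (first) stride of every 3D conv and pooling layer
    to 1, so a 9-frame clip is never downsampled along the time axis."""
    for m in model.modules():
        if isinstance(m, (nn.Conv3d, nn.MaxPool3d, nn.AvgPool3d)):
            s = m.stride if isinstance(m.stride, tuple) else (m.stride,) * 3
            m.stride = (1, s[1], s[2])
    return model
```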
They helped CV but didn't really help the public LB. In fact, our jump to 0.49 at the time came from retraining our detection model on the whole dataset and tweaking some post-processing parameters. They did, however, help the private LB, but this was after the leak so we didn't know.
After recovering from NYE, I did some state-of-the-art browsing and implemented 3 new 3D models that all outperformed the previous ones: I3D, SlowFast, and SlowOnly with OmniSource pretrained weights.
This was done on the 2nd and 3rd of January, so because of my procrastination we only had 10 submissions left to try these out. My first submissions using them gave a small LB boost, and we reached 0.50+.
Fortunately, the rest of the team worked hard on the detection models and computed their results on all of our folds. This allowed them to find a powerful set of hyperparameters, which worked in the end!
written by Chris Deotte
Building reliable validation was very important, because it was easy to overfit the public LB or a hold-out validation set of only a few videos. Our final model was configured using a full 120-video 5-fold validation. By reviewing our OOF predictions, we were able to tell what we needed to do to improve our model. One tool we used was to watch the ground truths and predictions together in all 3 views. The code is available here
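A game-grouped split of the 120 videos can be built with scikit-learn's `GroupKFold`; this is a sketch of the idea, not the exact fold code (the actual folds live in `output/folds.csv`):

```python
from sklearn.model_selection import GroupKFold

def make_video_folds(video_games, n_splits=5):
    """Assign each video a fold such that videos from the same game
    never cross the train/validation boundary.

    video_games: dict mapping video name -> game identifier.
    """
    videos = list(video_games)
    games = [video_games[v] for v in videos]
    folds = {}
    gkf = GroupKFold(n_splits=n_splits)
    for fold, (_, val_idx) in enumerate(gkf.split(videos, groups=games)):
        for i in val_idx:
            folds[videos[i]] = fold
    return folds
```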
This gave us insight into choosing hyperparameters for our detection and classification models, and into designing the post-processing.
written by Reza Vaghefi
We had to find good thresholds for both our detection and classification models. We observed that a single split is not enough, as the F1 score variation between folds is significant. Therefore, we used the CV calculated on the entire train set (5 folds) to optimize our thresholds. The following tricks improved our CV and LB:
- Different thresholds for the detection and classification models: originally we had the same threshold for both (~0.5). As the classification model got better, we lowered the detection threshold (~0.35) to include more helmet detections, and then used the classification model to classify the impacts.
- Different thresholds for Endzone and Sideline views: Endzone and Sideline videos differ in image content, box area, box size, etc. We realized that using different thresholds can improve both CV and LB. We tried different combinations, and the best CV was achieved with a threshold around 0.05 higher for Sideline than for Endzone.
- Different thresholds over time: neither the detection nor the classification model has any information about elapsed time. From the training set, we know that the chance of impact decreases as the frame number increases. We tried different schemes (fixed, piecewise, and linear) and ended up using a piecewise method with different thresholds for frame >= 150 and frame < 150.
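Put together, the view- and time-dependent thresholding can be sketched like this; the 0.35 detection base and the frame-150 cutoff come from the description above, while the exact offsets are illustrative rather than the tuned values:

```python
def impact_threshold(view, frame, base=0.35):
    """Piecewise threshold: Sideline runs ~0.05 higher than Endzone, and
    late frames (>= 150), where impacts are rarer, use a stricter cut.
    The offset magnitudes here are illustrative, not the tuned values."""
    t = base + (0.05 if view == "Sideline" else 0.0)
    if frame >= 150:
        t += 0.05  # assumed offset for the late-frame regime
    return t
```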
Increasing the size of the bounding boxes helped our models, for some reason. In our last submission, we made the bounding boxes 1.22x bigger.
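Expanding a box about its center, with 1.22 as the factor from the final submission, can be sketched as:

```python
def expand_box(box, factor=1.22):
    """Scale an (x1, y1, x2, y2) box by `factor` around its center."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw, hh = (x2 - x1) / 2 * factor, (y2 - y1) / 2 * factor
    return (cx - hw, cy - hh, cx + hw, cy + hh)
```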
The goal of this PP was to cluster detections (through multiple frames) which belong to the same player and remove FP. We used an IOU threshold and frame-difference threshold to optimize our algorithm. We also tried temporal NMS and soft-NMS but our custom algorithm performed better.
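A minimal sketch of this clustering idea, greedily keeping the highest-scoring detection per cluster; the IoU and frame-gap thresholds here are illustrative, not the optimized ones:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def cluster_detections(dets, iou_thr=0.35, max_gap=4):
    """Greedily cluster (frame, box, score) detections: boxes that overlap
    across nearby frames are assumed to be the same player, and only the
    highest-scoring detection of each cluster is kept."""
    dets = sorted(dets, key=lambda d: -d[2])  # highest score first
    kept = []
    for frame, box, score in dets:
        if any(abs(frame - kf) <= max_gap and iou(box, kb) >= iou_thr
               for kf, kb, _ in kept):
            continue  # same cluster as an already-kept detection
        kept.append((frame, box, score))
    return kept
```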
Impacts detected with a confidence lower than T were removed if no impact was found in the other view. That's the best we came up with regarding merging information from both views.
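The cross-view check can be sketched as follows; the confidence cutoff `t` and the frame tolerance are hypothetical values for illustration:

```python
def cross_view_filter(endzone, sideline, t=0.6, frame_tol=4):
    """Low-confidence impacts (score < t) survive only if the other view
    also has an impact at a nearby frame. Detections are (frame, score)."""
    def keep(dets, other):
        other_frames = [f for f, _ in other]
        return [(f, s) for f, s in dets
                if s >= t or any(abs(f - of) <= frame_tol for of in other_frames)]
    return keep(endzone, sideline), keep(sideline, endzone)
```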
We selected two submissions: best public LB and best CV.
- CV 0.4885 - Public 0.5408 (10th) - Private 0.4873 (13th)
- CV 0.5125 - Public 0.4931 (13th) - Private 0.5153 (9th)
Our competition inference notebook is available here.
- Competition data is available on the Kaggle competition page
- Model weights are available on this Kaggle dataset
The repository organization and its main components are detailed below:
- `mmaction2/` : MMAction2 repository, included to avoid fully installing the package, which can be a bit tricky.
- `notebooks/` : Notebooks to perform data preparation, training & inference
  - `Data preparation.ipynb` : Extracts images from videos
  - `Data preparation Cls 3D.ipynb` : Extracts 3D helmet crops from images
  - `Inference Cls.ipynb` : Performs inference
  - `Training Cls 3D.ipynb` : Trains a 3D classification model
- `output/` : Outputs of the training
  - `22_12/` : Results of the detection model
  - `folds.csv` : Cross-validation folds
  - `df_preds.csv` : 3D classifier predictions on the best detection model
- `src/` : Source code
  - `data/` : Datasets and data preparation functions
  - `inference/` : Functions for inference
  - `model_zoo/` : 3D classification models
  - `post_processing/` : Adjacency, view and expansion post-processing
  - `training/` : Training functions
  - `utils/` : Logger, torch utils and metrics
  - `configs.py` : Model configs of the final blend
  - `params.py` : Global parameters
Training a model is done in the notebooks. Paths have to be updated in `src/params.py`.
- First download the competition data from Kaggle
  - Specify `DATA_PATH` and `TRAIN_VID_PATH` according to where you store the data
- Run the `Data preparation.ipynb` notebook
  - This extracts the frames from the videos and computes a training dataframe
  - Specify `IMG_PATH_F` according to where you want to save the images
- Run the `Data preparation Cls 3D.ipynb` notebook
  - This extracts 3D crops around the training helmets
  - Specify `CROP_PATH_3D` according to where you want to save the 3D crops
- Run the `Training Cls 3D.ipynb` notebook
  - Specify the `Config` you want to use. The ones used in the final ensembles are in `src/configs.py`
  - Specify `LOG_PATH_CLS_3D` according to where you want to log results
Evaluation of the model is done in the `Inference Cls.ipynb` notebook. You can re-use the pre-computed predictions `preds.csv`, or compute new ones by specifying `CP_FOLDER` and `configs` in the `Classifier 3D inference` section.