Baseline method for VideoNet Competition
ffmpeg (>=2.8.15)
ffprobe (>=2.8.15)
python+OpenCV 3.4.1
Download the datasets from the VideoNet website and put them in the root folder.
Use download_video.py from each subset.
We extract one frame every 60 frames for efficiency. You can adjust the step with --step.
python3 data_processing.py root output_dir --mode {train, val, test} --step 60
python3 extract_keyframe.py root output_dir --mode {train, val, test}
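As a rough illustration of what the --step option controls, here is a minimal sketch of step-based frame sampling with OpenCV. This is not the actual data_processing.py; the paths and filenames are placeholders.

```python
# Minimal sketch of step-based frame sampling (illustrative, not the repo's data_processing.py).
import os
import cv2

def extract_frames(video_path, output_dir, step=60):
    """Save one frame out of every `step` frames of the video."""
    os.makedirs(output_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(os.path.join(output_dir, "frame_%06d.jpg" % index), frame)
        index += 1
    cap.release()

# Placeholder paths; in the baseline the step defaults to 60.
extract_frames("example.mp4", "./frames", step=60)
```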
All the models used in the baseline and the submission sample can be downloaded from Baidu Yun (Code: 78yn) or Google Drive.
Our baseline uses a Temporal Segment Network (TSN) to predict the event class of each video. For more details, refer to the paper and the original repo.
PyTorch >= 0.4.1
We train our model on RGB frames only. The model is finetuned from a Kinetics-pretrained model.
python3 generate_event_list.py root keyframe_dir --mode {train, val, test}
Download VideoNet_bninception__rgb_model_best.pth.tar and put it under the event folder.
python3 test_models.py VideoNet RGB ./split/test_list.txt VideoNet_bninception__rgb_model_best.pth.tar --arch BNInception --save_scores event_results.json --gpus 0 1 2 3 -j 4
The results will be saved in event_results.json.
Download kinetics_tsn_rgb.pth.tar and put it under ./tf_model_zoo.
python3 main.py VideoNet RGB ./split/train_list.txt ./split/val_list.txt --arch BNInception --num_segments 3 --gd 20 --lr 0.001 --lr_steps 30 60 --epochs 80 -b 128 -j 8 --dropout 0.8 --snapshot_pref VideoNet_bninception_ --gpus 0 1 2 3
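For reference, TSN with --num_segments 3 splits each video into equal segments, samples one snippet per segment, and averages the per-snippet class scores (the segmental consensus) to obtain the video-level prediction. Below is a minimal sketch of that sampling scheme; it is illustrative only and not the repo's dataset code.

```python
# Minimal sketch of TSN-style segment sampling (illustrative, not the repo's implementation).
import numpy as np

def sample_segment_indices(num_frames, num_segments=3, random_shift=True):
    """Split the frame range into `num_segments` equal chunks and pick one frame index per chunk."""
    seg_len = num_frames // num_segments
    if seg_len == 0:
        return [0] * num_segments
    offsets = (np.random.randint(seg_len, size=num_segments) if random_shift
               else np.full(num_segments, seg_len // 2))
    return [i * seg_len + int(offsets[i]) for i in range(num_segments)]

# At test time, the per-segment scores are averaged before taking the argmax
# as the predicted event class.
print(sample_segment_indices(180, num_segments=3, random_shift=False))  # e.g. [30, 90, 150]
```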
Our baseline uses a PlacesCNN with a ResNet-50 backbone to predict the scene classes of keyframes. For more details, refer to the paper and the original repo.
PyTorch >= 0.4.1
We finetune our model from Places365.
python3 generate_scene_list.py root keyframe_dir --mode {train, val, test}
Download resnet50_best.pth.tar and put it under the scene folder.
python run_placesCNN.py scene_label_list_test.txt weights --arch resnet50
The results will be saved in the ./scene_results folder.
Download resnet50_places365.pth.tar and put it under the scene folder.
python train_placesCNN.py -a resnet50 --train scene_label_list_train.txt --val scene_label_list_val.txt
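Conceptually, fine-tuning from Places365 means loading the Places365-pretrained ResNet-50 and replacing its 365-way classifier with a new scene head. train_placesCNN.py handles this itself; the sketch below only illustrates the idea, assuming the official Places365 checkpoint layout (a "state_dict" entry with DataParallel "module." prefixes) and a placeholder class count.

```python
# Minimal sketch of preparing a ResNet-50 for fine-tuning on the VideoNet scene labels
# (illustrative; train_placesCNN.py in the repo handles this itself).
import torch
import torchvision.models as models

NUM_SCENE_CLASSES = 200  # placeholder -- use the actual number of scene classes

model = models.resnet50(num_classes=365)  # same head size as Places365
checkpoint = torch.load("resnet50_places365.pth.tar", map_location="cpu")
# Assumption: the checkpoint stores a "state_dict" with DataParallel "module." prefixes.
state_dict = {k.replace("module.", ""): v for k, v in checkpoint["state_dict"].items()}
model.load_state_dict(state_dict)

# Replace the 365-way Places head with a new scene classifier and fine-tune end to end.
model.fc = torch.nn.Linear(model.fc.in_features, NUM_SCENE_CLASSES)
```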
Our baseline uses YOLOv3 to detect the object classes in keyframes. For more details, refer to the paper and the original repo.
PyTorch = 1.1.0
We finetune our model from the pretrained Darknet-53 backbone (darknet53.conv.74).
python3 convert_yolo_format.py root keyframe_dir --mode {train, val, test}
Download yolov3_ckpt_6.pth and put it under the object folder.
python3 detect.py --weights_path checkpoints/yolov3_ckpt_6.pth --model_def config/yolov3-custom.cfg --image_list test_list.txt --class_path objects_en.txt
The results will be saved in the ./object_results folder, with one result per image.
For training, you may need to modify the paths in config/custom.data. You also need to modify line 65 in utils/datasets.py so that each keyframe path is mapped to its corresponding label file:
path.replace(train_frame_folder, './train_label/').replace(val_frame_folder, './val_label/').replace(".png", ".txt").replace(".jpg", ".txt")
Also download darknet53.conv.74 and put it under ./weights.
python3 train.py --data_config config/custom.data --pretrained_weights weights/darknet53.conv.74 --model_def config/yolov3-custom.cfg
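convert_yolo_format.py is expected to write Darknet/YOLO-style label files: one .txt per keyframe, with one line per box in the form `class x_center y_center width height`, all normalized to [0, 1]. Below is a minimal sketch of that conversion; the function name and example values are illustrative.

```python
# Minimal sketch of converting a pixel-coordinate box to a Darknet/YOLO label line
# (illustrative; convert_yolo_format.py in the repo performs the actual conversion).
def box_to_yolo_line(class_idx, x_min, y_min, x_max, y_max, img_w, img_h):
    x_center = (x_min + x_max) / 2.0 / img_w
    y_center = (y_min + y_max) / 2.0 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return "%d %.6f %.6f %.6f %.6f" % (class_idx, x_center, y_center, width, height)

# Example: a 100x200 pixel box at (50, 80) in a 1280x720 keyframe, class index 3.
print(box_to_yolo_line(3, 50, 80, 150, 280, 1280, 720))
```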
Run generate_submission.py to generate the submission as baseline.json (the submission sample file can be downloaded from Baidu Yun or Google Drive). The two thresholds filter out low-confidence outputs, since some images contain no scenes or objects from our label lists.
python3 generate_submission.py root --scene_thres 0.7 --object_thres 0.9
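The exact submission layout is defined by the sample file, but the thresholding itself is simple: predictions whose confidence falls below --scene_thres / --object_thres are dropped, so an image with no scene or object from the label lists ends up with an empty prediction list. A minimal, hypothetical sketch:

```python
# Minimal sketch of the confidence filtering applied during submission generation
# (illustrative only; the real script follows the submission sample format).
def filter_predictions(predictions, threshold):
    """Keep (label, confidence) pairs whose confidence reaches the threshold."""
    return [(label, conf) for label, conf in predictions if conf >= threshold]

scene_preds = [("kitchen", 0.82), ("office", 0.31)]
print(filter_predictions(scene_preds, 0.7))  # -> [('kitchen', 0.82)]
```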
The results of our baseline:
| | Event | Object | Scene | Total Score |
| --- | --- | --- | --- | --- |
| Validation | 77.30 | 25.58 | 55.37 | 0.4784 |
| Test | 78.17 | 23.42 | 55.29 | 0.4712 |
These results use object_thres = 0.9 and scene_thres = 0.7.