This is an unofficial implementation of the CVPR 2020 paper "Multimodal Categorization of Crisis Events in Social Media".
Abavisani, Mahdi, et al. "Multimodal categorization of crisis events in social media." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
To cite the paper:
@inproceedings{abavisani2020multimodal,
title={Multimodal categorization of crisis events in social media},
author={Abavisani, Mahdi and Wu, Liwei and Hu, Shengli and Tetreault, Joel and Jaimes, Alejandro},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={14679--14689},
year={2020}
}
This implementation follows the original paper wherever possible. Because we needed experiment results urgently, we have not had time to make it highly configurable or to give it clean handlers.
- Initialize by running `bash setup.sh`.
- Run the pipeline with `python main.py`.
We applied mixed-precision training, so it runs fast on GPUs with Tensor Cores (e.g. V100). The default configuration consumes about 13 GB of GPU memory, and each epoch takes about 3 minutes on an Amazon g4dn.xlarge instance (with a T4 GPU).
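For reference, this is the standard `torch.cuda.amp` pattern for mixed-precision training; the model, optimizer, and criterion below are placeholders, not the exact objects used in this repo.

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Placeholder objects: substitute the real model, optimizer, and criterion.
model = torch.nn.Linear(512, 8).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()
scaler = GradScaler()

def train_step(features, labels):
    optimizer.zero_grad()
    with autocast():                   # forward pass runs in mixed precision
        logits = model(features)
        loss = criterion(logits, labels)
    scaler.scale(loss).backward()      # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```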
Warning: the model is saved after every epoch, which consumes about 400 MB of disk space every 3 minutes. Take this into account.
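If disk space is a concern, one option is to overwrite a single checkpoint file instead of keeping one per epoch. A minimal sketch, assuming a `model` and `optimizer` like the placeholders above:

```python
import torch

def save_checkpoint(model, optimizer, epoch, path="checkpoint_latest.pt"):
    # Overwrites the same file each epoch, so disk usage stays bounded.
    torch.save({
        "epoch": epoch,
        "model_state": model.state_dict(),
        "optimizer_state": optimizer.state_dict(),
    }, path)
```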
After obtaining a multimodal representation that incorporates both visual and textual information, the authors use fully-connected layers to perform classification. In the paper they write that they "add self-attention in the fully-connected networks." We assumed this means adding a fully-connected layer that acts as self-attention; see the sketch below.
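A minimal sketch of that interpretation: a fully-connected layer produces attention weights over the fused multimodal feature vector, which then gate the features before the classifier. This reflects our reading, not the authors' exact design, and all layer sizes are placeholders.

```python
import torch
import torch.nn as nn

class AttentiveClassifier(nn.Module):
    """FC 'self-attention' over the fused multimodal feature, then classification."""

    def __init__(self, feat_dim=1024, num_classes=8):
        super().__init__()
        self.attention = nn.Linear(feat_dim, feat_dim)  # FC layer acting as self-attention
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_classes),
        )

    def forward(self, fused):                          # fused: (batch, feat_dim)
        weights = torch.softmax(self.attention(fused), dim=-1)
        attended = fused * weights                     # element-wise gating of the features
        return self.classifier(attended)
```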
The authors did not give the size of the DenseNet they used.
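For illustration, a DenseNet backbone can be loaded from torchvision; DenseNet-121 below is just one plausible choice, since the paper does not specify the size.

```python
import torch.nn as nn
from torchvision import models

# DenseNet-121 is an illustrative choice; the paper does not state which DenseNet was used.
densenet = models.densenet121(pretrained=True)
num_features = densenet.classifier.in_features   # 1024 for DenseNet-121
densenet.classifier = nn.Identity()              # use the backbone as a feature extractor
```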
- Setting `num_workers > 1` deadlocks the dataloader (see the workaround sketched below).
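Until the deadlock is resolved, the safe configuration is single-process loading. A sketch with placeholder data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 512), torch.randint(0, 8, (64,)))  # placeholder data

# num_workers <= 1 avoids the deadlock; values above 1 hang the loader.
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=1)
```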