Task 4 - Shared task on Multimodal Hate Speech Event Detection at CASE 2024
Hate speech detection is one of the most important aspects of event identification during political events such as invasions. In hate speech detection, the event is the occurrence of hate speech, the entity is its target, and the relationship is the connection between the two. Because multimodal content is widely prevalent across the internet, detecting hate speech in text-embedded images is especially important. Given a text-embedded image in the context of the Russia-Ukraine crisis, this task aims to automatically identify hate speech and its targets. The task has two subtasks: (i) hate speech identification and (ii) identification of the targets of hate speech. This is an ongoing challenge that was also held at CASE 2023.
Hate Speech Detection: The goal of this subtask is to identify whether a given text-embedded image contains hate speech. The text-embedded images that make up the dataset for this subtask are annotated for the presence of hate speech.
The dataset for this subtask has two labels: "Hate Speech" and "No Hate Speech".
Target Detection: The goal of this subtask is to identify the targets of hate speech in a given hateful text-embedded image. The text-embedded images are annotated for "community", "individual", and "organization" targets.
To learn more about the dataset, please refer to our paper. Sample code for both subtasks is provided in the repo.
Join our CodaLab competition here.
Training data is provided at: https://drive.google.com/drive/folders/173EJjsNblxhjACXzIWardUqCcSYtcJh0
Evaluation/Validation data is provided at: https://drive.google.com/drive/folders/1LL2OD7v2GhrmeC0j2Gm9YFCOa5vobVjc
A link for testing data is provided at: https://drive.google.com/drive/folders/1DIVebYypb2x9RJjoSeOmr5yEm5rCXt54
Every image has a unique identifier called "index". The labels for the training data are organized in the provided folder. For evaluation and testing, the submission format is described below.
If you want to extract the embedded text via OCR, you can use the Google Vision API, Tesseract, etc. In the paper that benchmarks this dataset, we used the Google Vision API to extract the text for training the models. The code can be found here.
Many participants do not have access to the Vision API; they can use the pre-extracted text from here.
Extracted text for train, test and validation data: https://drive.google.com/drive/folders/1LeGNIyYZ3Fh7RnwXsBBJDQr1-acTrw00
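If you run OCR yourself, a minimal Tesseract sketch looks like the following (this uses pytesseract, which requires the `tesseract` binary to be installed; the whitespace-cleaning step is an illustrative assumption, not the benchmark paper's pipeline, which used the Google Vision API):

```python
import re

def clean_ocr_text(raw):
    """Collapse OCR line breaks and repeated whitespace into single spaces."""
    return re.sub(r"\s+", " ", raw).strip()

def extract_text(image_path):
    """Run Tesseract OCR on one text-embedded image.

    Imports are done lazily so clean_ocr_text() stays usable even when the
    OCR dependencies are not installed.
    """
    from PIL import Image   # pip install pillow
    import pytesseract      # pip install pytesseract (plus the tesseract binary)
    return clean_ocr_text(pytesseract.image_to_string(Image.open(image_path)))
```

Participants using the pre-extracted text linked above can skip this step entirely.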
Results are accepted only through CodaLab. Submissions will be evaluated with an F1-score.
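For local sanity checks before submitting, you can compute a binary F1-score yourself. This is a minimal sketch in plain Python; the exact averaging the organizers apply (e.g. macro vs. weighted across classes) is not specified here:

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Equivalently, `sklearn.metrics.f1_score` can be used once you know the averaging mode.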
The evaluation script takes one prediction file as input. Your submission must be a JSON file, which is then zipped. Only the first file in the zip archive will be considered, so do not zip multiple files together.
IMPORTANT: The index (image name) in the JSON should be in ascending order.
For subtask A, the final prediction submission should look like the following. Make sure the hate label is given as "1" and the non-hate label as "0".
{"index": 23568, "prediction": 1}
{"index": 45865, "prediction": 0}
{"index": 98452, "prediction": 1}
Similarly, for subtask B, the final prediction submission should look like the following. Make sure the individual, community, and organization labels are given as "0", "1", and "2" respectively.
{"index": 23568, "prediction": 1}
{"index": 36987, "prediction": 2}
{"index": 45865, "prediction": 0}
IMPORTANT: The index (image name) in the JSON should be in ascending order.
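A submission file for either subtask can be produced with a short script like this one: it writes one JSON object per line sorted by ascending index and zips exactly one file. The file names are assumptions; check the CodaLab page for the expected name:

```python
import json
import zipfile

def write_submission(predictions, json_name="answer.json", zip_name="submission.zip"):
    """Write predictions as JSON lines (sorted by index) and zip the single file.

    predictions: dict mapping image index (int) -> predicted label (int).
    """
    # The organizers require indices in ascending order.
    lines = [json.dumps({"index": idx, "prediction": int(label)})
             for idx, label in sorted(predictions.items())]
    with open(json_name, "w") as f:
        f.write("\n".join(lines) + "\n")
    # Zip exactly one file: only the first file in the archive is scored.
    with zipfile.ZipFile(zip_name, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(json_name)
    return zip_name
```

Sorting by the integer `index` before writing satisfies the ascending-order requirement above for both subtasks.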
Participants in the shared task are expected to submit a paper to the workshop, although submitting a paper is not mandatory for participation. Papers must follow the CASE 2024 workshop submission instructions and will undergo regular peer review. Acceptance will depend not on the results obtained in the shared task but on the quality of the paper. Authors of accepted papers will be informed of the evaluation results of their systems before the paper submission deadline (see the important dates). All accepted papers will be published in the ACL Anthology.
Top-performing teams and best models will be invited to contribute to a special issue in journals (T.B.D.).
- Training & Evaluation data available: Nov 1, 2023
- Test data available: Nov 30, 2023
- Test start: Nov 30, 2023
- Test end: Jan 7, 2024
- System Description Paper submissions due: Jan 13, 2024
- Notification to authors after review: Jan 26, 2024
- Camera ready: Jan 30, 2024
- CASE Workshop: 21-22 Mar, 2024
- Surendrabikram Thapa (Virginia Tech, USA)
- Farhan Ahmad Jafri (Jamia Millia Islamia, India)
- Ali Hürriyetoğlu (KNAW Humanities Cluster DHLab, Netherlands)
- Hariram Veeramani (UCLA, USA)
- Kritesh Rauniyar (Delhi Technological University, India)
- Usman Naseem (James Cook University, Australia)
If you have any questions related to the competition, please contact surendrabikram@vt.edu.
If you use the dataset, please cite it as follows:
@inproceedings{bhandari2023crisishatemm,
title={CrisisHateMM: Multimodal Analysis of Directed and Undirected Hate Speech in Text-Embedded Images From Russia-Ukraine Conflict},
author={Bhandari, Aashish and Shah, Siddhant B and Thapa, Surendrabikram and Naseem, Usman and Nasim, Mehwish},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
pages={1993--2002},
year={2023}
}
All the papers submitted as shared task reports should cite the shared task as follows:
@inproceedings{thapa2023multimodal,
title={Multimodal Hate Speech Event Detection - Shared Task 4, CASE 2023},
author={Thapa, Surendrabikram and Jafri, Farhan Ahmad and H{\"u}rriyeto{\u{g}}lu, Ali and Vargas, Francielle and Lee, Roy Ka-Wei and Naseem, Usman},
booktitle={Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)},
year={2023}
}