This repository includes the PMIndiaSum dataset under data/ and scripts for monolingual, cross-lingual, and multilingual baseline models under baselines/.
Our materials are released under CC-BY-4.0: they can be freely shared and adapted as long as appropriate credit is given. Full license: https://creativecommons.org/licenses/by/4.0/. The data is derived from the PM India website, which publishes its own license at https://www.pmindia.gov.in/en/website-policies/.
Our work is published as an EMNLP 2023 Findings paper. If you use our code or corpus, please cite:
@inproceedings{urlana-etal-2023-pmindiasum,
title = "{PMI}ndia{S}um: Multilingual and Cross-lingual Headline Summarization for Languages in {I}ndia",
author = "Urlana, Ashok and
Chen, Pinzhen and
Zhao, Zheng and
Cohen, Shay and
Shrivastava, Manish and
Haddow, Barry",
editor = "Bouamor, Houda and
Pino, Juan and
Bali, Kalika",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
month = dec,
year = "2023",
address = "Singapore",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.findings-emnlp.777",
doi = "10.18653/v1/2023.findings-emnlp.777",
pages = "11606--11628",
}