This repository contains the code and other resources for the paper "Vārta: A Large-Scale Headline-Generation Dataset for Indic Languages", published in the Findings of ACL 2023.
Dataset | Pretrained Models | Finetuning | Evaluation | Citation
The Vārta dataset is available on the Huggingface Hub. We release train, validation, and test files in JSONL format. Each article object contains:
- `id`: unique identifier for the article on DailyHunt. This id will be used to recreate the dataset.
- `langCode`: ISO 639-1 language code
- `source_url`: the url that points to the article on the website of the original publisher
- `dh_url`: the url that points to the article on DailyHunt

Once recreated, each article object contains:
- `id`: unique identifier for the article on DailyHunt
- `url`: the url that points to the article on DailyHunt
- `headline`: headline of the article
- `publication_date`: date of publication
- `text`: main body of the article
- `tags`: main topics related to the article
- `reactions`: user likes, dislikes, etc.
- `source_media`: original publisher name
- `source_url`: the url that points to the article on the website of the original publisher
- `word_count`: number of words in the article
- `langCode`: language of the article
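As a quick sanity check, here is a minimal sketch of reading one of the released JSONL files with plain Python; the file path and language are only examples, substitute whichever split/language file you downloaded:

```python
import json

# Example path; point this at the split/language file you downloaded.
path = "varta/val/langwise/val_hi.json"

# Each line of the file is one article object in JSON Lines format.
with open(path, encoding="utf-8") as f:
    first = json.loads(next(f))

# The released files carry the pre-recreation fields listed above.
print(first["id"], first["langCode"])
print(first["source_url"])
print(first["dh_url"])
```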
To recreate the dataset, follow this README file.
The `train`, `val`, and `test` folders contain language-specific json files and one aggregated file. However, the `train` folder has multiple aggregated training files for different experiments (you will have to recreate them). The data is structured as follows:
- `train`:
  - `train.json`: large training file
  - `train_small.json`: small training file; training file for the `all` experiments
  - `train_en_1M.json`: training file for the `en` experiments
  - `train_hi_1M.json`: training file for the `hi` experiments
  - `langwise`:
    - `train_<lang>.json`: large language-wise training files
    - `train_<lang>_100k.json`: small language-wise training files
- `test`:
  - `test.json`: aggregated test file
  - `langwise`:
    - `test_<lang>.json`: language-wise test files
- `val`:
  - `val.json`: aggregated validation file
  - `langwise`:
    - `val_<lang>.json`: language-wise validation files
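If you work with the Hugging Face `datasets` library, a minimal sketch of assembling per-language splits from the langwise files looks like the following. The local paths and the language code `hi` are only examples; point them at wherever you downloaded the files:

```python
from datasets import load_dataset

# Example paths mirroring the folder layout described above.
data_files = {
    "train": "varta/train/langwise/train_hi.json",
    "validation": "varta/val/langwise/val_hi.json",
    "test": "varta/test/langwise/test_hi.json",
}

# The files are in JSON Lines format, so the generic "json" loader works.
dataset = load_dataset("json", data_files=data_files)
print(dataset)
```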
Note: if you don't want to download the whole dataset and just need a single file, you can do something like:
wget https://huggingface.co/datasets/rahular/varta/raw/main/varta/<split>/langwise/<split>_<lang>.json
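Equivalently, a small sketch using the `huggingface_hub` library instead of wget; the split and language below are just examples:

```python
from huggingface_hub import hf_hub_download

# Downloads a single language-wise file from the dataset repo and
# returns the local path of the cached copy.
local_path = hf_hub_download(
    repo_id="rahular/varta",
    repo_type="dataset",
    filename="varta/test/langwise/test_hi.json",
)
print(local_path)
```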
- We release the Varta-T5 model in multiple formats.
- We release Varta-BERT only in PyTorch as a HF model (link).
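As an illustration, a hedged sketch of loading the checkpoints with `transformers`. The Hub model IDs below (`rahular/varta-bert`, `rahular/varta-t5`) are assumptions based on the repository naming; verify them against the links above:

```python
from transformers import AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer

# Assumed Hub IDs; check the model links above for the exact names.
bert_tokenizer = AutoTokenizer.from_pretrained("rahular/varta-bert")
bert_model = AutoModel.from_pretrained("rahular/varta-bert")

t5_tokenizer = AutoTokenizer.from_pretrained("rahular/varta-t5")
t5_model = AutoModelForSeq2SeqLM.from_pretrained("rahular/varta-t5")
```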
The code for all finetuning experiments reported in the paper is placed under the baselines folder.
We use the multilingual variant of ROUGE implemented for the XL-Sum paper to evaluate the headline-generation and abstractive-summarization tasks in our experiments.
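A rough scoring sketch is shown below. It assumes the multilingual ROUGE package from the XL-Sum repository is installed, which extends Google's `rouge_score` with language-aware tokenization via a `lang` argument; the exact API may differ, so verify against that repository:

```python
# Assumes the XL-Sum multilingual ROUGE package is installed.
from rouge_score import rouge_scorer

# `lang` selects language-specific tokenization/stemming (example: Hindi).
scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"], use_stemmer=True, lang="hindi"
)

reference = "..."   # gold headline or summary
prediction = "..."  # model output
scores = scorer.score(reference, prediction)
print(scores["rougeL"].fmeasure)
```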
@misc{aralikatte2023varta,
title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages},
author={Rahul Aralikatte and Ziling Cheng and Sumanth Doddapaneni and Jackie Chi Kit Cheung},
year={2023},
eprint={2305.05858},
archivePrefix={arXiv},
primaryClass={cs.CL}
}