/varta

Primary LanguagePythonApache License 2.0Apache-2.0

Vārta : A Large-Scale Headline-Generation Dataset for Indic Languages

This repository contains the code and other resources for the paper published in the Findings of ACL 2023.

Dataset | Pretrained Models | Finetuning | Evaluation | Citation

Dataset

The Vārta dataset is available on the Huggingface Hub. We release train, validation, and test files in JSONL format. Each article object contains:

  • id: unique identifier for the artilce on DailyHunt. This id will be used to recreate the dataset.
  • langCode: ISO 639-1 language code
  • source_url: the url that points to the article on the website of the original publisher
  • dh_url: the url that points to the article on DailyHunt
  • id: unique identifier for the artilce on DailyHunt.
  • url: the url that points to the article on DailyHunt
  • headline: headline of the article
  • publication_date: date of publication
  • text: main body of the article
  • tags: main topics related to the article
  • reactions: user likes, dislikes, etc.
  • source_media: original publisher name
  • source_url: the url that points to the article on the website of the original publisher
  • word_count: number of words in the article
  • langCode: language of the article

To recreate the dataset, follow this README file.

The train, val, and test folders contain language-specific json files and one aggregated file. However, the train folder has multiple aggregated training files for different experiments (you will have to recreate them). The data is structured as follows:

  • train:
    • train.json: large training file
    • train_small.json: small training file; training file for the all experiments
    • train_en_1M.json: training file for the en experiments
    • train_hi_1M.json: training file for the hi experiments
    • langwise:
      • train_<lang>.json: large language-wise training files
      • train_<lang>_100k.json: small language-wise training files
  • test:
    • test.json: aggregated test file
    • langwise:
      • test_<lang>.json: language-wise test files
  • val
    • val.json: aggregated validation file
    • langwise:
      • val_<lang>.json: language-wise validation files

Note: if you don't want to download the whole dataset, and just want one file, you can do something like

wget https://huggingface.co/datasets/rahular/varta/raw/main/varta/<split>/langwise/<split>_<lang>.json

Pretrained Models

  • We release the Varta-T5 model in multiple formats:
  • We release Varta-BERT only in pytorch as a HF model (link)

The code for:

  • Pretraining Varta-T5: follow the README here
  • Pretraining Varta-BERT follow the README here

Finetuning Experiments

The code for all finetuning experiments reported in the paper is placed under the baselines folder.

  • Extractive Baselines: follow the README here
  • Transformer Baselines: follow the README here

Evaluation

We use the multilingual variant of ROUGE implemented for the xl-sum paper for the evaluations of the headline generation and abstractive summarization tasks in our experiments.

Citation

@misc{aralikatte2023varta,
      title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages}, 
      author={Rahul Aralikatte and Ziling Cheng and Sumanth Doddapaneni and Jackie Chi Kit Cheung},
      year={2023},
      eprint={2305.05858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}