Code to extract a multilingual parallel corpus from the Press Information Bureau (PIB) website.

PIB Crawler

Overview

This repository houses a Flask application, built incrementally, to extract aligned sentences across multiple languages given a translation system.

The application was originally built to crawl and store the multilingual news articles available on the Press Information Bureau website. It can, however, be repurposed to prototype, inspect and build for other multilingual sources as well.

We need the web application for the following reasons:

  1. Multilingual samples need their alignment and retrieval verified, which is easy to do through a web interface.
  2. The data is best stored in a DBMS, given its nature and the need for efficient incremental updates.
  3. Tokenization and other under-the-hood processing have to be run repeatedly, but should stay hidden from the user, layman or expert, so that simple feedback can be gathered.
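
To make the second point concrete, here is a minimal sketch of incremental storage, assuming SQLite through Python's standard `sqlite3` module. The `entry` table, its columns, and the `open_store` and `store_new` helpers are hypothetical names chosen for illustration; they are not taken from this repository's actual schema.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a store. The schema is illustrative, not the repo's own."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entry ("
        " article_id INTEGER NOT NULL,"
        " lang TEXT NOT NULL,"
        " content TEXT NOT NULL,"
        " PRIMARY KEY (article_id, lang))"
    )
    return conn


def count_entries(conn):
    """Return the number of stored (article_id, lang) rows."""
    return conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]


def store_new(conn, articles):
    """Insert only unseen (article_id, lang) pairs and return how many were added."""
    before = count_entries(conn)
    # INSERT OR IGNORE skips rows that would violate the primary key,
    # so re-running a crawl does not duplicate articles already stored.
    conn.executemany(
        "INSERT OR IGNORE INTO entry (article_id, lang, content) VALUES (?, ?, ?)",
        articles,
    )
    conn.commit()
    return count_entries(conn) - before


conn = open_store()
first_run = store_new(conn, [(1, "en", "hello"), (1, "hi", "namaste")])
second_run = store_new(conn, [(1, "en", "hello"), (2, "en", "new article")])
```

With a primary key on the article and language, a repeated crawl only adds what is new, which is the kind of incremental update a flat file makes awkward.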

Installation

# --user is optional
python3 -m pip install -r requirements.txt --user

After installing the required packages, run the following script to download the PIB database containing the crawled articles. The script also downloads the pretrained multilingual model used for alignment.

bash scripts/get-resources.sh

Usage

Once the database and pretrained model are in place, run the following command to extract the parallel corpus from the database.

bash scripts/export-parallel-corpus.sh
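
The internals of the export script are not shown here. As a rough illustration of the general idea of translation-based sentence alignment, the sketch below translates each source sentence into the target language and pairs it with the most similar target sentence. Everything in it is an assumption made for illustration: `translate` is a hypothetical stand-in for the pretrained model, and token-overlap (Jaccard) similarity with greedy matching is a simplification, not necessarily the scoring this repository uses.

```python
def jaccard(a, b):
    """Token-overlap similarity between two sentences, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def align(src_sentences, tgt_sentences, translate, threshold=0.5):
    """Greedily pair each source sentence with its best unused target sentence.

    `translate` is a hypothetical callable mapping a source sentence to a
    target-language string; in practice it would wrap the NMT model.
    """
    pairs = []
    used = set()
    for src in src_sentences:
        hypothesis = translate(src)
        best_index, best_score = None, 0.0
        for index, tgt in enumerate(tgt_sentences):
            if index in used:
                continue
            score = jaccard(hypothesis, tgt)
            if score > best_score:
                best_index, best_score = index, score
        # Keep the pair only when the match is strong enough.
        if best_index is not None and best_score >= threshold:
            used.add(best_index)
            pairs.append((src, tgt_sentences[best_index], best_score))
    return pairs


# Toy "translation system": a lookup table with romanized source sentences.
toy_model = {
    "namaste duniya": "hello world",
    "yah ek parikshan hai": "this is a test",
}
english = ["this is a test", "hello world", "an unrelated line"]
pairs = align(list(toy_model), english, lambda s: toy_model.get(s, ""))
```

The threshold drops sentences with no plausible counterpart, which matters because articles in two languages rarely match sentence for sentence.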

Resources

  1. The CVIT-PIB and CVIT-MKB (Mann-Ki-Baat) datasets are available here.
  2. The database containing the crawled news articles, from which the parallel corpus is extracted.
  3. The Multilingual NMT model used for sentence alignment and the associated vocabulary files.
  4. We additionally release a multilingual model augmented with the PIB corpus.

Publications

If you use CVIT-PIB or CVIT-MKB, please cite our paper:

@inproceedings{siripragada-etal-2020-multilingual,
    title = "A Multilingual Parallel Corpora Collection Effort for {I}ndian Languages",
    author = "Siripragada, Shashank and Philip, Jerin and Namboodiri, Vinay P. and Jawahar, C V",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference",
    month = may,
    year = "2020",
    address = "Marseille, France",
    publisher = "European Language Resources Association",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.462",
    pages = "3743--3751",
    language = "English",
    ISBN = "979-10-95546-34-4",
}