
English to Odia/Oriya parallel corpus of phrases

Primary language: Jupyter Notebook
License: GNU General Public License v3.0 (GPL-3.0)

Important Notice:

This repository is not under active development and it has been merged into its parent repository, MTEnglish2Odia.

English-Odia

This repository is a fork of the MTEnglish2Odia repository. It contains an English (language code: en) to Odia/Oriya (language code: or) parallel corpus of phrases.
The data is split across multiple files according to its source.
Combined file: consolidated_full_corpus.txt
Sample structure:

<english phrase>||<odia phrase>

For example:

urban development planning||ସହରାଞ୍ଚଳ ବିକାଶ ଯୋଜନା
Family||ପରିବାର
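The line format above can be read with a few lines of Python. This is only a minimal sketch, not code taken from this repository: the function name parse_pairs is hypothetical, and it assumes that every pair sits on its own line and that "||" never occurs inside the English phrase.

```python
from typing import Iterable, Iterator, Tuple

SEPARATOR = "||"  # delimiter shown in the sample structure above


def parse_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (english, odia) tuples from lines in <english>||<odia> form.

    Hypothetical helper: blank lines and lines without the separator
    are skipped rather than reported.
    """
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # partition splits at the first "||" only
        english, sep, odia = line.partition(SEPARATOR)
        if not sep:
            continue  # no separator found on this line
        yield english.strip(), odia.strip()


sample = [
    "urban development planning||ସହରାଞ୍ଚଳ ବିକାଶ ଯୋଜନା",
    "Family||ପରିବାର",
]
pairs = list(parse_pairs(sample))
for english, odia in pairs:
    print(f"{english} -> {odia}")
```

To read the combined file instead of the in-memory sample, pass an open file handle, for example open("consolidated_full_corpus.txt", encoding="utf-8"). The UTF-8 encoding is an assumption here.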

Current corpora statistics

  • 4500+ cleaned en-or parallel pairs (growing every weekend)
  • ~50,000 uncleaned pairs

Referred articles/websites:

Data Collected from:

Prospective data corpus

These are a few places where relevant data may be available; however, obtaining the data is not straightforward.

  • EMILLE Project : The Oriya written corpus consists of data incorporated from the CIIL Corpus, originally gathered by the Institute of Applied Language Sciences, Bhubaneshwar (approximately 2,730,000 words).
  • Gyan Nidhi-TDIL : A million-page multilingual parallel text corpus in English and 11 Indian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Marathi, Malayalam, Oriya, Punjabi, Tamil & Telugu), based on Unicode encoding. The Gyan Nidhi corpus contains text in the form of books. These books contained a number of diagrams, figures, charts and other special symbols, which were removed from the text using automated and manual tools. The text in Gyan Nidhi is in the form of paragraphs, which are converted into short sentences.

Contributors

How can I contribute to this repository?

What can I contribute?

  • You can send English-Odia word/phrase/sentence pairs in the format below, in a new file named after you and the type of data. For example, if your name is Satyabrata and you want to upload generic phrases:

    Key                 Example
    Filename            satyabrata.txt
    File upload path    data/Individual_files/satyabrata.txt
    File text format    `Why are you so lazy?

Please make sure you have the right to publish this data under the GPL license.

  • Tutorials on how to fork a repository and send a pull request (PR) can be found in this video or this video, in this GitHub doc tutorial for forking, and in this one for pull requests.
  • Your Pull Request will be reviewed first.
  • Please follow up if any comments or modifications are needed on your Pull Request.
  • If anything is unclear, please contact proud_odia@outlook.com. You will get a response within a day or two.
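Before opening a pull request, a contributor could check a file against the <english phrase>||<odia phrase> format. The sketch below is not part of this repository and does not describe how maintainers actually review submissions. The function name find_problems and its three checks are assumptions chosen for illustration: exactly one separator per line, both sides non-empty, and at least one character from the Unicode Oriya block (U+0B00 to U+0B7F) on the Odia side.

```python
import re
from typing import List

# Unicode Oriya block; used as a rough check that the right-hand side is Odia text
ODIA_CHAR = re.compile(r"[\u0B00-\u0B7F]")


def find_problems(lines: List[str]) -> List[str]:
    """Return human-readable problems found in contribution lines.

    Hypothetical helper; the checks are illustrative assumptions,
    not rules stated by the repository.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.count("||") != 1:
            problems.append(f"line {number}: expected exactly one '||' separator")
            continue
        english, odia = (part.strip() for part in line.split("||"))
        if not english or not odia:
            problems.append(f"line {number}: empty English or Odia side")
        elif not ODIA_CHAR.search(odia):
            problems.append(f"line {number}: no Odia script characters after '||'")
    return problems


print(find_problems(["Family||ପରିବାର", "Family", "Family||family"]))
```

An empty list means no problems were detected by these checks. It does not say anything about translation quality, which still needs human review.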
