parallelbibles

Word-alignment models for Bible translations in 100+ historical and contemporary languages

Requirements

  1. Installation and dependencies:

    • Download or clone the repository:

      $ git clone https://github.com/npedrazzini/parallelbibles

    • From the root directory (./parallelbibles), build the repository:

      $ make

    This will download and build SyMGIZA++ [1] and install all the required dependencies in a venv called parallels-venv.

  2. XML files in one of two formats, OPUS or PROIEL:

    This repository comes with OPUS XMLs (inside original-xmls/opus-xmls) and PROIEL XMLs for New Testament Greek, Old Church Slavonic and Gothic (inside original-xmls/proiel-xmls).

Train word-alignment models

This repository already comes with four pre-trained models. Check them out!

$ ./train.sh

This step will:

  1. convert OPUS/PROIEL XML files to GIZA-readable CSV files
  2. train a word-alignment model for each target language
  3. make GIZA's outputs easily readable and queryable

You will be prompted to:

  1. specify the input XML format (OPUS, PROIEL, or mixed)
  2. enter the desired source language
  3. enter the target languages (or use all remaining languages as targets)
  4. specify whether you want to strip punctuation
  5. specify whether you want to convert all text to lowercase
  6. provide a name for your model

NB: the chosen languages must be entered as ISO 639-3 codes. See the ISO 639-3 code tables for the complete list and the table below for the languages included in the models.
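The punctuation and lowercase prompts above amount to a text normalization step applied before alignment. The following is a minimal sketch of what such a step could look like, not the code used by train.sh: the function name normalize_verse and its parameters are hypothetical, and treating every Unicode punctuation character (including apostrophes) as strippable is an assumption.

```python
import unicodedata


def normalize_verse(text, lowercase=True, strip_punct=True):
    """Hypothetical helper mirroring the 'strip punctuation' and 'lowercase' prompts."""
    if strip_punct:
        # Drop every character whose Unicode category starts with "P"
        # (punctuation). This also covers non-ASCII marks such as guillemets.
        text = "".join(
            ch for ch in text if not unicodedata.category(ch).startswith("P")
        )
    if lowercase:
        text = text.lower()
    # Collapse any runs of whitespace left behind by removed punctuation.
    return " ".join(text.split())


print(normalize_verse("In the beginning, God created the heaven and the earth."))
```

With both options enabled, this corresponds to the settings suggested by a model name such as model2-LC-NP (lower case, no punctuation). Passing lowercase=False or strip_punct=False would correspond to the UC and P variants respectively.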

Extract words and their translations

$ ./extract.sh

This step will:

  1. extract every occurrence of a word (or multiple words) in the source language and its translation in the target languages.
  2. (optionally) generate scripts to run multidimensional scaling (MDS) on the dataset and Kriging (to draw lines around clusters probabilistically)

You will be prompted to enter:

  1. the name of the model you want to use (e.g. 'model2-LC-NP')
  2. a target word (e.g. 'when') or multiple target words separated by hyphen (e.g. 'when-while-since')
  3. whether you want to generate the scripts necessary to run MDS on the dataset ('yes' or 'no')
  4. whether you also want to apply Kriging to the MDS maps ('yes' or 'no')
  5. whether you only want to extract words from the New Testament ('yes') or from both the Old and the New Testament ('no') *

The output will be a folder named after the target word (or words, hyphen-separated, if extracting multiple words at once) containing the following:

  1. word.csv: a CSV file for each word, with one occurrence per line together with its citation (Bible verse), its context, and its translation in each target language **.

And if you chose to run MDS (with and without Kriging) it will also contain:

  1. word-MDS.R: an R script to run MDS (and Kriging, if you chose to), generating a single PDF with one map per language. These maps are static and generated using base R. Best for distant-reading stages in the data exploration ***.
  2. word-plotly.R: an R script (alternative to word-MDS.R) generating multiple HTML files using the R package plotly. These maps are interactive and let you hover over the data points and look at the citation (Bible verse) and source word in context. Best for close-reading stages in the data exploration.
  3. word-data.txt: the original data in TXT format, with the citation (Bible verse) as index (rather than as a column, as in word.csv) and without the 'context' column.
  4. word-matrix.txt: distance matrix between source word and target words.

* This is because many languages lack all or large sections of the Old Testament, which will result in your dataset having many NAs (which you may or may not want to avoid).

** NB: NULL indicates that the model did not find a match for the word in the target language. NA indicates that the target language did not have a translation of that particular verse in the first place (e.g. some languages lack a translation of the whole Old Testament).
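Because NULL and NA mean different things, it can be useful to count them separately before any analysis. The sketch below assumes a word.csv layout with a 'citation' column, a 'context' column, and one column per target language. The column names, the function count_missing and the sample rows are all invented for illustration.

```python
import csv
import io
from collections import Counter


def count_missing(csv_text, language_columns):
    """Count translated, NULL (no alignment) and NA (verse absent) cells per language."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {lang: Counter() for lang in language_columns}
    for row in reader:
        for lang in language_columns:
            value = row[lang]
            if value == "NULL":
                counts[lang]["unaligned"] += 1      # model found no match
            elif value == "NA":
                counts[lang]["verse_missing"] += 1  # verse not translated at all
            else:
                counts[lang]["translated"] += 1
    return counts


# Toy rows, invented for illustration only (not taken from any model output).
SAMPLE = """citation,context,lang_a,lang_b
verse-1,toy context one,foo,NULL
verse-2,toy context two,NULL,NA
verse-3,toy context three,bar,baz
"""

for lang, tally in count_missing(SAMPLE, ["lang_a", "lang_b"]).items():
    print(lang, dict(tally))
```

The standard csv module keeps "NULL" and "NA" as plain strings. If you load the file with pandas instead, note that read_csv treats both strings as missing values by default, so the distinction is lost unless you pass keep_default_na=False.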

*** NB 1: This script is a heavy adaptation of the code by [2].

NB 2: The lmap function relies on the R package qlcVisualize. If you have issues installing it, simply save the two functions we need from that package by running the script ./scripts/postprocessing/lmap-boundary-functions.R included in this repository.

NB 3: The MDS script has been adapted so that it merges all translations with fewer than 10 occurrences with NULLs. The threshold of 10 is arbitrary: it was based on what seemed to be a common cut-off point between 'real' translations in the target language and chance correspondences between the source word and a specific lexical item in the target language.
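The merging of rare translations with NULLs described in NB 3 can be illustrated as follows. This is a Python sketch of the idea only, not a port of the R script: the function name merge_rare_translations is hypothetical, and leaving NA values untouched is an assumption.

```python
from collections import Counter


def merge_rare_translations(translations, min_count=10, null_label="NULL", na_label="NA"):
    """Replace translations occurring fewer than min_count times with null_label."""
    counts = Counter(translations)
    merged = []
    for item in translations:
        # Assumption: NA (verse not translated) is kept as is, since it is
        # not a translation choice and should not be counted as a NULL.
        if item == na_label or counts[item] >= min_count:
            merged.append(item)
        else:
            merged.append(null_label)
    return merged


# Toy column for one target language (placeholder strings, not real data).
column = ["form_a"] * 12 + ["form_b"] * 3 + ["NULL"] * 2 + ["NA"]
print(Counter(merge_rare_translations(column)))
```

Exposing the threshold as a parameter makes it easy to check how sensitive the resulting maps are to the arbitrary cut-off of 10.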

Hierarchical clusters and NeighborNets

./scripts/postprocessing/splitstree.R: this script will perform hierarchical clustering and NeighborNet analysis of the languages based on a criterion x (default: NULL-constructions).

It takes as input the file word-data.txt described above.

The script will:

  1. Plot a simple hierarchical cluster of the languages in a parallel-word dataset. It currently shows how similar languages appear to be based on NULL-construction distributions.
  2. Generate a Nexus (.nex) file for NeighborNet analysis, to be visualized with the SplitsTree4 software. A NeighborNet is similar in many ways to a traditional hierarchical cluster, but it does not force a binary-tree type of classification.
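To make the idea of comparing languages by their NULL distributions concrete, here is one possible distance measure. It is an assumption for illustration and not necessarily the measure used in splitstree.R: for each pair of languages, it takes the share of occurrences (ignoring NAs) in which exactly one of the two languages has a NULL. The function names and the toy data are hypothetical.

```python
from itertools import combinations


def null_profile_distance(col_a, col_b):
    """Share of comparable occurrences where exactly one language has NULL."""
    # Skip occurrences where either language lacks the verse altogether.
    pairs = [(a, b) for a, b in zip(col_a, col_b) if a != "NA" and b != "NA"]
    if not pairs:
        return float("nan")
    mismatches = sum((a == "NULL") != (b == "NULL") for a, b in pairs)
    return mismatches / len(pairs)


def distance_table(columns):
    """Pairwise distances between all languages, keyed by sorted language pair."""
    return {
        (l1, l2): null_profile_distance(columns[l1], columns[l2])
        for l1, l2 in combinations(sorted(columns), 2)
    }


# Toy data: one list of translations per language, aligned by occurrence.
toy = {
    "lang_a": ["x", "NULL", "x", "NA"],
    "lang_b": ["NULL", "NULL", "y", "z"],
    "lang_c": ["x", "y", "NULL", "NULL"],
}
for pair, dist in distance_table(toy).items():
    print(pair, round(dist, 3))
```

A table of this kind is the sort of input that a hierarchical clustering routine or a NeighborNet analysis works from.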

Pretrained models

NB: model2-LC-NP is stored in this repo using Git LFS. If you wish to use that model, you should have Git LFS installed; otherwise you will only see a pointer file.
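If you are unsure whether a model file was actually downloaded or is still a Git LFS pointer, you can inspect its first bytes. Git LFS pointer files are small text files that begin with a 'version https://git-lfs.github.com/spec/v1' line. The helper names below are hypothetical, and the sketch makes no assumption about where the model files live in this repository.

```python
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def looks_like_lfs_pointer(head):
    """Return True if the given leading bytes match a Git LFS pointer file."""
    return head.startswith(LFS_POINTER_PREFIX)


def is_lfs_pointer(path):
    """Check a file on disk by reading only as many bytes as the prefix needs."""
    with open(path, "rb") as handle:
        return looks_like_lfs_pointer(handle.read(len(LFS_POINTER_PREFIX)))


# Illustrative pointer content; the oid and size values are placeholders.
example = b"version https://git-lfs.github.com/spec/v1\noid sha256:<hash>\nsize <bytes>\n"
print(looks_like_lfs_pointer(example))
```

If the check returns True for a model file, installing Git LFS and running git lfs pull from the repository root should replace the pointer with the real file.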

Four pretrained models currently come with this repository:

  1. model1-UC-P: Upper Case and with Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
  2. model2-LC-NP: Lower Case and No Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
  3. model3-UC-NP: Upper Case and No Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.
  4. model4-LC-P: Lower Case and with Punctuation. English is source language. All other languages (both from OPUS and PROIEL; however see TODO) are targets.

You can directly extract target words from any of these models by running $ ./extract.sh. You will be prompted to enter the name of the model you want to use.
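The model names follow a simple pattern: a prefix, a case flag (UC or LC) and a punctuation flag (P or NP). If you script around several models, a small parser along these lines may help. The function parse_model_name and the keys it returns are hypothetical, and the pattern is inferred only from the four names listed above.

```python
def parse_model_name(name):
    """Split a name such as 'model2-LC-NP' into its preprocessing settings."""
    parts = name.split("-")
    if len(parts) != 3:
        raise ValueError(f"unexpected model name: {name!r}")
    prefix, case_flag, punct_flag = parts
    if case_flag not in ("UC", "LC") or punct_flag not in ("P", "NP"):
        raise ValueError(f"unexpected flags in model name: {name!r}")
    return {
        "model": prefix,
        "lowercased": case_flag == "LC",        # LC = Lower Case
        "punctuation_kept": punct_flag == "P",  # NP = No Punctuation
    }


print(parse_model_name("model2-LC-NP"))
```

When querying a model, the target word should be typed the way the model was trained: for an LC model, for instance, enter it in lower case.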

Languages

OT = Old Testament

NT = New Testament

ISO 639-3 | Language | Language family | OT | NT | Notes
acu | Achuar-Shiwiar | Jivaroan | N | Y |
afr | Afrikaans | Indo-European > Germanic | Y | Y |
agr | Awajún | Jivaroan | N | Y |
ake | Akawaio | Cariban | N | Y |
sqi/alb | Albanian | Indo-European | Y | Y |
amh | Amharic | Afro-Asiatic > Semitic | Y | N |
amu | Guerrero Amuzgo | Otomanguean | N | Y |
ara | Arabic | Afro-Asiatic > Semitic | Y | Y |
hye/arm | Armenian | Indo-European | Y | Y |
baq | Basque | Isolate | N | Y |
bsn | Barasana-Eduria | Tucanoan | N | Y |
bul | Bulgarian | Indo-European > Balto-Slavic | Y | Y |
cak | Kaqchikel | Mayan | N | Y |
ceb | Cebuano | Austronesian > Malayo-Polynesian | Y | Y |
cha | Chamorro | Austronesian > Malayo-Polynesian | Y | Y | OT only consists of the Psalms
zho/chi | Chinese | Sino-Tibetan > Sinitic | Y | Y |
chq | Quiotepec Chinantec | Otomanguean | N | Y |
chr | Cherokee | Iroquoian | N | Y |
chu | Church Slavonic | Indo-European > Balto-Slavic | N | Y |
cjp | Cabécar | Chibchan | N | Y |
cni | Asháninka | Maipurean | N | Y |
cop | Coptic | Afro-Asiatic > Egyptian | N | Y |
crp | Creoles and pidgins | Creole > French-based | Y | Y | The original XML files have the generic 'crp' code. This is however Haitian Creole (code hat)
cze | Czech | Indo-European > Balto-Slavic | Y | Y |
dan | Danish | Indo-European > Germanic | Y | Y |
deu | German | Indo-European > Germanic | Y | Y |
dik | Southwestern Dinka | Nilo-Saharan > Nilotic | N | Y |
dje | Zarma | Nilo-Saharan > Songhai | Y | Y |
dop | Lukpa | Niger-Congo > Atlantic-Congo | N | Y |
epo | Esperanto | Constructed | Y | Y |
est | Estonian | Uralic | Y | Y |
ewe | Ewe | Niger-Congo > Atlantic-Congo | N | Y |
fin | Finnish | Uralic | Y | Y |
fra | French | Indo-European > Italic | Y | Y |
gbi | Galela | West Papuan | N | Y |
gla | Scottish Gaelic | Indo-European > Celtic | N | Y | The only text included is the Gospel of Mark
glv | Manx | Indo-European > Celtic | Y | Y | The only text from the OT is the Book of Esther
got | Gothic | Indo-European > Germanic | N | Y |
grc | Ancient Greek (to 1453) | Indo-European | N | Y |
ell/gre | Modern Greek (1453-) | Indo-European | Y | Y |
guj | Gujarati | Indo-European > Indo-Iranian | N | Y |
heb | Hebrew | Afro-Asiatic > Semitic | Y | N |
hin | Hindi | Indo-European > Indo-Iranian | Y | Y |
hrv | Croatian | Indo-European > Balto-Slavic | Y | Y |
hun | Hungarian | Uralic | Y | Y |
ind | Indonesian | Austronesian > Malayo-Polynesian | Y | Y |
isl | Icelandic | Indo-European > Germanic | Y | Y |
ita | Italian | Indo-European > Italic | Y | Y |
jak | Jakun | Austronesian > Malayo-Polynesian | N | Y |
jap | Japanese | Japonic | Y | Y |
jiv | Shuar | Jivaroan | N | Y |
kab | Kabyle-Amazigh | Afro-Asiatic > Berber | N | Y |
kbh | Camsá | Isolate | N | Y |
kor | Korean | Koreanic | Y | Y |
lat | Latin | Indo-European > Italic | Y | Y |
lav | Latvian | Indo-European > Balto-Slavic | N | Y |
lit | Lithuanian | Indo-European > Balto-Slavic | Y | Y |
mal | Malayalam | Dravidian | Y | Y |
mam | Mam | Mayan | N | Y |
mao | Maori | Austronesian > Malayo-Polynesian | Y | Y |
mar | Marathi | Indo-European > Indo-Iranian | Y | Y |
mya | Burmese | Sino-Tibetan > Tibeto-Burman | Y | Y |
nep | Nepali | Indo-European > Indo-Iranian | Y | Y |
nhg | Tetelcingo Nahuatl | Uto-Aztecan | N | Y |
nld | Dutch | Indo-European > Germanic | Y | Y |
nor | Norwegian | Indo-European > Germanic | Y | Y |
ojb | Northwestern Ojibwa | Algic > Algonquian | N | Y |
pck | Paite Chin | Sino-Tibetan > Tibeto-Burman | Y | Y |
pes | Iranian Persian | Indo-European > Indo-Iranian | Y | Y |
plt | Plateau Malagasy | Austronesian > Malayo-Polynesian | Y | Y |
pol | Polish | Indo-European > Balto-Slavic | Y | Y |
por | Portuguese | Indo-European > Italic | Y | Y |
pot | Potawatomi | Algic > Algonquian | N | Y |
ppk | Uma | Austronesian > Malayo-Polynesian | N | Y |
quc | K'iche' | Mayan | N | Y |
quw | Tena Lowland Quichua | Quechuan | N | Y |
rom | Romany | Indo-European > Indo-Iranian | N | Y |
ron/rum | Romanian | Indo-European > Italic | Y | Y |
rus | Russian | Indo-European > Balto-Slavic | Y | Y |
shi | Tachelhit | Afro-Asiatic > Berber | N | Y |
slk | Slovak | Indo-European > Balto-Slavic | Y | Y |
slv | Slovenian | Indo-European > Balto-Slavic | Y | Y |
sna | Shona | Niger-Congo > Atlantic-Congo | Y | Y |
som | Somali | Afro-Asiatic > Cushitic | Y | Y |
spa | Spanish | Indo-European > Italic | Y | Y |
srp | Serbian | Indo-European > Balto-Slavic | Y | Y |
ssw | Swati | Niger-Congo > Atlantic-Congo | N | Y |
swe | Swedish | Indo-European > Germanic | Y | Y |
syr | Syriac | Afro-Asiatic > Semitic | N | Y |
tel | Telugu | Dravidian | Y | Y |
tgl | Tagalog | Austronesian > Malayo-Polynesian | Y | Y |
tha | Thai | Kra-Dai > Tai | Y | Y |
tmh | Tamashek | Afro-Asiatic > Berber | Y | Y |
tur | Turkish | Turkic | Y | Y |
ukr | Ukrainian | Indo-European > Balto-Slavic | N | Y |
usp | Uspanteco | Mayan | N | Y |
wal | Wolaytta | Afro-Asiatic > Omotic | N | Y |
wol | Wolof | Niger-Congo > Atlantic-Congo | N | Y |
xho | Xhosa | Niger-Congo > Atlantic-Congo | Y | Y |
zul | Zulu | Niger-Congo > Atlantic-Congo | N | Y |

TODO

  1. Include the following languages:
    a. In all models: vie, kan, djk, kek, agr, mal
    b. In model4-LC-P only: mar, mya, nep, tel
  2. Fix issue with display of some non-Latin characters in PDF output (notably all Arabic!). Note that the characters display normally in RStudio (i.e. it must be an issue with both base R pdf and CairoPDF).
  3. Add info on how NULLs are treated in the models.
  4. Add info on how many NAs there are per language, based on the best model.

References

[1] Junczys-Dowmunt, Marcin & Arkadiusz Szał. 2012. SyMGiza++: Symmetrized Word Alignment Models for Machine Translation. In Pascal Bouvry, Mieczyslaw A. Klopotek, Franck Leprévost, Malgorzata Marciniak, Agnieszka Mykowiecka & Henryk Rybinski (eds.), Security and Intelligent Information Systems (SIIS) (Lecture Notes in Computer Science 7053), 379-390. Heidelberg-Berlin: Springer.

[2] Wälchli, Bernhard. 2010. Similarity Semantics and Building Probabilistic Semantic Maps from Parallel Texts. Linguistic Discovery 8(1). 331-371. DOI:10.1349/PS1.1537-0852.A.356