christos-c/bible-corpus

K'iche' version in old orthography

Closed this issue · 8 comments

The K'iche' version you have is an old one in an old orthography. Any chance of getting a new one added, e.g. this one?

Hi @ftyers, thanks for the suggestion - I'll have a look over the weekend and will update the corpus if possible.

Hi @ftyers, I ran my corpus generator on the new source you provided. Do you mind checking the resulting file? If it looks ok, I'm going to replace the original with this one (unless you think it's worth keeping the old version too). I noticed some word changes in some of the verses, and the total number of words is smaller (206k compared to 271k in the old version).

K'iche'-NT.zip

I've just checked it briefly, and it looks much better. One comment: I would change U+2019 RIGHT SINGLE QUOTATION MARK to U+02BC MODIFIER LETTER APOSTROPHE to make processing easier (this applies to all of the Mayan languages...). The version I scraped has around 206k tokens, so that sounds about right. The differences from the older one could be a result of orthographic changes, but that's a really big reduction, so I'm not sure.
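To illustrate why the code point matters, here is a minimal Python sketch (not the corpus generator's actual code; the function name `normalize_apostrophes` is made up for this example). U+2019 is punctuation, so a word-based regex splits on it, while U+02BC is a modifier letter and stays inside the word.

```python
import re

RIGHT_SINGLE_QUOTE = "\u2019"   # ’ punctuation (general category Pf)
MODIFIER_APOSTROPHE = "\u02bc"  # ʼ modifier letter (general category Lm)


def normalize_apostrophes(text: str) -> str:
    """Replace every U+2019 with U+02BC (hypothetical helper for this sketch)."""
    return text.replace(RIGHT_SINGLE_QUOTE, MODIFIER_APOSTROPHE)


# Fragment of the new K'iche' text, written with U+2019 as in the source.
sample = "uk\u2019ojol ri Ab\u2019raham"
normalized = normalize_apostrophes(sample)

# \w matches letters (including modifier letters) but not punctuation,
# so the glottal marker only survives tokenization after normalization.
tokens_before = re.findall(r"\w+", sample)
tokens_after = re.findall(r"\w+", normalized)

print(tokens_before)
print(tokens_after)
```

Note that a blanket replacement like this would also rewrite any U+2019 that is a genuine closing quotation mark, so it is only safe if the text does not use that character for quoting.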

Could you give some examples of sentences that have more tokens in the old orthography than they do in the new? I can take a look to see if there is an explanation.

Thanks for the suggestion @ftyers, I have replaced all single quotation marks with apostrophes. I'm pasting a couple of examples of the difference in verse length between the two versions. If everything looks ok to you, I'll update the version in the corpus.

MAT.1.1
OLD: Are waˈ ri qui biˈ ri u mam ri Jesús ojer. Ri Jesús Are rachalaxic ri ka mam David xukujeˈ ri ka mam Abraham ojer.
NEW: Wuj rech ri umajib’al uloq ri Jesus Kristo, uk’ojol ri David, uk’ojol ri Ab’raham.

MAT.1.3
OLD: Ri ka mam Judá are u tat ri ka mam Fares xukujeˈ ri ka mam Zara. Ri qui nan cˈut are ri nan Tamar. Ri ka mam Fares are u tat ri ka mam Esrom. Ri ka mam Esrom are u tat ri ka mam Aram.
NEW: ri Juda xralk’uwa’laj, ruk’ ri Tamar, ri Fares e ri Sara, ri Fares xralk’uwa’laj ri Esrom, ri Esrom xralk’uwa’laj ri Aram,
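For reference, differences like these can be found with a rough sketch along the following lines. It assumes whitespace tokenization and two dictionaries keyed by verse ID; the names `token_count` and `length_differences` are made up for this example and are not part of the corpus tooling.

```python
def token_count(verse: str) -> int:
    """Count whitespace-separated tokens (punctuation stays attached to words)."""
    return len(verse.split())


old = {
    "MAT.1.1": "Are waˈ ri qui biˈ ri u mam ri Jesús ojer. Ri Jesús Are rachalaxic ri ka mam David xukujeˈ ri ka mam Abraham ojer.",
    "MAT.1.3": "Ri ka mam Judá are u tat ri ka mam Fares xukujeˈ ri ka mam Zara. Ri qui nan cˈut are ri nan Tamar. Ri ka mam Fares are u tat ri ka mam Esrom. Ri ka mam Esrom are u tat ri ka mam Aram.",
}
new = {
    "MAT.1.1": "Wuj rech ri umajib’al uloq ri Jesus Kristo, uk’ojol ri David, uk’ojol ri Ab’raham.",
    "MAT.1.3": "ri Juda xralk’uwa’laj, ruk’ ri Tamar, ri Fares e ri Sara, ri Fares xralk’uwa’laj ri Esrom, ri Esrom xralk’uwa’laj ri Aram,",
}


def length_differences(old_verses, new_verses):
    """Return (verse_id, old_count, new_count) sorted by largest reduction first."""
    shared = old_verses.keys() & new_verses.keys()
    rows = [
        (vid, token_count(old_verses[vid]), token_count(new_verses[vid]))
        for vid in shared
    ]
    return sorted(rows, key=lambda r: r[1] - r[2], reverse=True)


for verse_id, n_old, n_new in length_differences(old, new):
    print(verse_id, n_old, n_new, n_old - n_new)
```

Part of the gap comes from the old orthography writing some prefixes as separate words (e.g. `u mam`, `ka mam`), which whitespace tokenization counts as extra tokens.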

In MAT 1.1 the phrasing is very different. The old version looks more like the Spanish DHH version:

quc (partially modernised): Are wa' ri k'i b'i ri umam ri Jesús ojer. Ri Jesús Are rachalaxik ri k'a mam David xuquje' ri k'a mam Abraham ojer.
spa: Ésta es una lista de los antepasados de Jesucristo, que fue descendiente de David y de Abraham:
RBMT quc-spa:  éste el nombre mucho el abuelo de Jesús antes. Jesús  familia los *k'a abuelos David también los *k'a abuelos Abraham antes.

But the new version aligns better with the Spanish NVI:

quc: Wuj rech ri umajib’al uloq ri Jesus Kristo, uk’ojol ri David, uk’ojol ri Ab’raham.
spa NVI: Tabla genealógica de Jesucristo, hijo de David, hijo de Abraham:
RBMT quc-spa: Libro del principio aquí Jesús Cristo, hijo David, hijo Abraham.

MAT 1.3 looks similar: the same information, just in different and more compact phrasing.

Oh very interesting, thanks for the comparisons! It seems to me that there is some value in keeping the old version too? Is there a name or designation I can use to differentiate the two?

Maybe use the publication years? I think the old one was published in 1997 and the new one in 2011. Another way would be to use the publisher: the first one is Wycliffe Bible Translators / SIL, the second is Conferencia Episcopal de Guatemala. However, note that there are issues (a, b) with how SIL operates in Guatemala. You could also go with the name of the orthography, e.g. SIL or ALMG. Or all three :)

Thanks for the suggestions! I have added the new version under K'iche'-NT-AMLG and added a note in the release and updated my website. Thanks again for your help with this.