cysouw/qlcData

Context classes are ignored in transliteration

tokenize() doesn't appear to be using context classes. An example:

italian1 <- data.frame(Grapheme = c("c","c","i","o"), Right = c("","frontV","",""), Class = c("", "", "frontV", ""), IPA = c("k", "tʃ", "i", "o"))
italian2 <- data.frame(Grapheme = c("i","o","c","c"), Right = c("","","","V_front"), Class = c("V_front", "", "", ""), IPA = c("i", "o", "k", "tʃ"))
tokenize(c("cico", "coci", "coco"), profile = italian1, transliterate = "IPA", regex = TRUE)$strings
tokenize(c("cico", "coci", "coco"), profile = italian2, transliterate = "IPA", regex = TRUE)$strings
tokenize(c("cico", "coci", "coco"), profile = italian1, transliterate = "IPA", regex = TRUE, ordering = c("context", "size"))$strings

Each of the tokenize() commands above produces the output:

|   | originals | tokenized | transliterated |
|---|-----------|-----------|----------------|
| 1 | cico | c i c o | k i k o |
| 2 | coci | c o c i | k o k i |
| 3 | coco | c o c o | k o k o |

qlcData.pdf states: "context: order the lines by whether they have any context specified, lines with context coming first. Note that this only works when the option context = TRUE is also chosen." However, context isn't an option in the current version of tokenize().

Using plain regex works fine:

italian3 <- data.frame(Grapheme = c("c","c","i","o"), Right = c("","i","",""), Class = c("", "", "frontV", ""), IPA = c("k", "tʃ", "i", "o"))
tokenize(c("cico", "coci", "coco"), profile = italian3, transliterate = "IPA", regex = TRUE)$strings

Output:

|   | originals | tokenized | transliterated |
|---|-----------|-----------|----------------|
| 1 | cico | c i c o | tʃ i k o |
| 2 | coci | c o c i | k o tʃ i |
| 3 | coco | c o c o | k o k o |
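
For comparison, the rule that the frontV class is meant to express can be sketched in base R, without qlcData. This is only an illustration of the expected result, not a description of how tokenize() works internally; the variable names (frontV, pattern, step1, ipa) are made up for this example, and the class is assumed to contain only "i", as in the profiles above.

```r
words <- c("cico", "coci", "coco")

# Hypothetical stand-in for the "frontV" class: the graphemes it contains.
frontV <- "i"

# "c" followed by a member of the class. The lookahead (?=...) checks the
# right context without consuming it, and requires perl = TRUE.
pattern <- paste0("c(?=[", frontV, "])")

# Step 1: contextual rule, "c" before a front vowel becomes "tʃ".
step1 <- gsub(pattern, "tʃ", words, perl = TRUE)

# Step 2: default rule, every remaining "c" becomes "k".
ipa <- gsub("c", "k", step1, fixed = TRUE)

ipa
# "tʃiko" "kotʃi" "koko"
```

Apart from the spaces between graphemes, this matches the transliterated column that the plain-regex profile produces.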

Took me a while to figure out what was going wrong here :-)

It turns out that the profile columns have to be character strings, not factors, so you should either add the option stringsAsFactors = FALSE:

italian1 <- data.frame(Grapheme = c("c","c","i","o"), Right = c("","frontV","",""), Class = c("", "", "frontV", ""), IPA = c("k", "tʃ", "i", "o"), stringsAsFactors = FALSE)

or use cbind instead of data.frame:

italian1 <- cbind(Grapheme = c("c","c","i","o"), Right = c("","frontV","",""), Class = c("", "", "frontV", ""), IPA = c("k", "tʃ", "i", "o"))

or you can use the newest version here on GitHub, where I have added a safeguard to the code so that this problem won't turn up again.

Thanks for the catch!
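
For a profile that already exists as a data frame, a quick base-R check and conversion may help on older installations. Before R 4.0.0, data.frame() converted character vectors to factors by default, which is presumably how the factor columns arose here. The sketch below sets stringsAsFactors = TRUE explicitly to reproduce that behaviour on any R version; is_fac is a name made up for this example.

```r
# Build the profile with factor columns, mimicking the pre-4.0.0 default.
italian1 <- data.frame(
  Grapheme = c("c", "c", "i", "o"),
  Right    = c("", "frontV", "", ""),
  Class    = c("", "", "frontV", ""),
  IPA      = c("k", "tʃ", "i", "o"),
  stringsAsFactors = TRUE
)

# Which columns are factors? Here: all four.
is_fac <- sapply(italian1, is.factor)
is_fac

# Convert only the factor columns to character, leaving any others untouched.
italian1[is_fac] <- lapply(italian1[is_fac], as.character)

# Every column should now report "character".
sapply(italian1, class)
```

After the conversion, the profile has the same column types as one created with stringsAsFactors = FALSE.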