Audio 1

Question

Opened this issue 2 years ago · 13 comments

I’m not great with expressing pronunciation in written form, and I don’t seem to have all the diacritical markings I would need available, so please bear with me.

General - Is there any way to increase the baseline range of the tones? Like making the highs a bit higher and the lows a bit lower? It would increase the feeling of animacy in the voices, which increases listener engagement. Additionally, just increasing the overall baseline tone range would make it easier for beginners to differentiate tones, which is super challenging for most beginners. As a separate matter, it may be worthwhile to exaggerate the tone range of certain words even more than the baseline range. Words such as greetings, adjectives, interjections, etc, may need to have an even more extended tone range, beyond the baseline, because a human voice is generally more animated in these types of words than in other parts of speech.
General - If I’m being honest, so far much of the audio doesn’t sound quite right to me.
OSIYO/SIYO - To me it sounds like it’s saying the highest tone in this word is the “si”, and has a down-up-down / inverted V shape to my ear. Like maybe it’s saying something like 2-3-2 or 2-4-32 (or similar). I don’t know, but a definite inverted V shape with “si” being the highest note. But when I hear “osiyo” in practice and in recordings like the audio of Durbin in SSW Lessons 19 and 20, I hear it as ascending tones, with the ending “yo” being the highest. Like maybe ‘long2-quick3-quick4’ or maybe a ‘long2-quick2-quick4’. I don’t know, but definitely “yo” being the highest note.
VV - This is a hard one. I think there are two issues. First, the sound itself (the v) doesn’t sound clear. Like, at 3:00 it sounds kinda like “vvowvvgh”, and at 3:06 I hear something like “vvowvvh”, almost like the sound changes between various letters within the word rather than staying a consistent “vv” sound that simply shifts pitch. Does that make sense? It makes it sound garbled. The “v” at 3:11 sounds somewhat clearer, though it still has a weird breathiness on the end, almost like “vgh”, “vth” or “vff” or something. The second issue is the tones of the first two “vv”s. On the one at 3:00, I hear it dip in pitch in the middle, kind of shaped like the character Ꮴ. Something like maybe ‘long3-quick2-quick32’. But I feel like the “vv” tones really need to be more like ‘quick2-quick32’. I hear it as having pretty much the same tones as ama /water, perhaps with a titch more of a downslur on the tail. Compare the tones in this audio https://www.cherokeedictionary.net/share/99686 with your audio at 3:00 and 3:06, you can hear the difference.
VHLA - On the audio, this sounds like it’s stepping up in tone. Like maybe something like ‘quick2-quick32’. Definitely going uphill. But don’t both “no” and “not” have a different downhill tone? I hear “vhla” as a ‘long4-quick2’. For reference, listen to Ed say it multiple times in this video from about 34:42 to 35:57. https://youtu.be/RikNifUTETQ Also, the one at 10:20 sounds to me like “ahla”, like it’s saying something’s laying there, not “vhla” / “no”.
AYO - This sounds like it’s going downhill, with the highest tone coming first. Whereas I know “ayo” to be said in a strong uphill shape, like ‘quick1-quick4’ perhaps with a bit of trailing downslur. Being an interjection of pain/surprise, there’s a lot of distance between the tones and that 4 is real high. For reference, it’s said by Juksvsd in the Inage’i cartoon at about -20:07.
HOWA - Sounds fine to me.
TOHI/TOHIJU - Some of these sound sufficient. I would argue that they’d be better with the “ju” a bit higher, and there might need to be more distance between the low tone on the “to” and the extrahigh tone in the “hi”, and that as they are now the “tohi”s and “tohiju”s sound really flat and uninterested, too monotone, not like how I hear it in real life. Greeting words and adjectives tend to sound more lively to my ear.

Edit- I was just thinking… is it possible that there’s a problem with YT or a problem on my end or something, since so many of these sound off to me? Has anyone else listened to them and given feedback? Any first language speakers?

Answer 1 · 2022-08-25T05:26:50.000Z

1. General - Is there any way to increase the baseline range of the tones?

Unfortunately, I really can't manipulate the TTS audio at all.

2. General - If I’m being honest, so far much of the audio doesn’t sound quite right to me.

I know the audio has issues, and I'm hoping to get the next version of the software and get better results in the future at some unspecified date. (It takes a bit of effort to "train" the TTS system.)

It doesn't help any that a good portion of the audio I use for the TTS training is from tape and has poor dynamic range.

Answer 2 · 2022-08-25T05:35:12.000Z

3. OSIYO/SIYO - To me it sounds like it’s saying the highest tone in this word is the “si”, and has a down-up-down / inverted V shape to my ear. Like maybe it’s saying something like 2-3-2 or 2-4-32 (or similar). I don’t know, but a definite inverted V shape with “si” being the highest note. But when I hear “osiyo” in practice and in recordings like the audio of Durbin in SSW Lessons 19 and 20, I hear it as ascending tones, with the ending “yo” being the highest. Like maybe ‘long2-quick3-quick4’ or maybe a ‘long2-quick2-quick4’. I don’t know, but definitely “yo” being the highest note.

Sounds like the word final high-fall isn't coming across correctly to you. I do know the C.E.D. entry shows o:síyo which indicates a higher tone on the sí. This again is caused by limitations I'm currently dealing with in regards to the TTS.

4. VV - This is a hard one. I think there are two issues. First, the sound itself (the v) doesn’t sound clear. Like, at 3:00 it sounds kinda like “vvowvvgh”, and at 3:06 I hear something like “vvowvvh”, almost like the sound changes between various letters within the word rather than staying a consistent “vv” sound that simply shifts pitch. Does that make sense? It makes it sound garbled. The “v” at 3:11 sounds somewhat clearer, though it still has a weird breathiness on the end, almost like  “vgh”, “vth” or “vff” or something. The second issue is the tones of the first two “vv”s. On the one at 3:00, I hear it dip in pitch in the middle, kind of shaped like the character Ꮴ. Something like maybe ‘long3-quick2-quick32’. But I feel like the “vv” tones really need to be more like ‘quick2-quick32’. I hear it as having pretty much the same tones as ama /water, perhaps with a titch more of a downslur on the tail. Compare the tones in this audio https://www.cherokeedictionary.net/share/99686 with your audio at 3:00 and 3:06, you can hear the difference.

Yeah. V:v has been a challenge to get pronounced even somewhat correctly. I've not found any combination of pronunciation markings to improve it beyond inserting a glottal stop between them. The TTS wants to add 'r' sounds to them. :( If you think the V:v sounds too awful, it can be removed completely from the audio exercises.

Answer 3 · 2022-08-25T05:38:02.000Z

5. VHLA - On the audio, this sounds like it’s stepping up in tone. Like maybe something like ‘quick2-quick32’. Definitely going uphill. But don’t both “no” and “not” have a different downhill tone? I hear “vhla” as a ‘long4-quick2’. For reference, listen to Ed say it multiple times in this video from about 34:42 to 35:57. https://youtu.be/RikNifUTETQ Also, the one at 10:20 sounds to me like “ahla”, like it’s saying something’s laying there, not “vhla” / “no”.

Ok, that is probably me hearing v̀:hla (low-fall) and not v́:hla (high) resulting in a bad transcription. I'll generate and share a new audio sample with the v́:hla and we can see how that sounds.

Answer 4 · 2022-08-25T05:40:24.000Z

6. AYO - This sounds like it’s going downhill, with the highest tone coming first. Whereas I know “ayo” to be said in a strong uphill shape, like ‘quick1-quick4’ perhaps with a bit of trailing downslur. Being an interjection of pain/surprise, there’s a lot of distance between the tones and that 4 is real high. For reference, it’s said by Juksvsd in the Inage’i cartoon at about -20:07.

I pulled this one from a non-CED source, Beginning Cherokee I think, which means no tone marks were available. Based on your description of the sound, we should probably try ayő rather than ayo as its pronunciation. I'll generate an audio sample and share it.

Answer 5 · 2022-08-25T05:41:27.000Z

8. TOHI/TOHIJU - Some of these sound sufficient. I would argue that they’d be better with the “ju” a bit higher, and there might need to be more distance between the low tone on the “to” and the extrahigh tone in the “hi”, and that as they are now the “tohi”s and “tohiju”s sound really flat and uninterested, too monotone, not like how I hear it in real life. Greeting words and adjectives tend to sound more lively to my ear.

Yes, I'm hoping the next iteration of the software will allow more control over the "liveliness" of the speaker's audio.

Answer 6 · 2022-08-25T06:01:38.000Z

is it possible that there’s a problem with YT or a problem on my end or something, since so many of these sound off to me?

No, it is the TTS that is "off", most of the time not by much. But it's still "off". This is more prevalent in some combinations of phonemes and less so in other combinations and varies somewhat between voices. I've thought about adding a couple of other voices to the mix to try and help the learner learn the actual phonemes better.

I fear the only real way to improve the quality would be to have first language speakers sequestered in an audio booth reading long lines of sample sentences for recording. For many many hours of audio.

That being said, I feel it is sufficient to be usable training audio. Even if its imperfections result in poor pronunciations at times.

A challenge I've had as a student is hearing the language presented in a comprehensible fashion for understanding what I hear, and learning the general sounds (phonemes) for different words. I've previously mimicked other approaches that attempt to have the student utter words before they know how they sound, with very disappointing results. I've gone through this new approach doing a day's session daily, and I think my vocabulary retention has improved dramatically (for the material covered). Being able to hear a word in my head makes it a lot easier to try and say said word.

Has anyone else listened to them and given feedback? Any first language speakers?

The only feedback I've gotten from first language speakers so far is a second (third?) hand one in the Cherokee Language tech chat group when I was first getting usable results from the TTS system. And I'm sure I would not find that reference looking into the chat history.

Answer 7 · 2022-08-25T06:10:03.000Z

Just as an FYI:

My goal is to get to the point that general spoken Cherokee in various materials is at least partially comprehensible.

One big challenge to comprehensibility is the long vs short forms for general speech. With the TTS system, it is feasible to include various shortened forms into the exercise materials to help with learning to hear the long forms mentally when hearing or seeing the short forms physically.

Answer 8 · 2022-08-25T07:23:10.000Z

3. OSIYO/SIYO - To me it sounds like it’s saying the highest tone in this word is the “si”, and has a down-up-down / inverted V shape to my ear. Like maybe it’s saying something like 2-3-2 or 2-4-32 (or similar). I don’t know, but a definite inverted V shape with “si” being the highest note. But when I hear “osiyo” in practice and in recordings like the audio of Durbin in SSW Lessons 19 and 20, I hear it as ascending tones, with the ending “yo” being the highest. Like maybe ‘long2-quick3-quick4’ or maybe a ‘long2-quick2-quick4’. I don’t know, but definitely “yo” being the highest note.

Sounds like the word final high-fall isn't coming across correctly to you. I do know the C.E.D. entry shows o:síyo which indicates a higher tone on the sí. This again is caused by limitations I'm currently dealing with in regards to the TTS.

You’re definitely right, for sure. I triple checked the way it’s written in the final CED, that other early draft of the CED, and also in Montgomery-Anderson. And you’re right, it’s definitely written like o:síyo. So I can’t figure out why it still sounds like o:síyő to my ear IRL, but it does. I went back and watched/listened to various trustworthy/speaker recordings, like Ed’s classes, a lesson from JW, various first speakers in interviews, the We Are Learning Cherokee audio, and it still sounds like o:síyő to me. Hmmm. I wonder what Meli and first speakers would say. What a mystery!

Answer 9 · 2022-08-25T07:49:25.000Z

Just as an FYI:

My goal is to get to the point that general spoken Cherokee in various materials is at least partially comprehensible.

One big challenge to comprehensibility is the long vs short forms for general speech. With the TTS system, it is feasible to include various shortened forms into the exercise materials to help with learning to hear the long forms mentally when hearing or seeing the short forms physically.

If it matters to your approach, I’m in the camp of people who believe that it’s important for learners to learn the full language first, before proceeding to shortened forms. I think knowing the full forms makes it easier to see the structure and logic of the language and constructions, and knowing the full form seems necessary to be able to most skillfully manipulate the language. A very simplistic example to illustrate this point would be, like… say I want to ask if a thingamabob is big. If I had only ever heard the word “big” as “utan”, how could I add a question clitic without knowing the end vowel I need to re-attach before adding the yes/no question clitic? Well, I guess maybe I could say “jigo utan”, but nobody talks like that anymore. 😆 But hopefully you see what I mean. It seems exponentially easier, in my mind, to change over to the shortened form after first learning the long form, than vice versa. But that’s just me, and there’s more than one way to peel a banana.

Answer 10 · 2022-08-25T11:58:05.000Z

I've started a discussion topic on long vs short speech. See discussion entry #8.

Answer 11 · 2022-08-25T15:19:46.000Z

Please see tickets #5 and #6

Answer 12 · 2022-08-27T19:44:59.000Z

3. OSIYO/SIYO - To me it sounds like it’s saying the highest tone in this word is the “si”, and has a down-up-down / inverted V shape to my ear. Like maybe it’s saying something like 2-3-2 or 2-4-32 (or similar). I don’t know, but a definite inverted V shape with “si” being the highest note. But when I hear “osiyo” in practice and in recordings like the audio of Durbin in SSW Lessons 19 and 20, I hear it as ascending tones, with the ending “yo” being the highest. Like maybe ‘long2-quick3-quick4’ or maybe a ‘long2-quick2-quick4’. I don’t know, but definitely “yo” being the highest note.
Sounds like the word final high-fall isn't coming across correctly to you. I do know the C.E.D. entry shows o:síyo which indicates a higher tone on the sí. This again is caused by limitations I'm currently dealing with in regards to the TTS.
You’re definitely right, for sure. I triple checked the way it’s written in the final CED, that other early draft of the CED, and also in Montgomery-Anderson. And you’re right, it’s definitely written like o:síyo. So I can’t figure out why it still sounds like o:síyő to my ear IRL, but it does. I went back and watched/listened to various trustworthy/speaker recordings, like Ed’s classes, a lesson from JW, various first speakers in interviews, the We Are Learning Cherokee audio, and it still sounds like o:síyő to me. Hmmm. I wonder what Meli and first speakers would say. What a mystery!

Do you have to write in end/boundary tones for the system yourself? If so, how do you decide what the end tones are?
Or does the system generate the boundary tones itself? If so, how does the system decide which end/boundary tone is appropriate?

Answer 13 · 2022-08-28T18:04:26.000Z

Do you have to write in end/boundary tones for the system yourself? If so, how do you decide what the end tones are? Or does the system generate the boundary tones itself? If so, how does the system decide which end/boundary tone is appropriate?

The final high-fall tone is not marked. The system is supposed to learn it as a feature of the language from the training data.

The only places where I mark the final tone are where it is clearly marked as a high tone (acute), a high-rising tone (double acute), or a level tone (macron, used for the immediate present).
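
As a rough illustration of that marking convention (a sketch only, not the actual TTS pipeline code), the check below looks at the last vowel of a transcribed word and reports which of the three explicit final tone marks it carries, if any. The function name `final_tone_mark` and the label strings are made up for this example; the only things taken from the thread are the three diacritics and what they stand for. Anything else, including an unmarked final vowel, is reported as "unmarked", which is the case the system is expected to learn from the training data.

```python
import unicodedata

# Combining marks named in the reply above (labels are illustrative).
FINAL_TONE_MARKS = {
    "\u0301": "high (acute)",
    "\u030B": "high-rising (double acute)",
    "\u0304": "level (macron)",
}


def final_tone_mark(word: str) -> str:
    """Return a label for an explicit tone mark on the word's last letter.

    Hypothetical helper: returns "unmarked" when the last letter carries
    none of the three marks listed in FINAL_TONE_MARKS.
    """
    # NFD splits precomposed letters such as "ő" into base letter + mark.
    decomposed = unicodedata.normalize("NFD", word)
    # Walk backwards over trailing combining marks until a base letter.
    for ch in reversed(decomposed):
        if ch in FINAL_TONE_MARKS:
            return FINAL_TONE_MARKS[ch]
        if not unicodedata.combining(ch):
            break  # reached the last base letter without a known mark
    return "unmarked"


if __name__ == "__main__":
    for sample in ["ay\u0151", "o:s\u00edyo", "ayo"]:
        print(sample, "->", final_tone_mark(sample))
```

With the words discussed in this thread, ayő would be reported as high-rising, while o:síyo and ayo would both be reported as unmarked, because the acute in o:síyo sits on the middle syllable rather than the final one.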