MycroftAI/mimic-recording-studio

question about the custom corpus

Opened this issue · 7 comments

I am looking to make my own corpus. I want to format it just like english_corpus.csv, as described, but I am curious about the numbers. Each line in english_corpus.csv has a phrase followed by a tab and a number. Do I need to worry about the number, and if so, what does it represent?

Hi there @jjohnston7 great question!

The number represents the number of characters in the phrase. Mimic trains based on the number of characters (as well as several other attributes).

For languages that have elongated vowel sounds, such as:

  • The elongated vowels with macrons in Māori
  • In Japanese the double vowel sound (in Romaji represented with a macron, like Tōkyō)
  • The long vowels و and ي in Arabic (among others)

then the long vowel should be counted as 2 characters. Otherwise the prosody of the trained language will be incorrect.
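As a rough illustration of the rule above, here is a minimal Python sketch that counts a macron vowel as 2 characters and everything else as 1. The function name `weighted_length` and the `MACRON_VOWELS` set are made up for this example and are not part of mimic-recording-studio. It only covers the macron cases (Māori, Romaji); Arabic is deliberately left out, since و and ي can also act as consonants and would need language-specific rules.

```python
import unicodedata

# Hypothetical helper, not part of mimic-recording-studio.
# Precomposed macron vowels as used in Māori and in Romaji.
MACRON_VOWELS = set("āēīōūĀĒĪŌŪ")


def weighted_length(phrase: str) -> int:
    """Count characters, treating each macron (long) vowel as 2."""
    # Normalise to NFC so that "o" + combining macron becomes a single "ō".
    text = unicodedata.normalize("NFC", phrase)
    return sum(2 if ch in MACRON_VOWELS else 1 for ch in text)
```

For example, `weighted_length("Tōkyō")` returns 7 where a plain `len("Tōkyō")` gives 5, and `weighted_length("Māori")` returns 6.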

I can try to provide more guidance if you let me know what language or dialect you are training in?

Kind regards,
Kathy

Those values were generated programmatically, so they could easily be generated for your own corpus once you have finished building it.

OH, so I don't need to include those numbers in my CSV file?

It's English I'm doing, the speaker has an English accent...

At some point those numbers do have to exist, but there is opportunity to do it later just before the training starts and not with the creation of each sentence.
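Adding the numbers in a separate pass could look something like the sketch below. It assumes a plain text file with one phrase per line and assumes the count is simply the total number of characters in the phrase. The function names and the file paths are placeholders for illustration, not anything shipped with the project.

```python
def with_counts(phrases):
    """Yield 'phrase<TAB>count' rows for an iterable of phrases."""
    for phrase in phrases:
        phrase = phrase.rstrip("\r\n")  # drop the line ending only
        if not phrase:
            continue  # skip blank lines
        yield f"{phrase}\t{len(phrase)}"


def write_corpus(in_path, out_path):
    """Read one phrase per line from in_path, write phrase<TAB>count rows."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8", newline="\n") as dst:
        for row in with_counts(src):
            dst.write(row + "\n")


if __name__ == "__main__":
    # Hypothetical file names; adjust to your own setup.
    write_corpus("my_phrases.txt", "my_corpus.csv")
```

This way the phrases can be written and edited freely, and the counts are only produced once the list is final.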

So, if you are working with an English speaker, are you just adding to english_corpus.csv? As long as you are working in the same language, you shouldn't need to worry too much about accent. That just emerges from training on a particular set of recordings.

I have a macro in Excel that will tally the characters for me. Do those numbers include things like spaces between words, commas, apostrophes, etc., or just letters? Right now I have it counting everything except the period at the end of sentences...

Looking at https://raw.githubusercontent.com/MycroftAI/mimic-recording-studio/master/backend/prompts/english_corpus.csv, it appears the count is of all characters in the phrase -- including spaces, punctuation and the final period.
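If that observation holds, the count is just the length of the phrase string, so a spreadsheet-generated corpus can be sanity-checked with a few lines of Python. This is only a sketch: `mismatched_rows` is a made-up helper, and it assumes each row is a phrase, a tab, and an integer count.

```python
def mismatched_rows(rows):
    """Return (line_number, row) pairs whose count is not len(phrase)."""
    bad = []
    for line_no, row in enumerate(rows, start=1):
        row = row.rstrip("\r\n")
        if not row:
            continue  # ignore blank lines
        # Split on the last tab so tabs inside the phrase are tolerated.
        phrase, sep, count = row.rpartition("\t")
        if not sep or not count.strip().isdigit() or len(phrase) != int(count):
            bad.append((line_no, row))
    return bad
```

Passing an open file (or a list of lines) returns the rows that need fixing. A row whose count skips the final period, for instance, will show up as off by one.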

Hi all, seems like this conversation has run its course. If you have further queries please feel free to open another issue or reach out via the ~Mimic channel in Mycroft Chat.