common-voice/cv-sentence-extractor

Is there a limit to the audio duration?

JJun-Guo opened this issue · 22 comments

Is there a limit to the audio duration?

Hey @JJun-Guo, recordings in Common Voice are currently limited to 10 seconds.

Here is a related recent discussion on allowing more:
https://discourse.mozilla.org/t/discussion-relaxation-of-the-10-sec-recording-limitation/114142

I need to check the code, but off the top of my head, the minimum was 1 sec and was later dropped to 0.5...

Actually, since the duration also includes silences, short utterances can easily be recorded by leaving some silence at the start or at the end of the recording.

I was wrong. It is 1 sec. 0.5 sec is for the benchmark sentences (numbers etc).

https://github.com/common-voice/common-voice/blob/3bccdf446f6acd8a9afda1db7a9a1664457e611d/web/src/components/pages/contribution/speak/speak.tsx#L42

But as I stated in the link given in the previous post, state-of-the-art models work better with longer utterances. E.g. Whisper works best with 5-25 sec recordings...

So, it is better to get an average char duration and calculate a minimum sentence length from there...
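A minimal sketch of that calculation, assuming you have already measured an average speaking rate in characters per second for your language. The function name and the 14 chars/sec figure are placeholders for illustration, not measured values:

```python
import math


def min_chars_for_duration(target_seconds: float, chars_per_second: float) -> int:
    """Smallest sentence length (in characters) expected to reach target_seconds."""
    if target_seconds <= 0 or chars_per_second <= 0:
        raise ValueError("both arguments must be positive")
    return math.ceil(target_seconds * chars_per_second)


# Hypothetical speaking rate; replace it with a value measured on your own dataset.
ASSUMED_CHARS_PER_SECOND = 14.0

# Minimum sentence length needed to reach roughly 5 seconds of audio.
print(min_chars_for_duration(5.0, ASSUMED_CHARS_PER_SECOND))
```

Since recordings also contain leading and trailing silence, the real duration will usually be somewhat longer than this estimate.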

AFAIK, a rule of thumb is to train a model with the kind of data it will see in the wild. For a general-purpose ASR model that is subjected to everyday speech, I think the data should include shorter utterances, because spontaneous speech and conversations contain them extensively, as in short answers to questions (yes, no, ok, fine, etc.) or "What do you want?" => "Tea...".

I think it is best to have more-or-less evenly distributed durations (a flat curve), and thus sentence lengths. One could work on improving their Common Voice dataset to remedy peaks in the distribution.

I created webapps where people can examine their datasets in more detail, which also helps in this area, for all CV languages.
For example, this is the duration distribution of CV 13.0 Turkish validated recordings:

[image: duration distribution of CV 13.0 Turkish validated recordings]

And this is the distribution in text corpus:
[image: distribution in the text corpus]

Because we had few CC0 sentence resources, we had to rely on volunteers writing common everyday sentences, which are short and dropped the average recording duration to 3.6 secs, from around 4 secs. We need to remedy this issue...

You can check your language from here:
https://analyzer.cv-toolbox.web.tr/

You can also check the overall changes in time here:
https://metadata.cv-toolbox.web.tr/

If you are working on the cv-sentence-extractor rules (first run):

Getting longer sentences is better, I think. It is easier to get shorter sentences from other sources. Once the extractor has taken data from an article, it is done.

Some points on this:

  • As stated, state-of-the-art models train better starting with 5 secs.
  • Most languages in CV have an average of 4-6 secs.
  • A longer sentence will result in a longer duration recording.
  • And finally what matters is the training/fine-tuning train set duration.
  • Instead of getting 3000 sentences from 1000 articles with an average of 4 sec each, if you select sentences so that the average is 8 sec, the total duration will double.
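The arithmetic behind the last point can be sketched as follows, assuming each sentence is recorded exactly once (the helper name is made up for illustration):

```python
def total_hours(sentence_count: int, avg_seconds: float) -> float:
    """Total audio duration in hours if every sentence is recorded once."""
    return sentence_count * avg_seconds / 3600


short_avg = total_hours(3000, 4.0)  # 12000 seconds of audio
long_avg = total_hours(3000, 8.0)   # 24000 seconds of audio

print(f"{short_avg:.2f} h vs {long_avg:.2f} h")
```

The sentence count stays at 3000 because of the 3-sentences-per-article limit, so the average length is the only lever on total duration.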

Instead of getting 3000 sentences from 1000 articles with an average of 4 sec each, if you select sentences so that the average is 8 sec, the total duration will double.

Not wrong, but it might be risky without proper testing. Note that if the Sentence Extractor can't find 3 sentences with the required length, it will not continue to try with fewer words; it will just use what it got and continue on to the next article. Of course, with proper analysis of the source it would be possible to fully optimize this.

@MichaelKohler, can this be made adaptive? I mean, not to put an absolute minimum, but set a "requested_minimum", if the 3 sentences are not found, fill it with shorter ones...

Yes, certainly would be an option, but that would need to be implemented. Overall this would mean going over the sentences multiple times for the case where it won't find enough sentences the first time, but probably not such a big hit on performance overall. In the end, for development purposes that won't matter and for the final run it's fine as well as that runs in the GitHub Action.
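A rough sketch of the adaptive idea discussed above, written in Python for illustration only; it is not the extractor's actual code. The function and parameter names are hypothetical, and `candidates` is assumed to contain only sentences that already passed all other rules:

```python
import random


def pick_sentences(candidates, requested_min_words, max_sentences=3, rng=None):
    """Pick up to max_sentences, preferring sentences with at least
    requested_min_words words and falling back to shorter ones if needed."""
    rng = rng or random.Random()

    def word_count(sentence):
        return len(sentence.split())

    long_enough = [s for s in candidates if word_count(s) >= requested_min_words]
    shorter = [s for s in candidates if word_count(s) < requested_min_words]

    # First pass: random choice among sentences meeting the requested minimum.
    rng.shuffle(long_enough)
    picked = long_enough[:max_sentences]

    # Second pass: fill the remaining slots with the longest of the shorter ones.
    if len(picked) < max_sentences:
        shorter.sort(key=word_count, reverse=True)
        picked.extend(shorter[: max_sentences - len(picked)])
    return picked


article = ["a b c d e f", "a b", "a b c", "a b c d e f g", "a"]
print(pick_sentences(article, requested_min_words=5))
```

This version splits the candidates in a single pass instead of going over them multiple times, which is one possible way to limit the performance cost mentioned above.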

As you know, working on this was on my to-do list, if only I can get really good results... I'll look into this. E.g. sorting sentences by length can help performance.

sorting sentences by length can help performance.

Mh, this made me think. Now I wonder if the legal requirement is just "maximum 3 sentences per article" or if there could be issues if we always pick the 3 longest sentences. In some articles the longest 3 sentences might be the majority of content. Probably something that would need to be verified just to make sure. To be clear: I only ever knew about the "maximum 3 sentences per article" without any further restrictions, but I can't guarantee that this is exactly what the lawyers said.

Very good point... But this is how it works now, isn't it? So, as of now, if an article has 3 sentences, they are taken if the rules match.
One could add a check for it so that the char count of the selected (verified by rules) sentences is at most (say) 50% of the total for example.
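Such a check could look roughly like this. It is only a sketch: the function name is hypothetical, and the 50% threshold is the example value from the post above, not a confirmed legal requirement:

```python
def within_share(selected, article_sentences, max_share=0.5):
    """Return True if the selected sentences make up at most max_share
    of the article's total character count."""
    total_chars = sum(len(s) for s in article_sentences)
    if total_chars == 0:
        # An empty article has nothing to take sentences from.
        return False
    selected_chars = sum(len(s) for s in selected)
    return selected_chars / total_chars <= max_share


article = ["aaaa", "bbbb", "cc"]               # 10 characters in total
print(within_share(["aaaa"], article))          # 4 of 10 characters
print(within_share(["aaaa", "bbbb"], article))  # 8 of 10 characters
```

Counting characters of whole sentences ignores whitespace between sentences; counting words instead would work the same way.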

Right now it's fully random, but rejecting what does not fit the rules. So generally, by analyzing the full Wikipedia dump, you could optimize the minimum words rule to get the most words out. But that would be different from always taking the longest sentences.

Of course depending on the requirements additional rules can be added. At this point I don't even know if it would be a problem or not to do it that way.

As I mentioned above, with the state-of-the-art models and HW advancements, it is better to get longer audio, thus longer texts. A change in this repo towards this goal would be awesome. Especially because there is no going back once 3-4 word sentences are taken...

With longer sentences, duplicates/similarities will also drop substantially, and more possible vocabulary will go into the text-corpus. I think more common words are already in the corpora or can easily be added from other sources, but less frequent ones will be needed by everyone (if too-technical/problematic/hard-to-read ones got correctly ruled out).

If it is legally possible of course...

@jessicarose Analogous to the other question I tagged you in, could you also check here if we in theory would be allowed to always take the 3 longest sentences per article? Thanks!

Sorry to ping the issue...

I'm nearly finalizing my work and I need to ask if taking the longest three sentences will ever be possible - because there is no going back.

Is there a limit to the audio duration?

@JJun-Guo, the recording limit is increased to 15 seconds in Common Voice v1.114.2.

@MichaelKohler: Probably all rule files should adapt to this change, including the defaults.

@HarikalarKutusu Thanks for keeping track of this. I agree. Do you know what the correct value for EN would be and then we set that as default? And do you have time to reach out to all language contributors to get a new estimate? I'd be fine with one PR updating all the values as I think it's rather low-risk of a change. One thing to note is that some languages use characters and some use words.

@MichaelKohler I think a 50% increase should be fine for both max words and characters.
The problem is with the minimums. The new Common Voice "guideline" suggests 10-15 sec recording times on par with the newer model architectures.
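As a small illustration of the 50% increase, the maximums can be scaled by the ratio of the new and old recording limits (15 sec vs 10 sec). The helper and the example limits of 14 words and 125 characters are hypothetical values, not taken from any actual rule file:

```python
import math


def scale_limit(old_limit: int, old_max_seconds: float = 10.0,
                new_max_seconds: float = 15.0) -> int:
    """Scale a maximum word/character limit by the change in recording limit,
    rounding down to stay on the safe side."""
    return math.floor(old_limit * new_max_seconds / old_max_seconds)


# Hypothetical current limits for a language.
print(scale_limit(14))   # maximum words
print(scale_limit(125))  # maximum characters
```

The minimums cannot be derived this way, since they depend on the desired shortest recording duration and on the language's speaking rate.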

With the new v17.0, I can add some character speed measurements, possibly per user, and their distribution to the Analyzer, so that one can, for example, see the 95th percentile coverage from those values. But that part should be handled by communities, as you suggest.

Languages that have already run the cv-sentence-extractor most probably got shorter sentences and might like to increase that limit for re-runs, also taking into account the recently introduced rules.

I have time for PRs and posts in Discourse, but you might need to point to them in case somebody decides on a re-run...

but you might need to point to them in case somebody decides on a re-run...

I can try to keep this in mind :)