common-voice/cv-sentence-extractor

Clean incomplete sentences

Mte90 opened this issue · 3 comments

Mte90 commented

Some examples:

La società "Discordia Ltd.
Basta provare".
Fermi tutti!".
Maschinenfabrik" a Amburgo.

What exactly is the expectation here?

If those sentences need to be removed from the existing CV sentences, you can create a PR against the wiki file in the CV repo. Sentences that have already been recorded won't be deleted, though.

You might also want to look at matching_symbols for the Italian PR you are currently working on.

Mte90 commented

I got those sentences with the WIP. The issue is that each one contains just a single ", so it is not valid as a sentence.

Gotcha. I was wrong: for this you probably want to define even_symbols. matching_symbols would probably work as well, but it is more complicated for this case. See

even_symbols = ["\""]
for an example.
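
To illustrate the idea, here is a minimal Python sketch. It assumes the rule rejects any sentence in which a listed symbol occurs an odd number of times. It is not the extractor's actual implementation, and the helper name has_even_symbols is hypothetical.

```python
# Hypothetical helper, not the extractor's real code. It only illustrates
# the assumed rule: each listed symbol must occur an even number of times.
EVEN_SYMBOLS = ['"']  # mirrors even_symbols = ["\""] from the config


def has_even_symbols(sentence, symbols=EVEN_SYMBOLS):
    """Return True when every symbol occurs an even number of times."""
    return all(sentence.count(symbol) % 2 == 0 for symbol in symbols)


# The incomplete sentences reported in this issue.
examples = [
    'La società "Discordia Ltd.',
    'Basta provare".',
    'Fermi tutti!".',
    'Maschinenfabrik" a Amburgo.',
]

# Collect the sentences that fail the check (odd number of quotes).
rejected = [s for s in examples if not has_even_symbols(s)]
```

Under that assumption, all four examples above would be filtered out, because each contains exactly one ", while a sentence with a properly closed quotation or with no quotes at all would pass.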