common-voice/cv-sentence-extractor

Fix Extraction for Belarusian (and possibly others)

MichaelKohler opened this issue · 1 comment

In #118 we discovered that the automatic extraction on Pull Requests fails for Belarusian:

https://github.com/Common-Voice/cv-sentence-extractor/pull/118/checks?check_run_id=868518093

file_name = "/home/runner/work/cv-sentence-extractor/cv-sentence-extractor/text/AA/wiki_46"
Error: "stream did not contain valid UTF-8"

I've re-triggered the job several times, and each run failed on a different file. Happy to help out if somebody wants to debug that!

Additional info:

  • The author of the PR says it works locally. I haven't had time to check that on my machine.
  • The export on merge will very likely fail as well because of this issue.
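
For anyone who wants to debug this, a small check along the following lines might help narrow things down. It is only a sketch and not code from this repository: `Utf8Problem`, `first_utf8_problem` and `check_file` are hypothetical names. It reads a file as raw bytes and reports where the first invalid UTF-8 sequence starts, using `std::str::from_utf8` from the standard library. `Utf8Error::error_len()` returns `None` when the input ends in the middle of a multi-byte character, so the `truncated` flag can hint at whether a file was cut off rather than containing stray bytes.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical helper type: where the first invalid UTF-8 sequence starts,
/// and whether the input simply ended in the middle of a character.
#[derive(Debug, PartialEq)]
struct Utf8Problem {
    /// Byte offset up to which the input is valid UTF-8.
    offset: usize,
    /// True if the input ended unexpectedly inside a multi-byte character.
    truncated: bool,
}

/// Returns `None` if `bytes` is valid UTF-8, otherwise the first problem found.
fn first_utf8_problem(bytes: &[u8]) -> Option<Utf8Problem> {
    std::str::from_utf8(bytes).err().map(|e| Utf8Problem {
        offset: e.valid_up_to(),
        // `error_len()` is `None` when the end of the input was reached
        // before the current character was complete.
        truncated: e.error_len().is_none(),
    })
}

/// Reads the whole file as bytes (no decoding) and checks it.
fn check_file(path: &Path) -> io::Result<Option<Utf8Problem>> {
    let bytes = fs::read(path)?;
    Ok(first_utf8_problem(&bytes))
}
```

Running `check_file` over the extracted `text/AA/wiki_*` files, both locally and in the CI job, would show whether the reported offsets sit at the very end of a file (`truncated` is true) or somewhere in the middle.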