Cannot load Common Voice dataset on Windows

Question

jacobjennings opened this issue 9 months ago · 1 comment

🐛 Describe the bug

When loading the Common Voice dataset on Windows, the file train.tsv is opened with the cp1252 encoding (apparently the platform default, since no encoding is passed) and the read fails with a UnicodeDecodeError.

training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
Cell In[49], line 1
----> 1 training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)

File ~\Documents\GitHub\clarification\venv-pc\Lib\site-packages\torchaudio\datasets\commonvoice.py:55, in COMMONVOICE.__init__(self, root, tsv)
     53 walker = csv.reader(tsv_, delimiter="\t")
     54 self._header = next(walker)
---> 55 self._walker = list(walker)

File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
     22 def decode(self, input, final=False):
---> 23     return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3155: character maps to <undefined>
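
The traceback suggests the TSV is read with the locale default encoding instead of UTF-8. As a minimal sketch of a workaround outside torchaudio, the file can be parsed with an explicit encoding. The helper name read_tsv_utf8 and the two-column sample file are hypothetical, used only for illustration, and the real train.tsv is assumed to be UTF-8 encoded.

```python
import csv
import os
import tempfile


def read_tsv_utf8(path):
    """Read a tab-separated file as UTF-8 and return (header, rows)."""
    # encoding="utf-8" overrides the locale default (cp1252 on many Windows setups).
    # newline="" is what the csv module documentation recommends for file objects.
    with open(path, "r", encoding="utf-8", newline="") as tsv_:
        walker = csv.reader(tsv_, delimiter="\t")
        header = next(walker)
        rows = list(walker)
    return header, rows


# Self-contained demonstration with a hypothetical two-column file.
# U+201D (right double quotation mark) is E2 80 9D in UTF-8, and byte 0x9D
# has no mapping in cp1252, which matches the byte named in the traceback.
sample = "path\tsentence\nclip_0.mp3\tShe said \u201chello\u201d\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "train.tsv")
    with open(sample_path, "w", encoding="utf-8", newline="") as f:
        f.write(sample)

    # Explicit UTF-8 succeeds.
    header, rows = read_tsv_utf8(sample_path)

    # Forcing cp1252 reproduces the failure on any platform.
    cp1252_failed = False
    try:
        with open(sample_path, "r", encoding="cp1252", newline="") as f:
            list(csv.reader(f, delimiter="\t"))
    except UnicodeDecodeError:
        cp1252_failed = True
```

Another option that may avoid touching any code is Python's UTF-8 mode: setting the environment variable PYTHONUTF8=1 before starting the interpreter (Python 3.7+) makes open() default to UTF-8. Whether that is enough for the torchaudio call above has not been verified here.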

Versions

Python 3.11

Answer 1 · 2024-05-03T12:22:03.000Z

You can try downloading it from Hugging Face instead:

https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0