Cannot load Common Voice dataset on Windows
jacobjennings opened this issue · 1 comment
jacobjennings commented
🐛 Describe the bug
When loading the Common Voice dataset on Windows, the file train.tsv
is decoded as cp1252 rather than UTF-8, so constructing the dataset fails with a UnicodeDecodeError:
training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[49], line 1
----> 1 training_speech_dataset = torchaudio.datasets.COMMONVOICE(root=base_dataset_cache_directory)
File ~\Documents\GitHub\clarification\venv-pc\Lib\site-packages\torchaudio\datasets\commonvoice.py:55, in COMMONVOICE.__init__(self, root, tsv)
53 walker = csv.reader(tsv_, delimiter="\t")
54 self._header = next(walker)
---> 55 self._walker = list(walker)
File ~\AppData\Local\Programs\Python\Python311\Lib\encodings\cp1252.py:23, in IncrementalDecoder.decode(self, input, final)
22 def decode(self, input, final=False):
---> 23 return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 3155: character maps to <undefined>
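A minimal sketch of the failure and one possible fix, independent of torchaudio. The byte 0x9d is the last byte of the UTF-8 encoding of a right curly quote (U+201D) and has no mapping in cp1252, which matches the error above. The helper `read_tsv` and the sample file contents are hypothetical, made up for illustration; only the `csv.reader(tsv_, delimiter="\t")` pattern comes from the traceback. It assumes train.tsv is UTF-8 encoded.

```python
import csv
import os
import tempfile


def read_tsv(path, encoding):
    """Hypothetical helper: read a TSV the way the traceback does, with an explicit encoding."""
    with open(path, "r", encoding=encoding, newline="") as tsv_:
        walker = csv.reader(tsv_, delimiter="\t")
        header = next(walker)
        rows = list(walker)
    return header, rows


# Write a tiny stand-in for train.tsv (made-up contents) as UTF-8.
# The curly quotes encode to bytes ending in 0x9c and 0x9d.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "train.tsv")
with open(sample_path, "w", encoding="utf-8", newline="") as f:
    f.write("client_id\tpath\tsentence\n")
    f.write("abc\tclip_0.mp3\tShe said \u201chello\u201d\n")

# Decoding as cp1252 (the default seen in the traceback) fails on byte 0x9d.
try:
    read_tsv(sample_path, "cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True

# Passing encoding="utf-8" explicitly reads the same file without error.
header, rows = read_tsv(sample_path, "utf-8")
```

If the same cause applies here, starting Python in UTF-8 mode (setting `PYTHONUTF8=1` before launch, or running with `-X utf8`) changes the default encoding used by `open()` and may work around the error without patching the package; I have not verified this against torchaudio itself.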
Versions
Python 3.11
mogwai commented
You can try downloading it from Hugging Face instead:
https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0