ARBML/klaam

Error opening training file, File contains data in an unknown format.

JihadZoabi opened this issue · 4 comments

Hi Ziad,
I tried running the script from the README to train the MSA model:

python run_common_voice.py --model_name_or_path="facebook/wav2vec2-large-xlsr-53" --dataset_config_name="ar" --output_dir=/path/to/output/ --cache_dir=/path/to/cache --overwrite_output_dir="yes" --num_train_epochs="1" --per_device_train_batch_size="32" --per_device_eval_batch_size="32" --evaluation_strategy="steps" --learning_rate="3e-4" --warmup_steps="500" --fp16="no" --freeze_feature_extractor="yes" --save_steps="10" --eval_steps="10" --save_total_limit="1" --logging_steps="10" --group_by_length="no" --feat_proj_dropout="0.0" --layerdrop="0.1" --do_train="yes" --do_eval="yes" --max_train_samples 100 --max_val_samples 100

And I got this message:

Traceback (most recent call last):
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 511, in <module>
main()
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 400, in main
train_dataset = train_dataset.map(
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 1955, in map
return self._map_single(
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 520, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 487, in wrapper
out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\fingerprint.py", line 458, in wrapper
out = func(self, *args, **kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 2320, in _map_single
example = apply_function_on_filtered_inputs(example, i, offset=offset)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 2220, in apply_function_on_filtered_inputs
processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\datasets\arrow_dataset.py", line 1915, in decorated
result = f(decorated_item, *args, **kwargs)
File "C:\Users\user\PycharmProjects\pythonProject1\klaam\run_common_voice.py", line 394, in speech_file_to_array_fn
speech_array, sampling_rate = torchaudio.load(batch["path"])
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\torchaudio\backend\soundfile_backend.py", line 197, in load
with soundfile.SoundFile(filepath, "r") as file_:
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 629, in __init__
self._file = self._open(file, mode_int, closefd)
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 1183, in _open
_error_check(_snd.sf_error(file_ptr),
File "C:\Users\user\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.9_qbz5n2kfra8p0\LocalCache\local-packages\Python39\site-packages\soundfile.py", line 1357, in _error_check
raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening '/path/to/cache\downloads\extracted\31455a499a0212b1751dd0c1547b0d360037f6a8c0a69178647a45a577d0ff67\cv-corpus-6.1-2020-12-11/ar/clips/common_voice_ar_19225971.mp3': File contains data in an unknown format.

I think the reason is that the training files are .mp3 rather than .wav.
Any suggestions on how I can tackle this problem?

Maybe you need to convert them first?

To be on the safe side, I always convert to .wav when using this wav2vec model on an audio dataset. I usually use pydub for this.

You can loop over your files and process each one with:

#https://stackoverflow.com/a/12391451/4412324
from pydub import AudioSegment
sound = AudioSegment.from_mp3("/path/to/file.mp3")
# export to the proper place
sound.export("/output/path/file.wav", format="wav")
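The loop over the files could be sketched as follows. This is a minimal sketch, not code from the klaam repository: `wav_path_for` and `convert_clips` are hypothetical helper names, and the directory arguments are placeholders. It assumes pydub is installed and that FFmpeg is available on the PATH, which pydub needs in order to decode MP3.

```python
from pathlib import Path


def wav_path_for(mp3_path: Path, out_dir: Path) -> Path:
    """Return the .wav path inside out_dir that corresponds to mp3_path."""
    return out_dir / (mp3_path.stem + ".wav")


def convert_clips(clips_dir: str, out_dir: str) -> int:
    """Convert every .mp3 in clips_dir to .wav in out_dir; return how many were converted."""
    # Imported here so the path helper above works even without pydub installed.
    from pydub import AudioSegment  # assumption: FFmpeg is on PATH for MP3 decoding

    src, dst = Path(clips_dir), Path(out_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = 0
    for mp3_path in sorted(src.glob("*.mp3")):
        target = wav_path_for(mp3_path, dst)
        if target.exists():  # skip clips already converted on an earlier run
            continue
        AudioSegment.from_mp3(str(mp3_path)).export(str(target), format="wav")
        converted += 1
    return converted
```

For example, `convert_clips("/path/to/clips", "/path/to/clips")` would write each .wav next to its source .mp3; writing to a separate directory keeps the original download untouched.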

Yes, I iterated over each file and converted it to .wav.
But now I am getting this error:

RuntimeError: Error opening '/path/to/cache\downloads\extracted\31455a499a0212b1751dd0c1547b0d360037f6a8c0a69178647a45a577d0ff67\cv-corpus-6.1-2020-12-11/ar/clips/common_voice_ar_19225971.mp3': System error.

I think that's because the file common_voice_ar_19225971.mp3 no longer exists; it is now common_voice_ar_19225971.wav.
I also changed the file extensions (from .mp3 to .wav) in the tsv files (train.tsv, test.tsv, validated.tsv, invalidated.tsv, etc.).

So the question is why the script is still looking for common_voice_ar_19225971.mp3 rather than common_voice_ar_19225971.wav. A possible explanation is that the Arrow files for train.tsv, test.tsv, validated.tsv, and invalidated.tsv still contain the old .mp3 paths.
Arrow files cannot be edited with a text editor, and the documentation doesn't explain how to regenerate them from the new tsv files or how to edit them directly.

Of course, that's just one possibility; maybe I am missing something obvious.
What do you think?
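If the stale paths do come from the Arrow files, one way to avoid editing them is to rewrite the extension when each example is loaded. The following is a minimal sketch under that assumption, and it also assumes each .wav sits next to the original .mp3 with the same base name. `use_wav_path` is a hypothetical helper; the `path` key is the one the traceback shows being passed to `torchaudio.load`.

```python
import os


def use_wav_path(batch: dict) -> dict:
    """Point batch["path"] at the converted .wav file instead of the .mp3."""
    root, ext = os.path.splitext(batch["path"])
    if ext.lower() == ".mp3":
        batch["path"] = root + ".wav"
    return batch
```

Calling `use_wav_path(batch)` at the top of `speech_file_to_array_fn` in run_common_voice.py, before `torchaudio.load(batch["path"])`, would redirect the load without touching the tsv or Arrow files.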

Did you find a solution for that?

You should install a supported FFmpeg build. On Ubuntu:

sudo add-apt-repository -y ppa:savoury1/ffmpeg4
sudo apt-get -qq install -y ffmpeg
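To confirm that Python can find FFmpeg after installing it, a quick check along these lines may help. It is a sketch using only the standard library; `ffmpeg_version` is a hypothetical helper name, and it only verifies that an `ffmpeg` executable is on the PATH, not that a particular audio backend will use it.

```python
import shutil
import subprocess
from typing import Optional


def ffmpeg_version() -> Optional[str]:
    """Return the first line of `ffmpeg -version`, or None if ffmpeg is not on PATH."""
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    result = subprocess.run([exe, "-version"], capture_output=True, text=True, check=False)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


print(ffmpeg_version() or "ffmpeg not found on PATH")
```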