pytorch/audio

Windows Support

vincentqb opened this issue · 26 comments

To bring Windows support with mp3 support, we need

  • Activate build for wheels and conda package on CircleCI for Windows without SoX, see #394
  • Activate SoX tests only when SoX available, see #419
  • Fix kaldi_io NameError for Windows in comment.
  • Activate CircleCI tests on Windows (e.g. #493)
  • Implement per-file-format backend dispatch mechanism.
  • Add minimp3 for mp3 support, see comment.
  • Activate all tests that depend on mp3 support.
  • Activate nightly upload for wheels and conda package, e.g. #385.
  • If MinGW is supported by pytorch, then fix #42.
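
A minimal sketch of the "activate SoX tests only when SoX is available" item, using only the standard library. The module name `hypothetical_sox_extension` and the helper `backend_available` are assumptions made for illustration; they are not torchaudio's actual names.

```python
import importlib.util
import unittest


def backend_available(module_name: str) -> bool:
    """Return True if the named module can be found in this environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


# Hypothetical module name, used only for illustration.
SOX_EXTENSION = "hypothetical_sox_extension"


class LoadTests(unittest.TestCase):
    def test_without_optional_backend(self):
        # Runs on every platform: depends only on the standard library.
        self.assertTrue(backend_available("wave"))

    @unittest.skipUnless(backend_available(SOX_EXTENSION), "SoX backend not available")
    def test_sox_only_feature(self):
        # Reported as skipped (not failed) where the extension was not built.
        self.assertTrue(backend_available(SOX_EXTENSION))
```

With this pattern the same test file can run on Windows and Linux, and the SoX-dependent cases show up as skipped rather than as errors.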

If and only if no backend supports mp3 on Windows after the above:

  • Compile SoX on Windows
  • Activate build for wheels and conda package on CircleCI for Windows with SoX using cpp extension, e.g. #394.

Closes #50, closes #219, closes #258.

cc @peterjc123, @chauhang, pytorch/pytorch#24344

The kaldi_io test is passing on Windows now. BTW, I think it's hard to compile Sox on Windows. Other things sound reasonable to me.

Thanks for the input. Can you share the output of CircleCI where the kaldi_io tests are passing?

If SoX is not possible to compile on Windows, we'll need to identify an alternative backend that offers similar file support on Windows: mp3, flac, wav, at least. soundfile unfortunately doesn't support mp3. See e.g. comparison.

Thanks for the input. Can you share the output of CircleCI where the kaldi_io tests are passing?

Sure. It was posted here: #419 (comment).

If SoX is not possible to compile on Windows, we'll need to identify an alternative backend that offers similar file support on Windows: mp3, flac, wav, at least.

What about this one? https://github.com/beetbox/audioread or https://github.com/librosa/librosa?

What about this one? https://github.com/beetbox/audioread or https://github.com/librosa/librosa?

aubio seems to perform better than librosa, according to this, and supports more formats than audioread. Thoughts?

Well, it looks good to me except its package on pypi is a source package. However, if we use the C/C++ part then we should be okay.

Well, it looks good to me except its package on pypi is a source package. However, if we use the C/C++ part then we should be okay.

What is the implication of a source package?

As you can see from https://pypi.org/project/aubio/#files, only the .tar.gz source archive is available, so there are no prebuilt binaries and it has to be compiled at install time.

What about this one? https://github.com/beetbox/audioread or https://github.com/librosa/librosa?

Both seem good. Let's go for audioread then, since it appears to be faster than librosa. I've updated the description above to reflect the choice of audioread over sox for Windows.

Have you looked into pydub? https://github.com/jiaaro/pydub

I've been using it on Windows, and it works great for mp3 and wav files. The installation is a bit involved, since it requires the user to add ffmpeg to the environment path.

@vincentqb Just some small remarks...

For the various use cases of audio i/o there are two scenarios where loading is used within torchaudio:

  1. Training

Here, loading and decoding performance is crucial and easily becomes the bottleneck of dataloaders that deal with raw audio. Typically, expensive compression formats should be avoided and simple formats such as wav, flac and mp3 should be used instead. Furthermore, seeking support is crucial for loading chunks of audio from the original (larger) tracks.
In this use-case we already have libsndfile, interfaced with pysoundfile, which covers wav and flac (at some point it would make sense to interface libsndfile directly to avoid numpy). Regarding MP3 support (+windows), I just discovered minimp3, which ticks all the boxes. It is also ridiculously fast and therefore could easily be the best tradeoff between loading and decoding speed.

  2. Inference

Here, performance is not that crucial but support for various formats such as m4a/mp4/aac would be beneficial. As we often discussed in torchaudio-contrib, I still don't see any way around ffmpeg. ;-)

To sum up, I don't think it makes sense to add another python package for audio i/o. Instead, we should focus on lower-level and faster alternatives such as minimp3 that also come with fewer dependencies. What do you think?

In this use-case we already have libsndfile, interfaced with pysoundfile that cover wav and flac (at one point it would make sense to directly interface libsndfile to avoid numpy). Regarding MP3 support (+windows) I just discovered minimp3 that ticks all boxes. Also it is ridiculously fast and therefore could easily be the best tradeoff between loading and decoding speed.

@faroit -- Have you run your benchmark with minimp3? I'd love to see how it compares.

You are suggesting having a mix of backends for different formats? That could be an option, yes. However, the context of this particular pull request is to make torchaudio available on Windows with the same features as on the other supported OSs, and so this particular pull request doesn't push the boundaries of speed :)

  1. Inference

Here, performance is not that crucial but support for various formats such as m4a/mp4/aac would be beneficial. As we often discussed in torchaudio-contrib, I still don't see any way around ffmpeg. ;-)

To sum up, I don't think it make sense to add another python package for audio i/o and instead focus on more low level and faster alternatives such as minimp3 that also come with less dependencies. What do you think?

I agree that there are already many python libraries loading audio files. In particular, those that load into numpy can be then used to load into pytorch, since pytorch can convert tensors from/to numpy at no cost. This means most users that want some very specific audio file can already do so.

It is still convenient for users to get support for some common audio file formats directly in torchaudio. But we can focus on the most critical formats (wav, flac, mp3), and support them well and fast.

In that context, since ffmpeg is a heavy dependency, I would avoid depending on it for as long as I can. :)

@vincentqb Actually both audioread and aubio rely on ffmpeg.

Ah, good point. Have any of you faced any challenges such as this installing audioread? If not, I'd say we move forward anyway.

By the way, torchvision is also moving toward ffmpeg for video.

@cpuhrsch -- You voiced not being in favor of ffmpeg in the past. Any comments?

@vincentqb It will be easy for conda users because they can simply do conda install -c conda-forge ffmpeg. To make it convenient for other users, we may just distribute the DLLs for them.

@vincentqb BTW, users can only read a file using audioread, but not write. If we want to create a new backend like sndfile and sox, we'd better choose something else.

Let's list the requirements for a backend:

  • Easy installation with torchaudio for the user in windows (for this PR).
  • Read wav/mp3/flac whole files, or chunks at specified location of a file.
  • Write wav/mp3/flac whole files.
  • Optional: Perform well in this benchmark.
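
To make the "chunks at specified location" requirement concrete, here is a small sketch using only Python's standard `wave` module. It covers 16-bit mono wav only, and the helper names `write_wav` and `read_wav_chunk` are hypothetical; it says nothing about how mp3 or flac would be handled.

```python
import io
import struct
import wave


def write_wav(target, samples, sample_rate=8000):
    """Write 16-bit mono PCM samples to a path or a seekable file-like object."""
    with wave.open(target, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(sample_rate)
        w.writeframes(struct.pack(f"<{len(samples)}h", *samples))


def read_wav_chunk(source, offset, num_frames):
    """Seek to frame `offset` and read at most `num_frames` frames (16-bit mono)."""
    with wave.open(source, "rb") as r:
        r.setpos(offset)               # jump to the frame without reading earlier data
        raw = r.readframes(num_frames)
    return list(struct.unpack(f"<{len(raw) // 2}h", raw))


# Usage: write 100 samples into memory, then read 5 frames starting at frame 10.
buf = io.BytesIO()
write_wav(buf, list(range(100)))
buf.seek(0)
chunk = read_wav_chunk(buf, offset=10, num_frames=5)  # [10, 11, 12, 13, 14]
```

The point of the sketch is the seek-then-read shape of the call: a backend that meets the requirement should be able to do the same without decoding the whole file first.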

@peterjc123 -- Please do let me know if I forget anything in this list. Do you know any other backend that would work well with those criteria?

@vincentqb

Have you run your benchmark with minimp3? I'd love to see how it compares.

There is no functional python/numpy interface yet (see status of pyminimp3), so I used the implementation recently added to tf.io. The performance looks incredible:

[Image: benchmark_tf]

(ar_ffmpeg is audioread's ffmpeg interface)

@vincentqb @peterjc123

Sorry for hijacking this thread.

In that context, since ffmpeg is a heavy dependency, I would avoid depending on it for as long as I can. :)

I totally agree with you. FFMPEG is going to be painful. But I don't think there is any other alternative for supporting a large number of formats.

That's why I think we should have some fast decoder-only alternatives for a limited number of formats (useful for training). I am still in favor of removing sox and just going with sndfile/minimp3 for this scenario. Then ffmpeg for writing and everything else where loading speed is not an issue.

On ffmpeg, I'd like to add the idea that, in general, we want backends to be opt-in.

By default we should pick a light library that works for most common formats and then allow the user to switch to different backends (such as ffmpeg) for either performance or features.

Figuring out how to set up this backend dispatch mechanism could probably resolve many of the discussions here. Essentially we want to have load and save dispatch to a different backend depending on file format and the user's settings.

The simplest approach is to make a choice at compile-time. We're already beyond that with our global run-time backend mechanism.

A more granular approach is to then allow users to set different backends for each file format.

Beyond that, we can even introduce a preferred order per file format based on the backends available (e.g. use specialized library X over Y when available, but transparently default to Y otherwise).
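
As a rough illustration of the per-file-format dispatch with a preferred order, here is a minimal sketch. The class `BackendRegistry`, its methods, and the stub loaders are all hypothetical; this is not torchaudio's backend API, just one way the idea could look.

```python
import os


class BackendRegistry:
    """Map file extensions to an ordered list of preferred backends."""

    def __init__(self):
        self._loaders = {}      # backend name -> load function (available backends only)
        self._preferences = {}  # file extension -> backend names, most preferred first

    def register(self, name, load_fn):
        """Declare a backend as available in this environment."""
        self._loaders[name] = load_fn

    def set_preference(self, extension, backend_names):
        """Set the preferred backend order for one file format."""
        self._preferences[extension.lower()] = list(backend_names)

    def resolve(self, path):
        """Return the first preferred backend for this file that is available."""
        ext = os.path.splitext(path)[1].lower().lstrip(".")
        for name in self._preferences.get(ext, []):
            if name in self._loaders:
                return name
        raise RuntimeError(f"no available backend for .{ext} files")

    def load(self, path):
        return self._loaders[self.resolve(path)](path)


# Usage with stub loaders standing in for real decoders.
registry = BackendRegistry()
registry.register("ffmpeg", lambda path: ("ffmpeg", path))
registry.register("sndfile", lambda path: ("sndfile", path))
registry.set_preference("mp3", ["minimp3", "ffmpeg"])  # prefer minimp3, fall back to ffmpeg
registry.set_preference("wav", ["sndfile"])

backend = registry.resolve("speech.mp3")  # "ffmpeg", since minimp3 is not registered
```

Once a "minimp3" loader is registered, the same call resolves to it without the user changing anything, which is the transparent fallback described above.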

Right. With the current global run-time backend dispatch, though, we do not support mp3 on Windows. One option is to switch the default global backend to something that also supports mp3 on Windows. Another is to add a file-format-dependent dispatch.

The former would favor going all-in with ffmpeg. The latter favors minimp3.

Based on feedback above from @faroit and @cpuhrsch, the latter is preferred as the next step. I'm good with that conclusion, so I'll update the todo/description above to reflect that.

@vincentqb I saw a post that describes how to compile torchaudio with Sox. Will try that later.

Torchaudio with Sox: #648

mp3 for windows without sox in #1000

@vincentqb if you want also support writing MP3s on Windows, I would recommend https://github.com/chrisstaite/lameenc

I have been using it for a while inside demucs, and it is amazing (in the sense that it is small, no extra dependencies, and works perfectly with just a pip install on all OSes). At the moment though it seems their build for python3.9 is broken...

thanks for the input :)

Hi there, I see that ffmpeg and sox are issues for this library. I want to let you know that I've solved these exact problems for tools like this so that these binaries can be easily deployed for Mac/Win/Linux.

Please see:

https://github.com/zackees/static-ffmpeg
https://github.com/zackees/static-sox

Using tools like ffmpeg will allow you to write mp3's with minimal code and have it work everywhere. I recommend using static_ffmpeg.add_paths(weak=True) and static_sox.add_paths(weak=True).

These python packages are available through pip as well so can be included in your dependency management. The binaries are only downloaded when they are first used. By specifying weak=True the libraries will only download ffmpeg/sox if the binaries don't already exist on the system.
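
The "only if the binaries don't already exist" behaviour described above can be sketched with the standard library alone. This is an illustration of the idea, not how static-ffmpeg or static-sox implement it; `ensure_tool` and the `provide` callback are hypothetical names.

```python
import shutil


def ensure_tool(name, provide):
    """Return the path to `name`, calling `provide()` only if it is not already on PATH.

    `provide` is any zero-argument callable that makes the tool available,
    for example by downloading it and extending PATH.
    """
    found = shutil.which(name)
    if found is not None:
        return found          # an existing install wins; nothing is downloaded
    provide()
    return shutil.which(name)  # None if `provide` did not make the tool visible
```

A caller would pass something like the package's own path-setup function as `provide`, so that a system-wide ffmpeg or sox is used whenever one is present.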