Update on TorchAudio’s future
scotts opened this issue · 23 comments
Dear TorchAudio users,
TorchAudio is the most popular audio library for PyTorch. It has critical transforms, models and datasets that we know the community relies on. That is why we wanted to let the community know that we have started a refactoring effort to transition TorchAudio into a maintenance phase. This process will involve removal of some user-facing features. We have three goals we want to achieve with this effort:
- Make TorchAudio easier to maintain to ensure long-term reliability. We plan to eliminate all C++ code so that TorchAudio is a Python-only library. We also plan to reduce external dependencies as much as possible. Both efforts will simplify testing and release.
- Reduce redundancies with the rest of the PyTorch ecosystem. Some of the functionality in TorchAudio is also available in TorchVision and TorchCodec. We are working across all three libraries to ensure a given capability lives in one library.
- Focus on TorchAudio’s strengths. Those strengths are the audio transforms, models and datasets that are integral to users' training and inference pipelines. As a result, we will deprecate and eventually remove some functionality that is outside of these strengths.
The diagram below depicts the various components of TorchAudio. We have highlighted it according to the user-facing API changes that we are making:
Starting with TorchAudio 2.8 (expected around August 2025), APIs slated for removal will trigger a deprecation warning. These APIs will be fully removed in TorchAudio 2.9 (anticipated by the end of 2025).
Most of the APIs in transforms, functional, compliance.kaldi, models and pipelines modules will remain. These are the APIs that we identified as the most popular and valuable ones.
- A few APIs, specifically those relying on C++ implementations like RNNT loss and forced-alignment, may be dropped. Some, like lfilter and overdrive, will switch to pure-Python implementations, which might affect performance. We are exploring options to retain C++-backed APIs, but this is unlikely.
- Remaining APIs will be compatible with the latest stable PyTorch version. No new features will be added.
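To make the performance concern around lfilter concrete, here is a minimal sketch of a direct-form IIR filter written with plain Python lists. The name `lfilter_reference` is hypothetical and this is not TorchAudio's implementation (it ignores batching, clamping and autograd). It only illustrates that each output sample depends on earlier outputs, so the time loop is inherently sequential.

```python
from typing import List, Sequence


def lfilter_reference(
    x: Sequence[float], b: Sequence[float], a: Sequence[float]
) -> List[float]:
    """Hypothetical reference filter, for illustration only.

    Implements a[0]*y[n] = sum_k b[k]*x[n-k] - sum_{k>=1} a[k]*y[n-k].
    """
    if not a or a[0] == 0:
        raise ValueError("a[0] must be non-zero")
    y: List[float] = []
    for n in range(len(x)):
        acc = 0.0
        # Feed-forward part: weighted sum of current and past inputs.
        for k in range(len(b)):
            if n - k >= 0:
                acc += b[k] * x[n - k]
        # Feedback part: needs y[n-1], y[n-2], ... which is why the
        # loop over n cannot simply be replaced by one vectorized call.
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y.append(acc / a[0])
    return y


# Impulse response of a one-pole filter y[n] = x[n] + 0.5 * y[n-1].
impulse = [1.0, 0.0, 0.0, 0.0]
response = lfilter_reference(impulse, b=[1.0], a=[1.0, -0.5])
print(response)
```

A compiled kernel runs the same recursion in native code, which is where the speed difference for long signals would be expected to come from.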
The decoding and encoding capabilities of TorchAudio for both audio and video data will migrate to TorchCodec, where we are consolidating all of PyTorch media decoding and encoding. TorchAudio’s decoding and encoding APIs will be deprecated from TorchAudio 2.8, and they will be removed in TorchAudio 2.9, so we encourage users to migrate to TorchCodec as soon as possible. TorchCodec already supports video and audio decoding, and encoding will be supported soon. While there isn't a direct 1:1 API mapping, the migration process should be smooth. Please report any issues in the TorchCodec repository.
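As a rough idea of what decoding with TorchCodec can look like, here is a minimal sketch. It assumes `torchcodec` is installed and that its `AudioDecoder` class exposes `get_all_samples()` returning an object with `data` and `sample_rate`; the helper name `decode_audio` is hypothetical. Please check the TorchCodec documentation for the exact API of the version you install.

```python
def decode_audio(path: str):
    """Hypothetical helper: return (waveform, sample_rate) via TorchCodec.

    The import is deferred so that merely defining this function does not
    require torchcodec to be installed.
    """
    # Assumption: torchcodec provides AudioDecoder in torchcodec.decoders.
    from torchcodec.decoders import AudioDecoder

    decoder = AudioDecoder(path)
    samples = decoder.get_all_samples()
    # Assumption: `data` is a tensor of shape (num_channels, num_samples).
    return samples.data, samples.sample_rate
```

Usage would be along the lines of `waveform, sample_rate = decode_audio("speech.wav")`, where the file name is only an example.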
All other modules and APIs will be removed in TorchAudio 2.9.
We understand that these changes may be disruptive. We believe that they are unfortunately necessary, in order for us to guarantee TorchAudio’s stability in the future.
[EDIT from @NicolasHug] We'll also be removing torchaudio from the official installation instructions, starting from the 2.8 release.
Hi @scotts, thanks for reporting the status of torchaudio and future plans.
I don't understand the decision to drop the C++/CUDA extensions...
They were implemented because pure-Python versions (even with JIT compilation) are extremely inefficient.
Just like you said, Torchaudio's strength is its various audio transforms.
Thus, they should be kept instead of removed.
Switching back to pure-Python implementations is like going backwards and makes no sense.
These low-level implementations enable a state-of-the-art training speed compared to other libraries. (check out torchaudio 2.1 ASRU paper.)
The lfilter has recently been used in torchfx as the low-level operator for differentiable and fast filtering on GPU.
They're valuable to the community, and the decision to drop them is unwise, disruptive, and disastrous.
There should be more discussions on this before making the decision.
I suggest holding back this decision.
Best wishes,
Chin-Yun
PhD student
Centre for Digital Music
School of Electronic Engineering and Computer Science
Queen Mary University of London
Email: chin-yun.yu@qmul.ac.uk
@scotts thanks for the update!
Removing the C++/CUDA extensions is a big step backwards for the community and makes some of the implementations essentially useless due to their slow Python-only versions. I understand some concessions must be made if PyTorch Audio is no longer going to be actively developed, but I would also highly encourage reconsidering the removal of the C++ extensions, at least for the most popular operators.
Thanks!
About lfilter, it would be nice to match the scipy precision and behaviour. I understand the big picture, but this change creates a lot of work.
@yoyolicoris, @christhetree, thanks for taking the time to reply. I understand that removing C++ implementations may be a performance regression for those components. I would like to further explain the motivation for why removing this C++ code specifically improves the long-term health of TorchAudio:
- C++ compilation complicates testing. Because we need to use different C++ compilers across the cross product of all supported platforms (Linux, Windows and Mac) and architectures (x86, aarch64), there's much more chance of breakages. A Python-only repo reduces the testing matrix down to just platform and Python version.
- C++ binaries complicate release. Each entry in the cross product of platforms, architectures, devices and Python versions requires a separate wheel. Because of this, the "TorchAudio 2.7 release" is actually 109 wheel files. A Python-only repo reduces that down to the same number of wheels as supported devices, which I think would be just 4.
- The Torch C++ API is not ABI-stable, and all libraries that use the C++ API must release with each new version of PyTorch. This means that points 1 and 2 must be dealt with on the regular PyTorch release cadence, which is roughly every 3 months.
In the update, we did say: "We are exploring options to retain C++-backed APIs, but this is unlikely." Specifically, that exploration is if we can take advantage of a new effort in PyTorch 2.7, which is a stable ABI. That only addresses point 3, but addressing point 3 could greatly reduce the cost of point 2. The cost of point 1 would still stand, though. For those interested in retaining various C++ components, let us know if you have the capacity to explore porting these components to the stable ABI. That changes the maintenance cost equation.
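To illustrate how quickly that cross product grows, here is a small sketch. The support lists and the validity rule below are hypothetical, chosen only for illustration. They are not TorchAudio's actual build matrix and are not meant to reproduce the 109 figure.

```python
from itertools import product

# Hypothetical support lists (assumptions, not the real matrix).
platforms = ["linux", "windows", "macos"]
architectures = ["x86_64", "aarch64"]
devices = ["cpu", "accel-a", "accel-b", "accel-c"]
python_versions = ["3.9", "3.10", "3.11", "3.12", "3.13"]


def is_buildable(platform: str, arch: str, device: str, py: str) -> bool:
    """Hypothetical rule: assume macOS only gets CPU builds."""
    return not (platform == "macos" and device != "cpu")


# With compiled extensions, every valid combination needs its own wheel.
all_combos = list(product(platforms, architectures, devices, python_versions))
compiled_wheels = [combo for combo in all_combos if is_buildable(*combo)]

# Per the comment above, a Python-only package would need one wheel
# per supported device.
pure_python_wheels = len(devices)

print(len(all_combos), len(compiled_wheels), pure_python_wheels)
```

Adding one Python version or one device multiplies the compiled count, while the Python-only count grows only with the number of devices.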
- Make TorchAudio easier to maintain to ensure long-term reliability. We plan to eliminate all C++ code so that TorchAudio is a Python-only library. We also plan to reduce external dependencies as much as possible. Both efforts will simplify testing and release.
Maybe for some other C++ components, the model could be to factor them out into a separate repo which doesn't provide binary releases, supports only some GitHub Actions CI for testing, and relies on users building it themselves.
Also, for some C++ code, maybe the load_inline(...) method can be used / improved: https://pytorch.org/docs/stable/cpp_extension.html#torch.utils.cpp_extension.load_inline for simplifying build scripts. That way the user would be responsible for having a working toolchain, and binaries would be built on the end user's machine.
Also, maybe a way forward would be to convert some C++ code to pure C API (e.g. could work for ffmpeg effects), to be called via ctypes (and use DLPack API or pure pointers for passing tensors for processing). This should eliminate the problem of unstable PyTorch C++ ABI.
Regarding ffmpeg effects, maybe they could also be moved to torchcodec, as working with ffmpeg filter chains would be a very useful feature...
Another useful component in torchaudio is the bindings to flashlight, but flashlight itself has been discontinued for several years now. So probably the best path there would be factoring out the flashlight C++ code + Python bindings in torchaudio into a new standalone repo like Nvidia did: https://github.com/nvidia-riva/riva-asrlib-decoder . This is already half-done in https://github.com/flashlight/text, but it would be nice to maybe move the Python bindings https://pytorch.org/audio/0.12.0/models.decoder.html next to it? Also, given that Flashlight itself is discontinued, maybe it's worth moving the decoder out of the Flashlight org and into the pytorch org?
Thank you for sharing this.
I respect and love what you guys are doing, but you're treating Python like it's not Python.
You already know that this means most of the library's APIs are going to be tens of times (if not hundreds of times) slower and more inefficient by all measures... Dropping C++ is not worth it here, it's not possible to match the performance with Python. To be fair, it's fast because it's not really Python code.
Thanks for all the efforts,
I hope you refine your plans for TorchAudio at least to some extent.
@scotts Also, might be interesting to promote some of stable signal processing functions / modules into PyTorch core (e.g. new torch.signal namespace akin to https://docs.scipy.org/doc/scipy/reference/signal.html)?
Another solution might be:
- moving all python-only models/code to HuggingFace
- moving mature functions/transforms to core pytorch
- moving all other C++ extensions to use torch.utils.cpp_extension.load_inline or nvrtc (via https://github.com/NVIDIA/cuda-python)
- maybe when possible - let go of using the libtorch / torch::Tensor interface and replace it with DLPack interfacing or raw pointers; this would make user-side compilation very stable
Hi all, here's a quick update, as we just published TorchAudio 2.8.
Deprecated APIs
Most APIs marked as "Drop" above are now explicitly deprecated, raising deprecation warnings in the docs, and when using them from Python. They will be removed in the next 2.9 version.
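For readers wondering how a deprecation warning of this kind is typically raised from Python, here is a minimal sketch. The `deprecated` decorator and `list_backends` function are hypothetical and are not TorchAudio's actual code. The message wording loosely mirrors the warning quoted later in this thread.

```python
import functools
import warnings


def deprecated(removal_version: str, reason: str = ""):
    """Hypothetical decorator that warns on every call to the wrapped API."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} has been deprecated. {reason} "
                f"It will be removed from the {removal_version} release.",
                UserWarning,
                stacklevel=2,  # attribute the warning to the caller's line
            )
            return func(*args, **kwargs)

        return wrapper

    return decorator


@deprecated("2.9", "See the maintenance-phase announcement.")
def list_backends():
    """Stand-in for a deprecated API; still works, but warns."""
    return []
```

The decorated function keeps returning its normal result, so existing code continues to run while the warning gives users time to migrate.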
Migration of load() and save() to TorchCodec
As we mentioned, we are consolidating the decoding and encoding capabilities of PyTorch in TorchCodec.
torchaudio.load() and torchaudio.save() are some of the most popular TorchAudio APIs, so for convenience we are providing torchaudio.load_with_torchcodec() and torchaudio.save_with_torchcodec(), which can largely be used as drop-in replacements. However, we do encourage users to directly migrate to TorchCodec's AudioDecoder() and AudioEncoder().
In future versions, torchaudio.load() and torchaudio.save() will still exist, but their underlying implementation will be relying on torchaudio.load_with_torchcodec() and torchaudio.save_with_torchcodec().
We hope for this migration to be as smooth as possible - most users should just need to pip install torchcodec, and things should still work as-is.
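For code that has to run on several TorchAudio releases, one option is feature detection rather than hard-coding a function name. The sketch below assumes only the two names mentioned above (`load` and `load_with_torchcodec`); `pick_loader` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without TorchAudio installed.

```python
from types import SimpleNamespace
from typing import Any, Callable


def pick_loader(audio_module: Any) -> Callable[..., Any]:
    """Prefer load_with_torchcodec when the module provides it."""
    loader = getattr(audio_module, "load_with_torchcodec", None)
    if callable(loader):
        return loader
    return audio_module.load


# Stand-in modules for demonstration only (not real TorchAudio).
older_release = SimpleNamespace(load=lambda path: ("legacy", path))
newer_release = SimpleNamespace(
    load=lambda path: ("legacy", path),
    load_with_torchcodec=lambda path: ("torchcodec", path),
)

print(pick_loader(older_release)("a.wav"))
print(pick_loader(newer_release)("a.wav"))
```

In a real project you would pass the imported `torchaudio` module instead of the stand-ins.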
TorchCodec doesn't support Windows yet, but we're working hard on it. Please bear with us.
C++ and CUDA extension
We mentioned that we were exploring options to retain the C++-backed APIs, which are currently slated for deletion. Specifically: forced_align, lfilter, overdrive, RNNT, and CUCTC.
While this isn't something I can assert with 100% certainty, we are now more confident that we'll be able to preserve these extensions by porting them to PyTorch's new "stable ABI" operators. We are actively working on it.
Nicolas
Would it be another alternative to somehow convince core to take in all the C++/CUDA ops from torchaudio? (like some CTC impl is already in core, and lfilter might be the basis for new torch.signal namespace drawing from scipy.signal) :)
This would radically simplify build process of torchaudio and can make it Python-only
Given that the development of torchaudio is not increasing, could it be a good way forward?
Hi guys, I am very glad to say that you have a great, fun product: a simple full stack with an interface to all the Hugging Face models for building great apps, even for personal needs, and for that I am very pleased to say you do a great job! But now the question is the reliability and resilience of the product. Could you please at least publish a release report stating what is up and running and what is still in development?
I spent 3 months testing, with good results playing with translation and emotion implementations, and now nothing works!
Even with OpenAI and the help of the reference documentation and code.
So I want to do voice translation with a reference WAV file, as I did one month ago. Which version of the Python module is the stable one?
Best regards
core to take in all the C++/CUDA ops from torchaudio?
This is something we've considered. It's true that it would simplify torchaudio's side, but it would offload some of the cost and debt onto core, so it's not a simple decision to make. At the moment, we think that porting the ops to the stable ABI is our best bet.
Here from the deprecation warning.
Would appreciate an alternative that allows retaining some of these ops going forward (I'm developing an alignment library that is using forced_align).
What does it imply in practical terms that these will be ported to Pytorch's "stable ABI"? Will users be able to keep using them via Pytorch? Sorry for asking what may be an obvious question, but I'm just not knowledgeable about what it means for something to be ported to ABI.
This platform is a buggy disorganized mess. Why would you be deprecating functionality that doesn't yet exist elsewhere (e.g., Windows support for TorchCodec) or is now much slower (e.g., removing CUDA extensions without a suitable replacement)?
When my program runs:
print(f"{torchaudio.list_audio_backends()}")
[]
The response is now:
UserWarning: torchaudio._backend.list_audio_backends has been deprecated. This deprecation is part of a large refactoring effort to transition TorchAudio into a maintenance phase. The decoding and encoding capabilities of PyTorch for both audio and video are being consolidated into TorchCodec. Please see https://github.com/pytorch/audio/issues/3902 for more information. It will be removed from the 2.9 release.
So what code should I use instead if the sound file backend is not recognized?
import warnings
warnings.filterwarnings("ignore", message=r".*(maintenance phase|TorchCodec).*") # RIP torchaudio
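A narrower variant is to silence only this deprecation message and only around the calls that trigger it, so unrelated warnings stay visible. This is a sketch using the standard `warnings` module; the helper name `quiet_torchaudio_deprecations` is hypothetical, and the pattern is taken from the warning text quoted above.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def quiet_torchaudio_deprecations():
    """Hypothetical helper: ignore the maintenance-phase UserWarning
    inside the with-block only; filters are restored on exit."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "ignore",
            # The pattern is matched from the start of the message,
            # hence the leading ".*".
            message=r".*TorchAudio into a maintenance phase.*",
            category=UserWarning,
        )
        yield
```

It would be used as `with quiet_torchaudio_deprecations(): ...` around the deprecated call. Note that this hides the warning but does not replace the deprecated API.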
This will break legacy projects that rely on torch>2 since the expectation is non-breaking changes without going to torch 3. It will also create a need to detect APIs or support either pre-or-post refactor pytorch. Meaning that one project which could previously be used with 2.0 to 2.7 will now have to choose between 2.8 or <2.8 support. Therefore limiting the ability for let's say, Matcha TTS to be reused in new models that want to use Torch 2.8.
Meanwhile, new CUDA and new GPU support is likely to push new versions of pytorch to be required.
What is the viability of building old TorchAudio versions, such as 2.7.0 with new Pytorch, such as 2.8.0? It might involve setting the torchaudio to a fake version, i.e. 2.8.0 (while actually having 2.7.0's code) for ease of installation.
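For projects that need to branch on the installed release, a small version gate is one way to "detect" the refactor. The sketch below is stdlib-only; `release_tuple` and `decoding_apis_deprecated` are hypothetical helpers, and the (2, 8) threshold comes from the announcement that deprecation warnings start in 2.8.

```python
import re
from typing import Tuple


def release_tuple(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as '2.8.0+cu126'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def decoding_apis_deprecated(version: str) -> bool:
    """True for releases at or after 2.8, per the announcement."""
    # Compare integer tuples; comparing raw strings would rank
    # '2.10' below '2.8'.
    return release_tuple(version) >= (2, 8)


print(decoding_apis_deprecated("2.7.1"), decoding_apis_deprecated("2.10.0"))
```

In practice the version string would come from `torchaudio.__version__`.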
This will break legacy projects that rely on torch>2 since the expectation is non-breaking changes without going to torch 3. It will also create a need to detect APIs or support either pre-or-post refactor pytorch. Meaning that one project which could previously be used with 2.0 to 2.7 will now have to choose between 2.8 or <2.8 support. Therefore limiting the ability for let's say, Matcha TTS to be reused in new models that want to use Torch 2.8. Meanwhile, new CUDA and new GPU support is likely to push new versions of pytorch to be required.
TBH I doubt they care. A few years ago they randomly dropped complex32 support for STFT... Over the years, APIs have been changing a bit erratically. I believe the torchaudio team is understaffed.
Just compare installing jax where you have jax for cpu and jax[cuda_XX] for cuda versions.
PyTorch pretty much needs a shell script to install the correct version. On Windows it installs pytorch-cpu by default, but on Linux it installs PyTorch with CUDA by default. It's a mess.
Are you, at least, going to propose alternatives to the deprecated APIs?
For example, forced alignment?
https://docs.pytorch.org/audio/stable/generated/torchaudio.functional.forced_align.html#torchaudio.functional.forced_align
@empz the current status is still #3902 (comment). In all likelihood, we'll be able to preserve forced_align and the other C++ / CUDA operators of torchaudio.
@empz the current status is still #3902 (comment). In all likelihood, we'll be able to preserve forced_align and the other C++ / CUDA operators of torchaudio.
Oh really? That's great to read.
The doc page still says it's deprecated and going to be removed in 2.9 though.
Yes, and it will still say something along those lines in the 2.9 version that we'll publish in the next few weeks. We'd rather be overly pessimistic and raise a warning about a deprecation that may eventually not happen than delete something without a warning.