JonathanFly/bark

[Request] Support DirectML for Windows AMD GPU users

Milor123 opened this issue · 22 comments

Hi guys!! I would like to use this on AMD. How can I run this project with DirectML so it uses my RX 6700 XT instead of the CPU?
Could you help me with the port? I could be a tester for your development branches. Thank you very much ❤️‍🔥

I don't have access to an AMD GPU, but if you are willing to be the guinea pig and test for me, I could take a crack at it. Maybe late weekend, Sunday-ish.

Greeeattt!!! Love u!!! Yep, that works for me, I have plenty of time! 😄 The weekend is perfect!
Do you have Telegram or Discord, or should we write here?

I hang out in the official Bark Discord all the time, same name JonathanFly, you can DM me there. Link is here: https://github.com/suno-ai/bark

I hope you are making progress with AMD support. I kinda need to dub my game :-)

I made enough progress to know it's pretty tricky. But it should get easier soon; the Bark model is about to be ported to Hugging Face Transformers.

If you check here: https://github.com/huggingface/transformers/pull/24086 you can see they are making good progress. As soon as they are done, I think I can support AMD. Edit: Oh, I forgot GitHub literally puts a big notification into any thread you link to. Hopefully if I edit it will go away...

Thanks for your efforts. I am simply too poor for a new Nvidia graphics card and am staying with AMD^^ But it is a great way to give a voice to cheap NPCs in the game.

I got it working in DirectML, very hackily; I'll post an update soon. On my 3090 it's only a bit faster than CPU, so I'm not sure it's going to help much. But I do have a 16-core CPU, and the DirectML version is as fast using just one core plus the GPU. I didn't really fix anything; I just made any torch functions that didn't work fall back to CPU numpy instead.
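Roughly the trick in question, as a minimal sketch (cpu_fallback and safe_multinomial are illustrative names, not the code actually used in this fork):

```python
import torch

def cpu_fallback(fn):
    """Run an op the DML backend can't handle on the CPU, then move
    the result back to the original device."""
    def wrapped(*args):
        device = next(a.device for a in args if torch.is_tensor(a))
        cpu_args = [a.detach().cpu() if torch.is_tensor(a) else a for a in args]
        out = fn(*cpu_args)
        return out.to(device) if torch.is_tensor(out) else out
    return wrapped

# e.g. if torch.multinomial were one of the broken ops on the DML device:
safe_multinomial = cpu_fallback(torch.multinomial)
```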

Can you try this?

https://github.com/JonathanFly/bark/tree/bark_amd_directml_test#-bark-amd-install-test-

I don't know if it works on AMD, or, if it does, whether it's any faster than CPU. But it might be?

Brother, you are my hero^^ It works :-) Yes, it is slow, but for me it is a lot faster than CPU alone. I must admit that I have very little idea about Python, so thanks for the detailed tutorial. I am more of a Python power user :)

Two things about the installation:

  1. Under Win11 I had to start the Anaconda prompt with admin rights right at the beginning and change to the user directory via cd. But then it worked fine.
  2. set SUNO_USE_DIRECTML=1 did not work as you described; I could only set it manually in config.py. (A sketch of the intended behavior follows this list.)
    I will test it extensively over the next evenings!
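For what it's worth, the env var is presumably meant to drive device selection roughly like this (a minimal sketch, assuming the torch-directml pip package; pick_device is an illustrative name, not Bark's actual function). Note that in cmd, set SUNO_USE_DIRECTML=1 only applies to the current shell session, so it has to be set in the same Anaconda prompt before launching Python:

```python
import os
import torch

def pick_device():
    # Honor SUNO_USE_DIRECTML=1 from the environment instead of
    # editing config.py by hand; otherwise fall back to CUDA or CPU.
    if os.environ.get("SUNO_USE_DIRECTML", "0") == "1":
        import torch_directml  # pip package: torch-directml
        return torch_directml.device()  # default DirectML adapter
    if torch.cuda.is_available():
        return torch.device("cuda")
    return torch.device("cpu")
```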

Test System:
Operating System: Windows 11 Pro 64-bit (10.0, Build 22621)
Language: German (Regional Setting: German)
System Manufacturer: Gigabyte Technology Co., Ltd.
System Model: X570 AORUS PRO
Processor: AMD Ryzen 5 3600 6-Core Processor (12 CPUs), ~3.6GHz
Memory: 32768MB RAM
DirectX Version: DirectX 12
Card name: AMD Radeon RX 5700 XT VRAM 8176 MB GDDR6 1750 MHz
Driver Version: 22.40.57.05-230523a-392410C-AMD-Software-Adrenalin-Edition

Wow, you are the first confirmed success that it even works on AMD. And it's faster than CPU, that's all I was hoping for!

What was the error for point 1, i.e. the reason you had to start it with admin rights?

There is a memory-leak bug in DirectML. I am not sure how to deal with it. Maybe just restart Bark from zero every single time.
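The crudest version of that restart-every-time workaround is to run each generation in a fresh Python process, so everything the leak holds onto is released when the process exits. A sketch (the --text_prompt flag is my reading of this repo's README; the prompt list is illustrative):

```python
import subprocess
import sys

# One fresh process per generation: the OS reclaims all leaked
# GPU/driver memory when each process exits.
prompts = ["Hello there.", "Another line for a cheap NPC."]
for text in prompts:
    subprocess.run(
        [sys.executable, "bark_perform.py", "--text_prompt", text],
        check=True,
    )
```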

I think it was missing write/access rights right at the beginning. If more errors come up, I'll make notes. It is definitely faster than with the CPU.

If you get a chance: I added a torch 2.0 install to the readme. I can't figure out if it's supposed to WORK for AMD on Windows or not. The Microsoft page says no, but it seems like some people are using it. When I tried it I got a decent 30 or 40 percent speed boost over 1.13 DirectML. But I don't have a real AMD card, so it may not work.

torch 2.0 works :-) It would be good if you could see the total time needed for the audio generation in the shell; then you could compare CPU vs torch 1.13 vs torch 2.0 better.

history_prompt: bark\assets\prompts\v2\en_speaker_1.npz
(1 of 1 iterations)
Segment Breakdown (Speaker: random)
┏━━━┳━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ # ┃ Words ┃ Time Est  ┃ Splitting long text aiming for 165 chars max 205 ┃
┡━━━╇━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ 1 │ 38    │ !15.20 s! │ You can't be a real country unless you have a    │
│   │       │ 181 chars │ beer and an airline. It helps if you have some   │
│   │       │           │ kind of a football team, or some nuclear         │
│   │       │           │ weapons, but at the very least you need a beer!  │
└───┴───────┴───────────┴──────────────────────────────────────────────────┘
segment_text: You can't be a real country unless you have a beer and an airline. It helps if you have some kind of a football team, or some nuclear weapons, but at the very least you need a beer!
--Segment 1/1: est. 15.20s
(1 of 1 iterations)
You can't be a real country unless you have a beer and an airline. It helps if you have some kind of a football team, or some nuclear weapons, but at the very least you need a beer!
-->GPU using DirectML (partial AMD GPU support)
--Loading text model from C:\Users\TestWiese\.cache\suno\bark_v0\text_2.pt to directml (partial AMD GPU support)
_load_model model loaded: 312.3M params, 1.269 loss   generation.py:2108
-->GPU using DirectML (partial AMD GPU support)
--Loading coarse model from C:\Users\TestWiese\.cache\suno\bark_v0\coarse_2.pt to directml (partial AMD GPU support)
_load_model model loaded: 314.4M params, 2.901 loss   generation.py:2108
-->GPU using DirectML (partial AMD GPU support)
--Loading fine model from C:\Users\TestWiese\.cache\suno\bark_v0\fine_2.pt to directml (partial AMD GPU support)
_load_model model loaded: 302.1M params, 2.079 loss   generation.py:2108
-->GPU using DirectML (partial AMD GPU support)
write_audiofile .mp4 saved to bark_samples/You_cant_be_a_r-23-0620-1147-20-SPK-en_speaker_1.mp4   api.py:696
Saved to bark_samples/You_cant_be_a_r-23-0620-1147-20-SPK-en_speaker_1.mp4

You can set this option in this hidden menu:

[screenshot: hidden options menu]

But the easiest way is to type python bark_perform.py instead of python bark_webui.py; that will give you this. Look for the it/s or s/it numbers. That is the speed.

[screenshot: console output showing it/s numbers]

Thank you, that's exactly what I meant. I have not had time to deal with Bark + Infinity yet. I only installed it briefly yesterday to test AMD. I will take a closer look at the whole construct in the evening. Now it makes sense to dig into it :)

I have noticed 2 things (torch 2.0):

  1. I get this warning every now and then:

     C:\Users\Testwiese\bark\bark_infinity\model.py:82: UserWarning: The operator 'aten::tril.out' is currently not supported by the DML backend and will fall back to the CPU. This may have performance implications. (Triggered internally in D:\a_work\1\s\pytorch-directml-plugin\torch_directml\csrc\dml\dml_cpu_fallback.cpp:17.)
     y = torch.nn.functional.scaled_dot_product_attention(q, k, v, dropout_p=self.dropout, is_causal=is_causal)

  2. --show_generation_times True does not seem to work, neither in the GUI nor when I set it to True in config.py. bark_perform in the shell works. (A stopwatch stopgap follows this list.)
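Until that flag works, a crude manual stopwatch around the generation call does the same job. A sketch; generate_audio is a placeholder for whatever Bark entry point you actually call:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    # Print the wall-clock total for whatever runs inside the block.
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f} s total")

# usage:
# with timed("torch 2.0, en_speaker_1"):
#     audio = generate_audio(text_prompt)
```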

The first thing is just a limitation. I could try rewriting the Bark code, or, more likely, somebody already rewrote that function in a Stable Diffusion DirectML fork. But it's not a problem, it's just slower.
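The usual rewrite sidesteps the tril call that scaled_dot_product_attention triggers with is_causal=True: build the causal mask once on CPU (where aten::tril is supported), keep it as a buffer, and do the attention by hand. A minimal sketch of that technique, not the actual patch:

```python
import math
import torch
import torch.nn.functional as F

class CausalAttentionNoTril(torch.nn.Module):
    # q, k, v are assumed to be shaped (batch, heads, seq, head_dim).

    def __init__(self, max_seq_len: int, dropout: float = 0.0):
        super().__init__()
        self.dropout = dropout
        # tril runs once on CPU here; moving the module to the DML device
        # later moves only the finished mask, never the tril op itself.
        mask = torch.tril(torch.ones(max_seq_len, max_seq_len, dtype=torch.bool))
        self.register_buffer("causal_mask", mask.view(1, 1, max_seq_len, max_seq_len))

    def attend(self, q, k, v):
        T = q.size(-2)
        # Plain matmul attention with an explicit causal mask.
        att = (q @ k.transpose(-2, -1)) / math.sqrt(q.size(-1))
        att = att.masked_fill(~self.causal_mask[:, :, :T, :T], float("-inf"))
        att = F.softmax(att, dim=-1)
        att = F.dropout(att, p=self.dropout, training=self.training)
        return att @ v
```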

I'll check the time display, but did you get a boost from 2.0 versus 1.13? On my NVIDIA with DirectML it was maybe 30 or 40 percent.

I will do a test tonight (CET). Since the iteration display in the GUI doesn't work, I need to figure out how to always use the same prompt text and the same speaker, and then I'll test it.

But a short quick test in the GUI (same speaker, same prompt), without specifying iterations, gives this:
torch 1
-->Segment Finished at: 2023-06-21 13:15:38 in 87.97052574157715 seconds
-->All Segments Finished at: 2023-06-21 13:15:38 in 87.98055958747864 seconds
torch 2
-->Segment Finished at: 2023-06-21 13:10:44 in 76.91156601905823 seconds
-->All Segments Finished at: 2023-06-21 13:10:44 in 76.91358089447021 seconds

Looks like coarse goes from 2.4s down to 2.0-2.1s, and semantic gets just a small improvement; from your numbers above that's roughly 88 s down to 77 s overall, about a 13 percent speedup. But still better than nothing. There are a few optimization patches I could pull in, from current PRs in the main Bark repo, that might give another 20 or 30 percent too.

Yes, it's not a huge performance gain. But I'm glad it works at all, and it's definitely better than nothing :-)
[attachment: torchtest2.txt]