yxlllc/DDSP-SVC

Failure at "audio_callback" in gui_diff.py preventing usage

danieloneill opened this issue · 2 comments

Of my sound devices, only my USB headset works. Selecting pipewire, default (which is a Pulse backend), or JACK results in errors. I suspect that some (or all) of these are really sounddevice issues.

Either way, I get no audio with any device selection other than the USB headset directly.

event: start_vc
input device:21:default (ALSA)
output device:21:default (ALSA)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | current directory is /Sabrent/gpt/DDSP-SVC
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-10-31 17:04:17 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}

Starting callback
Infering...
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
| Load HifiGAN:  pretrain/nsf_hifigan/model
...
sola_shift: 0
Exception ignored from cffi callback <function _StreamBase.__init__.<locals>.callback_ptr at 0x7fa5f96b6f70>:
Traceback (most recent call last):
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 886, in callback_ptr
    return _wrap_callback(
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 2687, in _wrap_callback
    callback(*args)
  File "/Sabrent/gpt/DDSP-SVC/gui_diff.py", line 489, in audio_callback
    outdata[:] = temp_wav[: - self.crossfade_frame, None].repeat(1, 2).cpu().numpy()
ValueError: could not broadcast input array from shape (35280,2) into shape (35280,64)
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
event: stop_vc
Audio block passed.
ENDing VC
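
For reference, the ValueError above can be reproduced without any audio hardware. This is a minimal sketch that assumes the output buffer handed to the callback has 64 channels, as the traceback reports, and that the block length is block_time × samplerate. The variable names are illustrative and not taken from gui_diff.py.

```python
import numpy as np

samplerate = 44100
block_time = 0.8
frames = round(block_time * samplerate)  # 35280, matching the traceback

# Stand-in for the stereo block the callback builds (mono signal repeated twice).
stereo_block = np.zeros((frames, 2), dtype=np.float32)
# Stand-in for the buffer passed to the callback when the stream has 64 channels.
outdata = np.zeros((frames, 64), dtype=np.float32)

try:
    # NumPy can broadcast a size-1 axis, but not a size-2 axis, onto 64 columns.
    outdata[:] = stereo_block
    error = None
except ValueError as exc:
    error = str(exc)

print(error)
```

A (frames, 1) array would broadcast across all 64 columns, whereas a (frames, 2) array cannot, which is why the assignment fails only when the device reports more than two output channels.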

When using "pipewire":

event: start_vc
input device:21:default (ALSA)
output device:21:default (ALSA)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt
 [Encoder Model] Content Vec
 [Loading] pretrain/contentvec/checkpoint_best_legacy_500.pt
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | current directory is /Sabrent/gpt/DDSP-SVC
2023-10-31 17:04:17 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2023-10-31 17:04:17 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}

Starting callback
Infering...
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
| Load HifiGAN:  pretrain/nsf_hifigan/model
...
Audio block passed.
Removing weight norm...
sola_shift: 0
Exception ignored from cffi callback <function _StreamBase.__init__.<locals>.callback_ptr at 0x7fa5801e1700>:
Traceback (most recent call last):
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 886, in callback_ptr
    return _wrap_callback(
  File "/Sabrent/gpt/DDSP-SVC/venv/lib64/python3.9/site-packages/sounddevice.py", line 2687, in _wrap_callback
    callback(*args)
  File "/Sabrent/gpt/DDSP-SVC/gui_diff.py", line 489, in audio_callback
    outdata[:] = temp_wav[: - self.crossfade_frame, None].repeat(1, 2).cpu().numpy()
ValueError: could not broadcast input array from shape (35280,2) into shape (35280,64)
Audio block passed.
Audio block passed.
Audio block passed.
Audio block passed.
event: stop_vc
Audio block passed.
ENDing VC

The last one, JACK, is the most baffling. The process dies with SIGKILL, which I'm not sending myself. There is nothing about it in journalctl either, so I'm not sure what is killing it:

event: start_vc
input device:22:G733 Gaming Headset Mono (JACK Audio Connection Kit)
output device:25:G733 Gaming Headset Analog Stereo (JACK Audio Connection Kit)
crossfade_time:0.06
buffer_num:4
samplerate:44100
block_time:0.8
prefix_pad_length:3.1100000000000003
mix_mode:None
using_cuda:True
 [DDSP Model] Combtooth Subtractive Synthesiser
 [Loading] /Sabrent/gpt/DDSP-SVC/exp/diffusion-test/model_100000.pt

Starting callback
Infering...
Audio block passed.
Killed
(venv) [doneill@galena DDSP-SVC]$ 

In my tests, MME is the only stable driver. The others behave erratically, which may be a problem in the sounddevice library.

I've found that if I modify sounddevice.py to force 1 input channel and 2 output channels, it works as expected. It seems the output stream is being opened with the device's "available channels", which on pipewire devices is typically 64, while the audio sample array only contains 2 channels of data.
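
As an alternative to patching sounddevice.py, the callback could tolerate any channel count. The sketch below is only an illustration of that idea: `write_stereo` is a hypothetical helper, not part of gui_diff.py or sounddevice, and it assumes the converted audio is available as a 1-D mono array with the same number of frames as the output buffer.

```python
import numpy as np

def write_stereo(outdata, mono, active_channels=2):
    """Write a mono block to the first channels of outdata, silencing the rest.

    Hypothetical helper: outdata has shape (frames, channels), as in a
    sounddevice callback, and mono has shape (frames,).
    """
    n = min(active_channels, outdata.shape[1])
    outdata.fill(0)  # silence any channels that are not written below
    # A (frames, 1) column broadcasts across the n selected channels.
    outdata[:, :n] = np.asarray(mono, dtype=outdata.dtype).reshape(-1, 1)

frames = 35280
mono = np.linspace(-1.0, 1.0, frames, dtype=np.float32)

out64 = np.ones((frames, 64), dtype=np.float32)  # pipewire-style 64-channel buffer
write_stereo(out64, mono)

out2 = np.ones((frames, 2), dtype=np.float32)    # plain stereo buffer
write_stereo(out2, mono)
```

The sounddevice stream constructors also accept a `channels` argument (a single int, or an (input, output) pair), and when it is omitted the maximum channel count of the device is used. Passing something like `channels=(1, 2)` where gui_diff.py opens its stream would therefore probably have the same effect as the sounddevice.py modification, though I have not verified this against the project code.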