Can I control the duration of the Text-guided style transfer's output audio?

Question

Opened this issue 8 months ago · 1 comment

I tested this and found that the output audio is always 10 seconds long. How can I modify the code so that the output audio duration matches the input audio duration?

Answer 1 · 2024-03-29T04:51:43.000Z

This happens because we first pad the spectrogram (norm_spec) to a width of 1024 before sending it to StableDiffusionImg2ImgPipeline, and then keep only the original width of the spectrogram in the output. The process looks like this:

audio, sampling_rate = load_wav(audio_path)
audio, spec = get_mel_spectrogram_from_audio(audio)
norm_spec = normalize_spectrogram(spec)
norm_spec = norm_spec[:,:, width_start:width_start+width]
norm_spec = pad_spec(norm_spec, 1024)

.....

with torch.autocast("cuda"):
    output_spec = pipe(
        prompt=prompt,
        image=norm_spec,
        num_inference_steps=100,
        generator=generator,
        output_type="pt",
        strength=strength,
        guidance_scale=7.5,
    ).images[0]

# add to image_list
output_spec = output_spec[:, :, :width]
....

Hence, there are two options. The first is to keep only the first width frames of the output spectrogram, as we do now. The other is to skip the padding before sending the spectrogram to the pipeline, although I have not tried this.
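
To make the two options concrete, here is a minimal NumPy sketch of the pad-then-crop idea. It is not the repository's code: pad_time_axis, next_multiple and frames_to_seconds are hypothetical helpers, zero padding is an assumption about what pad_spec does, and the random array only stands in for norm_spec. The second variant pads to the next multiple of 8 instead of 1024, on the assumption that the pipeline needs a width divisible by 8; that requirement is not confirmed in this thread. The hop length and sampling rate in the last line are also assumed values, used only to show how a frame count maps to seconds.

```python
import numpy as np


def pad_time_axis(spec, target_width):
    """Zero-pad the last (time) axis of `spec` up to `target_width` frames.

    Hypothetical stand-in for pad_spec; the real helper may pad differently.
    """
    pad = target_width - spec.shape[-1]
    if pad < 0:
        raise ValueError("spectrogram is wider than target_width")
    pad_widths = [(0, 0)] * (spec.ndim - 1) + [(0, pad)]
    return np.pad(spec, pad_widths, mode="constant")


def next_multiple(n, multiple=8):
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def frames_to_seconds(n_frames, hop_length, sampling_rate):
    """Duration covered by `n_frames` spectrogram frames."""
    return n_frames * hop_length / sampling_rate


# Random data standing in for norm_spec: (channels, mel bins, time frames).
rng = np.random.default_rng(0)
width = 1003
norm_spec = rng.random((3, 256, width)).astype(np.float32)

# Option 1: pad to a fixed 1024 frames, then crop the output back to `width`.
fixed = pad_time_axis(norm_spec, 1024)
cropped_fixed = fixed[:, :, :width]

# Option 2 (untested idea): pad only to the next multiple of 8, then crop.
tight = pad_time_axis(norm_spec, next_multiple(width, 8))
cropped_tight = tight[:, :, :width]

print(fixed.shape, tight.shape, cropped_tight.shape)

# Assumed values (hop of 160 samples, 16 kHz), not taken from this thread.
print(frames_to_seconds(1024, hop_length=160, sampling_rate=16000))
```

In both variants the crop to width is what sets the output length, so the output duration follows whatever width was selected before padding.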