csuhan/OneLLM

Vague output for audio

lixinghe1999 opened this issue · 3 comments

I slightly modified the audio eval code to run on my own dataset, but the outputs are vague even when the audio is speech.
They all look like the ones below:

  1. A device is beeping and it gets louder and louder.
  2. A machine is running and making a high pitched sound.
  3. A machine is running and then stops suddenly.

I attach my code below:

import torch

# Assumes conv_templates and make_audio_features are imported from the OneLLM codebase,
# and that model / target_dtype are already loaded.
def inference_onellm(model, target_dtype, images, modal=['image']):
    # Pick a prompt per modality; if several modalities are passed, the last matching branch wins.
    if 'imu' in modal:
        inps = ['Describe the motion.'] * len(images)
    if 'audio' in modal:
        inps = ['Provide a one-sentence caption for the provided audio.'] * len(images)
        # inps = ['Provide a one-sentence action description for the provided audio.'] * len(images)
    if 'image' in modal:
        inps = ['Describe the scene.'] * len(images)
    images = images.cuda().to(target_dtype)
    prompts = []
    for inp in inps:
        conv = conv_templates["v1"].copy()
        conv.append_message(conv.roles[0], inp)
        conv.append_message(conv.roles[1], None)
        prompts.append(conv.get_prompt())

    with torch.cuda.amp.autocast(dtype=target_dtype):
        responses = model.generate(prompts, images, 128, temperature=0.1, top_p=0.75, modal=modal)
        outputs = []
        for response, prompt in zip(responses, prompts):
            # Strip the echoed prompt and keep only the first answer segment.
            response = response[len(prompt):].split('###')[0]
            response = response.strip()
            outputs.append(response)
    return outputs

audio = torch.tensor(make_audio_features('tmp_onellm.wav', mel_bins=128).transpose(0, 1)[None, None])
result_audio = inference_onellm(model, target_dtype, audio, modal=['audio'])

Hi @lixinghe1999, our model is mainly trained on natural sounds like bird chirping, dog barking, and trains passing, so it is hard for it to understand human speech. Here are two solutions to enhance it:

  • Run Stage II (multimodal-text alignment) on speech-text data. However, this requires joint training with the other modalities.
  • Add a pretrained speech encoder (e.g. Whisper) to extract speech information. You can refer to https://github.com/QwenLM/Qwen-Audio; a minimal sketch of this option is shown after this list.
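For the second option, here is a minimal sketch of extracting speech features with a pretrained Whisper encoder via the HuggingFace transformers API. The checkpoint name is just an example, and how the resulting features are fused into OneLLM (projection layer, joint training) is not shown; see Qwen-Audio for a complete integration.

import torch
import torchaudio
from transformers import WhisperProcessor, WhisperModel

# Load the clip and resample to Whisper's expected 16 kHz sample rate.
wav, sr = torchaudio.load('tmp_onellm.wav')
wav = torchaudio.functional.resample(wav, sr, 16000).mean(dim=0)

# "openai/whisper-base" is only an example checkpoint; any Whisper size works.
processor = WhisperProcessor.from_pretrained('openai/whisper-base')
whisper = WhisperModel.from_pretrained('openai/whisper-base').eval()

# log-mel features -> encoder hidden states of shape (1, 1500, hidden_dim).
input_features = processor(wav.numpy(), sampling_rate=16000, return_tensors='pt').input_features
with torch.no_grad():
    speech_tokens = whisper.encoder(input_features).last_hidden_state

# speech_tokens would then need a projection layer and joint training
# before the LLM can consume them, as done in Qwen-Audio.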

Thank you for your quick reply. However, it still outputs meaningless results for other sounds, such as musical instruments. Can you give me some hints on how to fix this? I believe retraining should not be necessary.

Could the audio duration be the issue? Since the IMU duration is fixed to 2 seconds, I also fixed the audio duration to 2 seconds.

It may also be related to the sampling length. We sample 1024 frames in total.

p = target_length - n_frames
if p > 0:
    # Pad the time axis (rows) with zeros up to target_length frames.
    m = torch.nn.ZeroPad2d((0, 0, 0, p))
    fbank = m(fbank)
elif p < 0:
    # Crop clips that are longer than target_length.
    fbank = fbank[0:target_length, :]
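For reference, here is a minimal sketch of how a 2-second clip maps onto the 1024-frame budget. It uses torchaudio's Kaldi-compatible fbank with a 10 ms frame shift; assuming make_audio_features behaves similarly (the exact parameters may differ), a 2-second clip produces only about 200 frames, so most of the padded input is silence.

import torch
import torchaudio

target_length = 1024  # number of frames the model samples

wav, sr = torchaudio.load('tmp_onellm.wav')
# 128-bin Kaldi-style log-mel filterbank with a 10 ms frame shift
# (assumed to roughly match make_audio_features).
fbank = torchaudio.compliance.kaldi.fbank(
    wav, sample_frequency=sr, num_mel_bins=128, frame_shift=10)

n_frames = fbank.shape[0]  # ~200 frames for a 2 s clip
p = target_length - n_frames
if p > 0:
    # Zero-pad the time axis up to 1024 frames: for a 2 s clip,
    # roughly 80% of the resulting input carries no signal.
    fbank = torch.nn.ZeroPad2d((0, 0, 0, p))(fbank)
elif p < 0:
    fbank = fbank[0:target_length, :]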