Inconsistent results, Mean Square Error is high between runs.
Closed this issue · 2 comments
This is based on the code in the master branch of this repository. Two runs of get_audio_embeddings
on the same audio file does not produce identical results, and the MSE is 0.1367.
My testing code:
# Load model (Choose between versions '2022' or '2023')
from CLAPWrapper import CLAPWrapper as CLAP
import torch
with torch.no_grad():
clap_model = CLAP("/Users/azalea/Downloads/CLAP_weights_2023.pth", version = '2023', use_cuda=False)
# Extract text embeddings
# text_embeddings = clap_model.get_text_embeddings(class_labels: List[str])
audio_embeddings_1 = clap_model.get_audio_embeddings(["/Users/azalea/Downloads/test.wav"])
audio_embeddings_2 = clap_model.get_audio_embeddings(["/Users/azalea/Downloads/test.wav"])
print(audio_embeddings_1)
print(audio_embeddings_2)
# Compute mean square error
mse = torch.mean((audio_embeddings_1 - audio_embeddings_2)**2)
print(mse)
OS: macOS Sonoma 14.0
Python: Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:39:40) [Clang 15.0.7 ]
The audio file metadata:
> ffprobe test.wav -hide_banner
Input #0, wav, from 'test.wav':
Metadata:
artist : ハンバート ハンバート
date : 2006
title : 日が落ちるまで
album : 道はつづく
track : 7
encoder : Lavf60.3.100
Duration: 00:04:56.49, bitrate: 1411 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, 2 channels, s16, 1411 kb/s
Hi @hykilpikonna, the code supports files up to 7 seconds in length. So if you pass a file greater than 7 seconds (in your case it's ~5mins) then it will randomly sample a 7-second segment and provide embeddings/predictions on that segment. I think this is the cause of high variance in predictions.
I would recommend either chunking your file in 7-second (or lower) files or updating CLAPWrapper.py to chunk and accumulate predictions.
Hi @hykilpikonna, the code supports files up to 7 seconds in length. So if you pass a file greater than 7 seconds (in your case it's ~5mins) then it will randomly sample a 7-second segment and provide embeddings/predictions on that segment. I think this is the cause of high variance in predictions.
I would recommend either chunking your file in 7-second (or lower) files or updating CLAPWrapper.py to chunk and accumulate predictions.
Thank you for the clarification.