meinardmueller/synctoolbox

ELI5 How do I compute the seconds I need to shift a song?

Closed this issue · 3 comments

Hi! I'm sorry, but reading the Jupyter notebook or the code didn't really help.

I was able to run the following code without errors:

import librosa
from synctoolbox.dtw.mrmsdtw import sync_via_mrmsdtw


def find_offset(within_file, find_file):
    Fs = 22050
    N = 2048
    H = 1024
    feature_rate = int(Fs / H)

    audio_1, _ = librosa.load(within_file, sr=Fs)
    audio_2, _ = librosa.load(find_file, sr=Fs)
    chroma_1 = librosa.feature.chroma_stft(y=audio_1, sr=Fs, n_fft=N, hop_length=H, norm=2.0)
    chroma_2 = librosa.feature.chroma_stft(y=audio_2, sr=Fs, n_fft=N, hop_length=H, norm=2.0)
    result = sync_via_mrmsdtw(f_chroma1=chroma_1,
                              f_chroma2=chroma_2,
                              verbose=True,
                              input_feature_rate=feature_rate)
    print(result)

But it's unclear to me how I can calculate the seconds I need to shift the songs in order to align them.

I am currently using a very basic function to compute it, which works in about 90% of cases, but I was looking into this package to get a much better success rate.

import time
import librosa
import numpy as np
from scipy import signal


def find_offset(within_file, find_file, window=10):
    start_time = time.time()
    y_within, sr_within = librosa.load(within_file, sr=None)
    y_find, _ = librosa.load(find_file, sr=sr_within)
    c = signal.correlate(y_within, y_find[:sr_within * window], mode='valid', method='fft')
    peak = np.argmax(c)
    offset = peak / sr_within
    return offset

Or maybe I'm completely misunderstanding what is possible with this package.

To be clear, what I want to achieve is alignment between audio recorded with a phone/camera and the original .mp3 file.

Thank you in advance

Hi @mcosti,

Regarding your question about how to calculate the seconds you need to shift the songs in order to align them: there are two approaches you might find useful.

The first one is to use the offset argument of the librosa.load function. Assuming that you want to shift the audio by 2 seconds, you could use the following:

shifted_audio, _ = librosa.load(filepath, offset=2)  # start reading 2 seconds into the file

See the librosa.load documentation for details.

A second solution is to slice the resulting feature representation, e.g., the chroma features, similar to what you did in your find_offset function. For example, assuming that you want to shift the chroma features by 2 seconds, the following code snippet can be helpful:

import librosa

Fs, N, H = 22050, 2048, 1024  # same parameters as in your snippet above

audio, _ = librosa.load(filepath, sr=Fs)
frame_shift = int(2 * (Fs / H))  # 2 seconds expressed in chroma frames
chroma = librosa.feature.chroma_stft(y=audio, sr=Fs, n_fft=N, hop_length=H, norm=2.0)
shifted_chroma = chroma[:, frame_shift:]

I hope this is helpful.

Hey @yiitozer. I was more trying to find a way to calculate this number of "2" seconds, and I thought maybe I could make use of your amazing work in this library.

But after reading the code more closely, I think I misunderstood the purpose of this library, and it does not help my use case.

Hey @mcosti, that's correct: our library cannot automatically detect that offset. You may find Subsequence DTW helpful for your task.
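
A minimal sketch of that idea, using librosa.sequence.dtw with subseq=True on chroma features; the helper name find_start_offset and the parameter values are illustrative, not part of synctoolbox:

import librosa


def find_start_offset(long_file, short_file, sr=22050, hop_length=1024):
    # Estimate where the short recording starts inside the long one, in seconds.
    y_long, _ = librosa.load(long_file, sr=sr)
    y_short, _ = librosa.load(short_file, sr=sr)

    chroma_long = librosa.feature.chroma_stft(y=y_long, sr=sr, hop_length=hop_length)
    chroma_short = librosa.feature.chroma_stft(y=y_short, sr=sr, hop_length=hop_length)

    # Subsequence DTW: X is the query (short clip), Y is the database (full song).
    _, wp = librosa.sequence.dtw(X=chroma_short, Y=chroma_long, subseq=True)

    # The warping path is returned end-to-start, so the last entry marks where
    # the match begins in the long recording (in chroma frames).
    start_frame = wp[-1, 1]
    return start_frame * hop_length / sr

Assuming the phone/camera recording is contained in the original song, something like find_start_offset('original.mp3', 'phone_recording.wav') (hypothetical filenames) would then return the shift in seconds.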