DSP to parse audio signal into MIDI sequence

Question

TurkeyMan opened this issue 11 years ago · 7 comments

Necessary to support vocals and 'pro' guitar.

Must be very low latency!

Answer 1 · 2014-01-08T17:43:21.000Z

Here is a short-term FFT analyzer.
https://github.com/p0nce/dplug/blob/master/dsp/dplug/dsp/fft.d

If I understand correctly, you want blind separation of many sources mixed together. All I know is that for monophonic signals, time-domain methods are faster, more accurate and lower latency than FFT-based ones. For polyphonic signals that all breaks down and you have to work in the frequency domain, which brings quite a lot of latency.
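To put a rough number on the frequency-domain latency: the sketch below estimates how long an FFT window must be for its bin spacing to separate two notes a semitone apart. The bin-spacing criterion, the 44.1 kHz sample rate and the function names are assumptions for illustration; real analyzers can do better with interpolation or phase information, so treat this as an order of magnitude only.

```python
import math

def semitone_gap_hz(freq_hz: float) -> float:
    """Distance in Hz from freq_hz to the note one semitone above it."""
    return freq_hz * (2.0 ** (1.0 / 12.0) - 1.0)

def window_for_semitone(freq_hz: float, sample_rate: float) -> int:
    """Smallest window length N whose FFT bin spacing (sample_rate / N)
    is no wider than the semitone gap at freq_hz.
    Assumption: bin spacing is used as a crude resolution criterion."""
    return math.ceil(sample_rate / semitone_gap_hz(freq_hz))

def window_latency_ms(window_size: int, sample_rate: float) -> float:
    """Time needed to fill one analysis window, in milliseconds."""
    return 1000.0 * window_size / sample_rate

if __name__ == "__main__":
    sample_rate = 44100.0   # assumed capture rate
    low_e = 82.41           # open low E string of a guitar, in Hz
    n = window_for_semitone(low_e, sample_rate)
    print(n, "samples,", round(window_latency_ms(n, sample_rate)), "ms")
```

For the low E string this comes out at roughly 9000 samples, or about 200 ms of audio, before the window is even full.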

Do you really need low latency? You might preprocess the songs.

Answer 2 · 2014-01-09T00:28:05.000Z

That helps! :)

I suspect the output will need a lot of filtering/smoothing, which will make it fairly tricky to get accurate readings at very low latency.
Different voices (male/female), and picking out up to 6 signals from a mixed guitar signal... these need to be made robust.

Answer 3 · 2014-01-09T10:45:57.000Z

OK (stop me if I'm wrong) the inputs are:

monophonic voice signal (a)
polyphonic guitar chords mixed together (b)

Desired output:

note onset / off
pitch

For (a), Autotune claims to use autocorrelation methods (very roughly: FFT, inverse FFT of the power spectrum, then peak detection) to detect pitch. There are rumors that it's actually time-domain, and in my experience you can get something like 10 ms latency for typical material.
As for (b), Melodyne separates guitar chords, and it's an impressive tool for pitch, but I really don't know how they do it. You should ask in the DSP section of the KVR Audio forum.
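For the monophonic case (a), a textbook time-domain autocorrelation detector can be sketched as below. This is not Autotune's method and not the dplug code; the function name, the 80 to 1000 Hz search range and the 0.9 peak ratio are assumptions chosen for illustration. It picks the shortest lag whose autocorrelation is a local maximum close to the global maximum, which avoids the usual octave-down error.

```python
import math

def detect_pitch_autocorr(frame, sample_rate, fmin=80.0, fmax=1000.0,
                          peak_ratio=0.9):
    """Return the estimated pitch of one frame in Hz, or None.
    Assumptions: fmin/fmax bound the search, peak_ratio is a tuning knob."""
    n = len(frame)
    min_lag = max(1, int(sample_rate / fmax))
    max_lag = min(int(sample_rate / fmin), n - 2)
    if max_lag <= min_lag:
        return None
    # Unbiased autocorrelation: divide by the number of overlapping samples.
    corr = {}
    for lag in range(min_lag, max_lag + 1):
        total = sum(frame[i] * frame[i + lag] for i in range(n - lag))
        corr[lag] = total / (n - lag)
    best = max(corr.values())
    if best <= 0.0:
        return None  # silence or no periodicity found
    # Shortest lag that is a local maximum and close to the global one.
    for lag in range(min_lag + 1, max_lag):
        if (corr[lag] >= peak_ratio * best
                and corr[lag] >= corr[lag - 1]
                and corr[lag] >= corr[lag + 1]):
            return sample_rate / lag
    return None

if __name__ == "__main__":
    sample_rate = 8000.0  # low rate keeps the pure-Python loop fast
    frame = [math.sin(2.0 * math.pi * 200.0 * i / sample_rate)
             for i in range(512)]
    print(detect_pitch_autocorr(frame, sample_rate))
```

The resolution is limited to integer lags here; a real detector would interpolate around the peak and would also need a voiced/unvoiced decision.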

Note onset/offset detection is not that easy either, since thresholds will inevitably be volume-dependent.
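One common way to keep a level threshold from chattering is hysteresis: a higher level to switch a note on, a lower one to switch it off. The sketch below assumes per-frame RMS levels in dB relative to full scale; the names and the -30/-40 dB thresholds are hypothetical and would still need tuning per input gain, which is exactly the volume dependence mentioned above.

```python
import math

def rms_db(frame):
    """RMS level of a frame in dB relative to full scale (1.0)."""
    if not frame:
        return -120.0
    mean_sq = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_sq) if mean_sq > 1e-12 else -120.0

def note_events(levels_db, on_db=-30.0, off_db=-40.0):
    """Turn per-frame levels into ("on", index) / ("off", index) events.
    Assumption: on_db and off_db are hand-picked example thresholds."""
    events = []
    active = False
    for i, level in enumerate(levels_db):
        if not active and level >= on_db:
            active = True
            events.append(("on", i))
        elif active and level <= off_db:
            active = False
            events.append(("off", i))
    return events

if __name__ == "__main__":
    print(note_events([-60.0, -25.0, -35.0, -35.0, -45.0, -20.0]))
```

Levels between the two thresholds keep the current state, so a note that decays slowly is not cut into several short notes.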

Answer 4 · 2014-01-09T11:01:19.000Z

Sounds more or less right to me.
I have no idea how the polyphonic signal separation is done, but the vox one sounds about right.

10ms is probably okay. Frames are 16ms, and the UI layer draws later in the frame, so it can be afforded the better part of the frame (most time spent rendering the background scene).
I don't know how bad it would feel if visual response was a frame late... just one frame might be okay, but 2 is a lot. I can easily feel 2, and I'm personally pretty sensitive to even one frame latency.

It's a pretty involved piece of work. Hopefully someone more qualified than me steps forward to have a go at it! :)

Answer 5 · 2014-01-09T11:52:10.000Z

I will probably add to dplug a pitch detector that I wrote for voice; I just need to port it from C++. It was meant to be secret, but what the heck. It also works for monophonic harmonic signals like a single guitar note, but strangely not for pure sines.

Unfortunately, the latency of the audio API (and the buffer size) has a much higher impact than the detection itself.
To get a simultaneous feel I had to make the audio host use ASIO and lower the buffer size to a few ms.
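A back-of-the-envelope budget shows why the buffer size dominates. The sketch below counts only one input buffer plus the roughly 10 ms detection figure quoted earlier, and compares the total against the 16 ms video frame; the 44.1 kHz rate and the function names are assumptions, and real drivers add overhead that is not modelled here.

```python
def buffer_latency_ms(buffer_frames: int, sample_rate: float) -> float:
    """Time covered by one audio buffer, in milliseconds."""
    return 1000.0 * buffer_frames / sample_rate

def frames_late(total_latency_ms: float, frame_ms: float = 16.0) -> int:
    """Whole video frames that elapse before the response can be drawn."""
    return int(total_latency_ms // frame_ms)

if __name__ == "__main__":
    sample_rate = 44100.0  # assumed capture rate
    detection_ms = 10.0    # rough detector latency quoted in the thread
    for size in (128, 512, 1024):
        total = buffer_latency_ms(size, sample_rate) + detection_ms
        print(size, round(total, 1), "ms ->", frames_late(total), "frame(s) late")
```

With these assumptions a 128-sample buffer stays inside one video frame, while a 1024-sample buffer already costs two frames.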

Answer 6 · 2014-01-09T12:16:43.000Z

Yeah, I suspect some headaches with the capture APIs. We'll see how it goes when we get there.
I think the simpler instruments like drums will come first ;)

Answer 7 · 2014-03-22T21:55:07.000Z

https://github.com/p0nce/dplug/blob/master/dsp/dplug/dsp/goldrabiner.d

I've made a test program which outputs a WAV with pitch, voiced/unvoiced and a crude resynthesized output with volume = 1.
https://github.com/p0nce/dplug/blob/master/examples/pitch_detect/pitch_detect.d

The thing to understand is that when there is no pitch (voicedness close to 0), the pitch output is wrong and shouldn't be used.

It can be used for monophonic voice and probably other instruments.
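Going from that output to MIDI notes could look like the sketch below. It does not use the dplug API: the per-frame pitch and voicedness lists, the function names and the 0.5 voicedness threshold are all assumptions. The Hz to MIDI mapping is the standard equal-temperament one with A4 = 440 Hz = note 69.

```python
import math

def hz_to_midi(freq_hz: float) -> int:
    """Nearest MIDI note number for a frequency (A4 = 440 Hz = 69)."""
    return int(round(69 + 12 * math.log2(freq_hz / 440.0)))

def frames_to_notes(pitches_hz, voicedness, threshold=0.5):
    """Per-frame MIDI note numbers, with None wherever the frame is unvoiced.
    Assumption: threshold is an example value, not taken from dplug."""
    notes = []
    for freq, voiced in zip(pitches_hz, voicedness):
        if voiced >= threshold and freq > 0.0:
            notes.append(hz_to_midi(freq))
        else:
            notes.append(None)  # pitch is unreliable here, ignore it
    return notes

if __name__ == "__main__":
    print(frames_to_notes([440.0, 123.0, 220.0], [0.9, 0.1, 0.8]))
```

Consecutive frames with the same note number can then be merged into a single note-on/note-off pair.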