santi-pdp/segan

Guideline on test audio files

Mauker1 opened this issue · 8 comments

Hello again!

I've successfully trained SEGAN using the same database as in the original paper, and I also managed to test it by enhancing an audio file I recorded with my mic.

But when I tried to test it on another audio file I had sitting around on my computer, I came across this error:

Loading model weights...
[*] Reading checkpoints...
[*] Read SEGAN-59750
test wave shape:  (4800000,)
test wave min:1.52587890625e-05  max:0.007797360420227051
Traceback (most recent call last):
  File "main.py", line 106, in <module>
    tf.app.run()
  File "C:\Users\mauke\AppData\Local\Programs\Python\Python36\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "main.py", line 97, in main
    c_wave = se_model.clean(wave)
  File "C:\Users\mauke\Documents\git\segan\model.py", line 520, in clean
    x_[:len(x)] = x
ValueError: could not broadcast input array from shape (293,16384) into shape (70,16384)

It seems to me that the test audio file isn't quite what the script was expecting. But I did convert it to a 16 kHz .wav file. So what am I missing? Are there any other requirements for the audio format?

Edit: I used sox to downsample the audio from 44.1 kHz to 16 kHz, the same way it's done in the prepare_data.sh script.

It seems that the problem is related to the audio duration.

The audio I was using is five minutes long. I've cropped it to one minute, and it worked.

Is there a duration limit?

Edit: Yeah, the problem was the duration of the audio. The clean method can't handle audio longer than batch_size chunks; with my batch size of 70 and 2**14-sample chunks, that's roughly 70 seconds at 16 kHz (hence the (70, 16384) shape in the error).
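A quick back-of-the-envelope check of that limit (values taken from the traceback above; the 16 kHz sample rate is assumed from the thread):

```python
# Sketch: why a 5-minute file overflows the batch. Assumed values from
# the traceback: batch_size=70, canvas of 2**14 samples, 16 kHz audio.
SAMPLE_RATE = 16000
CANVAS = 2 ** 14          # 16384 samples per chunk
BATCH_SIZE = 70           # rows in the batch tensor

max_samples = BATCH_SIZE * CANVAS
max_seconds = max_samples / SAMPLE_RATE
print(max_seconds)        # ~71.7 s: anything longer cannot fit in one batch

test_wave_len = 4_800_000             # the 5-minute file from the traceback
chunks = -(-test_wave_len // CANVAS)  # ceiling division -> number of chunks
print(chunks)                         # 293 chunks > 70 rows, hence the error
```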

I was using this clean method:

def clean(self, x):
    """ clean a utterance x
        x: numpy array containing the normalized noisy waveform
    """
    # zero pad if necessary
    remainder = len(x) % (2 ** 14)
    if remainder != 0:
        x = np.pad(x, (0, 2**14 - remainder), 'constant', constant_values=0)
    # split files into equal 2 ** 14 sample chunks
    x = np.array(np.array_split(x, int(len(x) / 2 ** 14)))
    x_ = np.zeros((self.batch_size, 2 ** 14))
    x_[:len(x)] = x
    fdict = {self.gtruth_noisy[0]: x_}
    output = self.sess.run(self.Gs[0], feed_dict=fdict)[:len(x)]
    output = output.flatten()
    # remove zero padding if added
    if remainder != 0:
        output = output[:-(2**14 - remainder)]
    return output
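For reference, the broadcast error above can be reproduced with NumPy alone, using the shapes from the traceback:

```python
import numpy as np

# Minimal standalone reproduction of the ValueError: a 5-minute, 16 kHz
# file yields 293 chunks of 2**14 samples, which cannot be copied into a
# batch buffer that has only 70 rows.
x = np.zeros((293, 16384))        # chunked waveform, one row per canvas
x_ = np.zeros((70, 16384))        # batch buffer sized by batch_size
caught = False
try:
    x_[:len(x)] = x               # x_[:293] still has only 70 rows
except ValueError:
    caught = True                 # broadcast fails, as in the traceback
print(caught)
```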

Once I switched back to the old "clean" method, it worked on longer files. The only problem is that it got super slow.

Hey @Mauker1! Yes, this is very slow; it was a dummy implementation (the easiest thing that could be done, at the cost of wasted resources :/). I have another version of this function that batches many canvases in parallel (the one I used for many later experiments).

def clean(self, x):
    """ clean a utterance x
        x: numpy array containing the normalized noisy waveform
    """
    c_res = None
    for beg_i in range(0, x.shape[0], self.canvas_size):
        if x.shape[0] - beg_i < self.canvas_size:
            length = x.shape[0] - beg_i
            pad = self.canvas_size - length
        else:
            length = self.canvas_size
            pad = 0
        x_ = np.zeros((self.batch_size, self.canvas_size))
        if pad > 0:
            x_[0] = np.concatenate((x[beg_i:beg_i + length], np.zeros(pad)))
        else:
            x_[0] = x[beg_i:beg_i + length]
        print('Cleaning chunk {} -> {}'.format(beg_i, beg_i + length))
        fdict = {self.gtruth_noisy[0]: x_}
        canvas_w = self.sess.run(self.Gs[0],
                                 feed_dict=fdict)[0]
        canvas_w = canvas_w.reshape(self.canvas_size)
        print('canvas w shape: ', canvas_w.shape)
        if pad > 0:
            print('Removing padding of {} samples'.format(pad))
            # get rid of last padded samples
            canvas_w = canvas_w[:-pad]
        if c_res is None:
            c_res = canvas_w
        else:
            c_res = np.concatenate((c_res, canvas_w))
    # deemphasize
    c_res = de_emph(c_res, self.preemph)
    return c_res
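The chunk/pad/stitch logic above can be sketched standalone with NumPy, no model required. Here the G-network pass is replaced by an identity function, so the round trip should reproduce the input exactly (the 36480-sample length matches the log later in this thread):

```python
import numpy as np

# Standalone sketch of the chunking logic: split a waveform into
# canvas-size windows, zero-pad the last one, run each window through an
# enhancement function, then stitch the windows back and drop the padding.
CANVAS = 2 ** 14

def clean_sketch(x, enhance=lambda w: w):
    out = None
    for beg in range(0, len(x), CANVAS):
        chunk = x[beg:beg + CANVAS]
        pad = CANVAS - len(chunk)
        if pad > 0:                  # last, short window: zero-pad it
            chunk = np.concatenate((chunk, np.zeros(pad)))
        w = enhance(chunk)           # stand-in for the G network pass
        if pad > 0:
            w = w[:-pad]             # strip the zero padding again
        out = w if out is None else np.concatenate((out, w))
    return out

wave = np.random.randn(36480)        # ~2.3 s at 16 kHz, as in the log
restored = clean_sketch(wave)        # identity enhance -> exact round trip
print(np.allclose(restored, wave))
```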

Hi @Mauker1, please, I need help with the Loading and Prediction section, which is the last section.

I haven't been able to figure it out.

"Then the main.py script has the option to process a wav file through the G network (inference mode), where the user MUST specify the trained weights file and the configuration of the trained network." Where is this configuration made, and what precisely do I have to alter to make the system work? Thanks.

I have solved that issue, but when I tried to test a sample file, my audio came out completely silent (I couldn't hear any sound). What could the problem have been?

I tested another sample and it worked fine, thanks... all that's left is to test with my own generated wav files.

What's your version of Python and TensorFlow?

I have been facing a weird issue while testing. I successfully trained the SEGAN model for 19440 iterations with a batch size of 100. During training, at every save_freq interval, the max and min values of the generated sample audios are printed. There, almost all the audio files range from roughly +0.55 to -0.5.

Now, during testing on the same audio file from the training set, with the same weights, the output behaves like this:

test wave min:-0.42119479179382324  max:0.497093141078949
[*] Reading checkpoints...
[*] Read SEGAN-19440
[*] Load SUCCESS
Cleaning chunk 0 -> 16384
gen wave, max:  [0.96146643] min:  [-0.9862874]
inp wave, max:  0.497093141078949 min:  -0.42119479179382324
canvas w shape:  (16384, 1)
Cleaning chunk 16384 -> 32768
gen wave, max:  [0.9773201] min:  [-0.9757471]
inp wave, max:  0.3213702440261841 min:  -0.2770885229110718
canvas w shape:  (16384, 1)
Cleaning chunk 32768 -> 36480
gen wave, max:  [0.99999225] min:  [-0.9999961]
inp wave, max:  0.04255741834640503 min:  -0.041153550148010254
canvas w shape:  (16384, 1)

The generated wav sounds even noisier than before, and the speech segments are extremely loud and distorted. I have no idea why this is happening. I need some help, please.