averak/MelBank

How to Use


Good Afternoon,

I recently encountered this project and found it very interesting. I was wondering whether it is possible to run it on recorded files, or does it currently only work with streaming voices?

Alternatively, is it possible to use the models directly?

Thank you!

Good Afternoon,

Thank you for visiting my project!

This project allows you to:

  • run on recorded files
  • work with streaming voices

On my machine, it takes 0.38 seconds to separate 1 second of audio data, so on most machines it should be possible to separate streaming audio in real time.
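To make the reasoning concrete: 0.38 seconds of processing per 1 second of audio is a real-time factor of 0.38, and as long as that factor stays below 1.0 a live stream can be separated chunk by chunk without falling behind. A minimal illustration of that check (the chunk length is a made-up example; the actual scripts may buffer differently):

# Real-time feasibility check (illustrative only; not part of the repository)
processing_time_per_second = 0.38   # seconds of compute per 1 second of audio (my machine)
real_time_factor = processing_time_per_second / 1.0

chunk_duration = 0.5                # hypothetical chunk length in seconds
compute_per_chunk = chunk_duration * real_time_factor

print(f"real-time factor: {real_time_factor:.2f}")
print(f"a {chunk_duration:.1f} s chunk takes about {compute_per_chunk:.2f} s to separate")
assert real_time_factor < 1.0, "separation would fall behind the stream"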

Usage

Separating recorded files (.wav)

# separating ./tmp/source.wav
$ python separator.py
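If you want to experiment outside of separator.py, a generic starting point is to compute a mel spectrogram from the recording and feed its frames to a separation model. The following is only a minimal librosa sketch of that idea, not the project's own code; the path matches the default above, but the sample rate and mel parameters are assumptions:

# Preprocessing sketch (illustrative; not the project's actual code)
import librosa

path = "./tmp/source.wav"                       # the same file separator.py reads
y, sr = librosa.load(path, sr=16000)            # sample rate is an assumption
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel)              # log-scaled mel spectrogram

print("waveform samples:", y.shape[0])
print("mel spectrogram shape (bands, frames):", log_mel.shape)
# a separation model takes frames like these and predicts a mask or a clean spectrogram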

Separating streaming voices

You need earphones to do this.

$ python spectrum_analyzer.py
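spectrum_analyzer.py captures the microphone with PyAudio, so if the stream cuts out or PyAudio raises an error at startup, it is worth checking the input device and sample rate on their own first. This is a minimal standalone capture check, not the project's code; the device settings here are assumptions, not the values spectrum_analyzer.py uses:

# Standalone microphone capture check (illustrative; settings are assumptions)
import pyaudio

RATE = 16000          # sample rate; adjust to what your sound card supports
CHUNK = 1024          # frames per buffer

p = pyaudio.PyAudio()

# list input devices to find a usable index if the default one cuts out
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    if info["maxInputChannels"] > 0:
        print(i, info["name"], int(info["defaultSampleRate"]))

stream = p.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                input=True, frames_per_buffer=CHUNK)

for _ in range(int(RATE / CHUNK * 3)):          # read about 3 seconds
    stream.read(CHUNK, exception_on_overflow=False)

stream.stop_stream()
stream.close()
p.terminate()
print("captured about 3 seconds without the stream cutting out")

If this check also fails, the problem is in the system audio configuration rather than in this project.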

Source separation is strongly influenced by acoustic context.
In this project, my own voice, recorded with my own microphone, is used as the teacher (training) data. Therefore, accuracy may be lower with other people's voices and on other machines.
If you want to increase accuracy, you need to create teacher data in your own environment.

If you have any other questions, please feel free to ask.

Thank you!

Thank you for the swift response!

As you noted, it doesn't seem to work as well on my own voice.

This solution requires training data and will not run without being trained or having some model. Is that correct? If so, approximately how many hours of training data do you think would be appropriate for it to identify someone else's voice?

Also, I'm having some trouble with python spectrum_analyzer.py. It crashes with a PyAudio error, which may be due to my microphone configuration, since the stream appears to cut out soon after starting. Did you have to adjust any of your sound settings?

Thank you!

This model is optimized for separating the Japanese that I speak.
Japanese has 24 phonemes, while English has 44.

If you want to separate English sound sources, you should create teacher data that covers those phonemes evenly.
I can't say for certain because I haven't created any English training data, but I think 10 minutes is enough.

How to make training data

  1. Record audio
$ make teach.record
新規話者:  # "New speaker" prompt: enter the label name of the audio to be recorded
# Label name: "target" or "other"
  2. Build audio data
$ make teach.build
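As a rough check of whether your recordings reach the 10 minutes suggested above, you can sum the durations of the recorded .wav files. This is a minimal standard-library sketch; the directory path is hypothetical, so point it at wherever make teach.record stores its output:

# Sum the duration of recorded .wav files (illustrative; path is hypothetical)
import glob
import wave

total = 0.0
for path in glob.glob("./tmp/teach/**/*.wav", recursive=True):  # hypothetical location
    with wave.open(path, "rb") as wf:
        total += wf.getnframes() / wf.getframerate()

print(f"total recorded audio: {total / 60:.1f} minutes")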

This project is currently under development and plans include a lighter model, refactoring, and a new sound source separation approach.

Bug reports and pull requests for this project are welcome!