kadirnar/whisper-plus

[CONTRIBUTION] Speech Dataset Generator

davidmartinrius opened this issue · 6 comments

Hi everyone!

My name is David Martin Rius and I have just published this project on GitHub: https://github.com/davidmartinrius/speech-dataset-generator/

Now you can create datasets automatically from any audio file or list of audio files.

I hope you find it useful.

Here are the key functionalities of the project:

  1. Dataset Generation: Creation of multilingual datasets with Mean Opinion Score (MOS).

  2. Silence Removal: It includes a feature to remove silences from audio files, enhancing the overall quality.

  3. Sound Quality Improvement: It improves the quality of the audio when needed.

  4. Audio Segmentation: It can segment audio files within specified second ranges.

  5. Transcription: The project transcribes the segmented audio, providing a textual representation.

  6. Gender Identification: It identifies the gender of each speaker in the audio.

  7. Pyannote Embeddings: Utilizes pyannote embeddings for speaker detection across multiple audio files.

  8. Automatic Speaker Naming: Automatically assigns names to speakers detected in multiple audios.

  9. Multiple Speaker Detection: Capable of detecting multiple speakers within each audio file.

  10. Speaker Embedding Storage: Detected speaker embeddings are stored in a Chroma database, so you do not need to assign speaker names manually.

  11. Speech-Rate Metrics: Computes syllable-based and words-per-minute metrics.
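To illustrate point 4, here is a minimal sketch of how segmenting audio "within specified second ranges" can be planned. The function name and behavior are my own illustration, not the project's actual implementation: it greedily cuts a recording into chunks no longer than the maximum length and drops a trailing remainder shorter than the minimum.

```python
def plan_segments(total_seconds, min_len, max_len):
    """Plan (start, end) cut points, in seconds, for one recording.

    Chunks are at most max_len seconds long; a final remainder
    shorter than min_len is discarded rather than emitted.
    """
    segments = []
    start = 0.0
    while total_seconds - start >= min_len:
        end = min(start + max_len, total_seconds)
        segments.append((start, end))
        start = end
    return segments

# A 37 s recording with a (5, 15) range yields three usable chunks;
# with a 32 s recording the trailing 2 s remainder is dropped.
print(plan_segments(37, 5, 15))
print(plan_segments(32, 5, 15))
```

The actual tool operates on audio samples rather than bare durations, but the range logic is the same idea.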

Feel free to explore the project at https://github.com/davidmartinrius/speech-dataset-generator

Would you like to add it to the WhisperPlus library? I would like to contribute this feature myself, but I don't have the time.

Absolutely!

I think I can upload the speech dataset generator to PyPI, so it could be installed from requirements.txt or setup.py.

Then we could do something like:

from whisperplus import WhisperSpeechDatasetGeneratorPipeline

speech_dataset_generator = WhisperSpeechDatasetGeneratorPipeline(model_id="openai/whisper-large-v3")
speech_dataset_generator(
    video_path="test.mp4",
    output_path="./speech_dataset",
    audio_enhancers=["deepfilternet"],
    datasets=["ljspeech", "librispeech"],
    range_times=(5, 15),
)

What do you think about this idea? How would you do an integration?

Thank you!

Is there a pip package for this library?

Next week, I plan to upload it to PyPI. Keep in mind that this project is only 2 weeks old, and I'm actively releasing new features. One of the next steps is to make the project available on PyPI.


Is there an update?

Yes: unfortunately, it is not currently possible because some dependencies are not available on PyPI. I could vendor those dependencies into this project, but I don't have enough time for that.
For now, you can install the project from requirements.txt or setup.py, as explained in the README.md.
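Until the PyPI package exists, a source install along the lines the README describes would look like this (standard pip workflow, shown here as a sketch):

```shell
# Clone the repository and install its dependencies from source.
git clone https://github.com/davidmartinrius/speech-dataset-generator.git
cd speech-dataset-generator
pip install -r requirements.txt   # or: pip install .
```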