Visual_Aware_E2E_ST

This repo accompanies the GSoC 2024 proposal for the project Visual Aware E2E Speech Transcription, listed by Red Hen Lab.

Primary language: Jupyter Notebook. License: MIT.

Introduction

  • This repo accompanies my GSoC 2024 project proposal and contains some helper code that I wrote for the proposed project.
  • Please download the relevant AudioSet data from the link.
  • Download samples of the People's Speech Dataset from this link.
  • Note that you can also use utils.ipynb as a script to download these datasets.
  • Other tools that I have tried for the proposal include yt-dlp, shot_detection, and video_llava.
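
As a rough illustration of how the AudioSet download step could be combined with yt-dlp, here is a minimal sketch. It is not the code in utils.ipynb. It assumes the public AudioSet segment CSV layout (comment lines starting with `#`, then `YTID, start_seconds, end_seconds, positive_labels`), and the function names `parse_audioset_csv` and `build_ytdlp_command`, as well as the `audioset_clips` output directory, are hypothetical names chosen for this example.

```python
import csv
import io


def parse_audioset_csv(text):
    """Parse AudioSet-style segment CSV text into (ytid, start, end, labels) tuples.

    Assumes the layout: YTID, start_seconds, end_seconds, "label1,label2".
    """
    rows = []
    # Skip comment/header lines, which start with '#'.
    lines = (ln for ln in io.StringIO(text) if not ln.startswith("#"))
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) < 4:
            continue  # ignore blank or malformed lines
        ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
        rows.append((ytid, start, end, labels.split(",")))
    return rows


def build_ytdlp_command(ytid, start, end, out_dir="audioset_clips"):
    """Build (but do not run) a yt-dlp command that extracts one audio segment."""
    url = f"https://www.youtube.com/watch?v={ytid}"
    return [
        "yt-dlp",
        "--download-sections", f"*{start}-{end}",  # only fetch this time range
        "-x", "--audio-format", "wav",             # extract audio as WAV
        "-o", f"{out_dir}/{ytid}_{int(start)}.%(ext)s",
        url,
    ]
```

To actually download a clip, the resulting list can be passed to `subprocess.run(cmd, check=True)`, provided yt-dlp and ffmpeg are installed. Building the command separately from running it makes it easy to inspect or log what will be executed first.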