This repo is meant to contain the transcript data from Harstem's video series "Is it imba or do I suck?" It was created for research and practice purposes; I am currently using it to practice implementing BM25 and semantic search from scratch.
I also include Python scripts for data collection and processing. Ideally, there should be a maintained way of updating the data as new videos come in. Fortunately, the available compute comfortably outpaces Harstem's upload rate, so I can simply check the playlist against the metadata file to find any new videos.
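
The check against the metadata file could look roughly like the sketch below. It is only an illustration under assumptions: the metadata file is taken to be a JSON list of records with a `video_id` key, and the function names `load_known_ids` and `find_new_ids` are hypothetical, not the names used in this repo's scripts.

```python
import json
from pathlib import Path


def load_known_ids(metadata_path):
    """Return the set of video ids already recorded in the metadata file.

    Assumes (hypothetically) a JSON list of records, each with a "video_id" key.
    A missing file is treated as "no videos recorded yet".
    """
    path = Path(metadata_path)
    if not path.exists():
        return set()
    records = json.loads(path.read_text(encoding="utf-8"))
    return {record["video_id"] for record in records}


def find_new_ids(playlist_ids, known_ids):
    """Return playlist ids that are not yet recorded, preserving playlist order."""
    seen = set(known_ids)
    new_ids = []
    for video_id in playlist_ids:
        if video_id not in seen:
            seen.add(video_id)  # also guards against duplicates in the playlist
            new_ids.append(video_id)
    return new_ids
```

Because the lookup is a set membership test, the cost grows with the playlist length only, which should stay cheap for a playlist of this size.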
Jun 10, 2024
- Parse the playlist for any new videos
  - This requires a record of existing video IDs, which is kept in the Supabase DB
- obtain info of the new videos and add record to supabase db
- download audio of the new videos
- transcribe audio of the new videos
- chunk transcript, add text to supabase db
- obtain embeddings for chunks and add embeddings to supabase db
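
The chunking step in the list above could be sketched as follows. This is a minimal example assuming fixed-size word windows with overlap; the actual chunking strategy, the function name `chunk_transcript`, and the default sizes are assumptions made for illustration.

```python
def chunk_transcript(text, chunk_size=200, overlap=50):
    """Split a transcript into overlapping chunks of whitespace-separated words.

    chunk_size and overlap are counted in words; the defaults are illustrative.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the transcript
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which may help both BM25 and embedding-based retrieval. Each returned chunk would then be stored as text and passed to the embedding step.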
git submodule update --init --recursive