Description

This repo should contain the transcript data from Harstem's video series "Is it imba or do I suck?" It was created for research and practice purposes: I am currently practicing implementing BM25 and semantic search from scratch.

I also include Python scripts for data collection and processing. Ideally, I should have some maintained way of updating the data as new videos come in. Fortunately, compute grows faster than Harstem's upload rate, so I can simply search against the metadata file to check for any new videos.
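As a rough illustration of the "search against the metadata file" check, here is a minimal sketch. The function name `find_new_video_ids`, the JSON layout (a list of objects with a `video_id` key), and the file path are all assumptions made for this example, not the repo's actual format.

```python
import json
from pathlib import Path


def find_new_video_ids(playlist_ids, metadata_path):
    """Return the playlist ids not yet recorded in the metadata file.

    Hypothetical helper: assumes the metadata file is a JSON list of
    objects that each carry a "video_id" key.
    """
    path = Path(metadata_path)
    known = set()
    if path.exists():
        records = json.loads(path.read_text(encoding="utf-8"))
        known = {record["video_id"] for record in records}
    # Keep playlist order so new videos are processed oldest-to-newest
    # (or however the playlist happens to be ordered).
    return [video_id for video_id in playlist_ids if video_id not in known]
```

A linear scan like this is cheap as long as the series has a few hundred videos, which is why no index or incremental bookkeeping is sketched here.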

Current data cutoff

Jun 10, 2024

Data pipeline

1. Parse the playlist for any new videos
   1. This requires a record of existing video ids: done in the Supabase db
2. Obtain info of the new videos and add a record to the Supabase db
3. Download audio of the new videos
4. Transcribe audio of the new videos
5. Chunk the transcript and add the text to the Supabase db
6. Obtain embeddings for the chunks and add the embeddings to the Supabase db
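The chunking step in the pipeline above could look something like the sketch below. It splits on whitespace and uses fixed-size overlapping word windows. The function name `chunk_transcript`, the window size, and the overlap are illustrative assumptions, and the repo's scripts may chunk differently (for example by sentence or by timestamp).

```python
def chunk_transcript(text, chunk_size=200, overlap=50):
    """Split a transcript into overlapping chunks of whitespace-separated words.

    Hypothetical helper: chunk_size and overlap are counted in words and
    the defaults are arbitrary placeholders.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window reaches the end, so the tail is not
        # emitted again as a smaller, fully redundant chunk.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing and embedding some text twice.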

Cloning the mlx-parallm submodule run by Will Brown

git submodule update --init --recursive

yxzwayne/iodis-data-rsch
