Mobile app using machine learning to help language learners correct their accent
- download multiple audiobooks
- run the audiobooks through a speech-to-text algorithm that gives time labels for each word
- train an individual ML model for each word on all of that word's recordings
- postgres table with [ word ] | [ path to model for word ]
- Do we even need a postgres table for word -> path to model? Can't we just create a directory structure like /models/{word}/model instead? (see the save/load sketch below)
- Instead of training ML, can we take a diff of the audio and compare the diffs? (similar to how Shazam works)
- A preliminary version can use audio fingerprinting
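A minimal sketch of the directory-layout idea, assuming per-word models are serialized with joblib; `MODELS_DIR`, `save_word_model`, and `load_word_model` are made-up names:

```python
from pathlib import Path

import joblib  # assumed serialization choice; anything pickle-compatible works

MODELS_DIR = Path("/models")  # root of the /models/{word}/model layout

def save_word_model(word: str, model) -> Path:
    """Persist a trained per-word model under /models/{word}/model."""
    path = MODELS_DIR / word / "model"
    path.parent.mkdir(parents=True, exist_ok=True)
    joblib.dump(model, path)
    return path

def load_word_model(word: str):
    """Look a model up by word alone -- no postgres table needed."""
    path = MODELS_DIR / word / "model"
    if not path.exists():
        raise KeyError(f"no model trained for word {word!r}")
    return joblib.load(path)
```

The filesystem already gives a unique word -> path mapping, so the postgres table only earns its keep if we later need per-word metadata (training-set size, model version, etc.).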
- Could use the Call API to record outside of the app
- Person says a word
- App converts the speech to text and POSTs the word and its audio to the server
- Server finds the word in the database and loads the ML model for it
- Server runs the POSTed audio through the ML model
- Server responds with whether the word was pronounced well or badly (sketched below)
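A sketch of that flow as a Flask endpoint, reusing `load_word_model` from the sketch above. The `/check` route, the MFCC features, and the 0.5 threshold are all assumptions; "good or bad" is reduced to a boolean plus the model's score:

```python
import tempfile

import librosa
from flask import Flask, jsonify, request

app = Flask(__name__)

SCORE_THRESHOLD = 0.5  # hypothetical cutoff between "good" and "bad"

def extract_features(wav_path: str):
    """One fixed-length vector per clip: mean MFCCs. An assumption --
    whatever the per-word models were actually trained on goes here."""
    samples, sr = librosa.load(wav_path, sr=16000)
    return librosa.feature.mfcc(y=samples, sr=sr, n_mfcc=13).mean(axis=1)

@app.post("/check")
def check_pronunciation():
    word = request.form["word"]     # the word as recognized on-device
    audio = request.files["audio"]  # the user's recording of that word

    # stash the upload so the feature extractor can read it from disk
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        audio.save(tmp.name)
        features = extract_features(tmp.name)

    model = load_word_model(word)  # from the directory sketch above
    score = float(model.predict_proba([features])[0][1])  # assumes a binary classifier
    return jsonify({"word": word, "good": score >= SCORE_THRESHOLD, "score": score})
```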
- App records file
- App sends file to server
- Server receives the file and sends it to the gcloud API
- Server gets timestamps for each word and splices the recording into per-word files named after the STT results
- Server passes each word file's path to the fingerprinter
- Fingerprinter fingerprints the word file and the official word file, then diffs the two fingerprints
- Server responds with JSON of the form {word: diffVal, word2: diffVal2} (see the sketches below)
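For the splice step, a sketch assuming Google Cloud Speech-to-Text is the "gcloud api" above, with pydub doing the cutting; the /tmp/words output directory is made up:

```python
from pathlib import Path

from google.cloud import speech
from pydub import AudioSegment

def splice_into_word_files(wav_path: str, out_dir: str = "/tmp/words"):
    """Transcribe the upload, then cut one file per recognized word,
    named after the STT result."""
    client = speech.SpeechClient()
    config = speech.RecognitionConfig(
        language_code="en-US",
        enable_word_time_offsets=True,  # this is what gives per-word timestamps
    )
    audio = speech.RecognitionAudio(content=Path(wav_path).read_bytes())
    # recognize() handles clips under ~1 minute; longer uploads need
    # long_running_recognize() instead
    response = client.recognize(config=config, audio=audio)

    recording = AudioSegment.from_wav(wav_path)
    word_files = {}
    for result in response.results:
        for info in result.alternatives[0].words:
            start_ms = int(info.start_time.total_seconds() * 1000)
            end_ms = int(info.end_time.total_seconds() * 1000)
            path = Path(out_dir) / f"{info.word}.wav"
            path.parent.mkdir(parents=True, exist_ok=True)
            recording[start_ms:end_ms].export(str(path), format="wav")  # pydub slices in ms
            word_files[info.word] = path
    return word_files
```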
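And a crude stand-in for the fingerprint-and-diff step. A real build would use a proper fingerprinter (Chromaprint, or whatever the Shazam-style approach settles on); the spectrum-based `fingerprint` below only illustrates the shape of the comparison, and the /words/{word}.wav reference clips (cut from the audiobooks ahead of time) are an assumption:

```python
import numpy as np
from pydub import AudioSegment

def fingerprint(wav_path: str, bands: int = 32):
    """Crude stand-in for a real fingerprinter: a coarse,
    length-normalized magnitude spectrum of the clip."""
    clip = AudioSegment.from_wav(wav_path).set_channels(1)
    samples = np.array(clip.get_array_of_samples(), dtype=np.float64)
    spectrum = np.abs(np.fft.rfft(samples))
    # pool into a fixed number of bands so clips of different lengths compare
    pooled = np.array([band.mean() for band in np.array_split(spectrum, bands)])
    return pooled / (np.linalg.norm(pooled) or 1.0)

def diff_words(word_files: dict) -> dict:
    """Build the {word: diffVal} response; 0.0 means the user's clip
    fingerprints identically to the official one."""
    diffs = {}
    for word, path in word_files.items():
        official = f"/words/{word}.wav"  # hypothetical per-word reference clip
        diffs[word] = float(np.abs(fingerprint(str(path)) - fingerprint(official)).mean())
    return diffs
```

jsonify-ing the returned dict gives exactly the `{word: diffVal, word2: diffVal2}` shape from the last bullet.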