Subtitle repositories

This list isn't exhaustive, but it's a start.

Roadblocks

  • Accessing paired translations
    • soln: crawl subtitle sites and pull down the En and Jp subs for matched films/TV shows
  • Poor translations, e.g. I've seen subs that were generated by running another language's subs through Google Translate
    • soln: run a language model over each movie/show's corpus. If the average sentence quality is below some threshold, throw it out
  • En/Jp subtitle mismatch. Sometimes the srt files don't have the same number of entries, and the entries don't correspond to the same times.
    • soln: sentence alignment model. Run an encoder over the En/Jp srt files and pair up nearby sentences with similar thought vectors
  • Romaji transliterations
    • soln: throw out
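
The filtering and alignment ideas above could be sketched roughly as follows. This is a minimal illustration, not a tested pipeline: `score` (a per-sentence language-model quality score) and `embed` (a sentence encoder that maps En and Jp text into the same vector space) are hypothetical callables you would have to supply, and the `window` and `min_sim` defaults are arbitrary placeholders. Cues are assumed to be (start_seconds, end_seconds, text) tuples.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)

# Hiragana, katakana, and the common CJK ideograph block.
_JAPANESE = re.compile(r"[\u3040-\u30ff\u4e00-\u9fff]")


def looks_like_romaji(text: str) -> bool:
    """True if the line has Latin letters but no Japanese script at all."""
    return bool(re.search(r"[A-Za-z]", text)) and not _JAPANESE.search(text)


def keep_title(lines: Sequence[str],
               score: Callable[[str], float],
               threshold: float) -> bool:
    """Keep a movie/show only if its mean sentence score reaches the threshold.

    `score` is an assumed stand-in for whatever language model is used.
    """
    if not lines:
        return False
    return sum(score(line) for line in lines) / len(lines) >= threshold


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def align(en: Sequence[Cue],
          jp: Sequence[Cue],
          embed: Callable[[str], Sequence[float]],
          window: float = 2.0,
          min_sim: float = 0.5) -> List[Tuple[str, str]]:
    """Pair each En cue with the most similar Jp cue that starts nearby.

    Jp cues that start more than `window` seconds away, or that look like
    romaji, are skipped. En cues with no match above `min_sim` are dropped.
    """
    pairs = []
    for en_start, _, en_text in en:
        en_vec = embed(en_text)
        best_text, best_sim = None, min_sim
        for jp_start, _, jp_text in jp:
            if abs(jp_start - en_start) > window or looks_like_romaji(jp_text):
                continue
            sim = cosine(en_vec, embed(jp_text))
            if sim >= best_sim:
                best_text, best_sim = jp_text, sim
        if best_text is not None:
            pairs.append((en_text, best_text))
    return pairs
```

The greedy nearest-in-time search is the simplest thing that matches the description. It lets two En cues pick the same Jp cue, so a real alignment model would probably want a one-to-one constraint on top.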

File formats

  • tmx
  • ass
  • srt
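
Of these, srt is the simplest to read by hand: each entry is an index line, a timing line such as `00:00:01,000 --> 00:00:02,500`, one or more text lines, and a blank line. A minimal parser sketch, assuming well-formed input and ignoring any styling tags, might look like this (the name `parse_srt` and the tuple layout are illustrative choices, not an existing library's API):

```python
import re
from typing import List, Tuple

Cue = Tuple[float, float, str]  # (start_seconds, end_seconds, text)

# Matches "HH:MM:SS,mmm --> HH:MM:SS,mmm"; some files use "." instead of ",".
_TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def _seconds(h: str, m: str, s: str, ms: str) -> float:
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text: str) -> List[Cue]:
    """Split srt text into cues, joining multi-line cue text with spaces."""
    cues = []
    normalized = text.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", normalized):
        lines = block.split("\n")
        for i, line in enumerate(lines):
            match = _TIMING.search(line)
            if match:
                g = match.groups()
                body = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((_seconds(*g[:4]), _seconds(*g[4:]), body))
                break  # one timing line per block
    return cues
```

Blocks without a timing line are silently skipped, which is a deliberate choice for noisy crawled files; ass and tmx need their own parsers and are not covered here.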