Punctuation model accuracy
Post here any feedback about the new punctuation model, experiences using it, comparisons to the old v0 model, etc
Known issues:
- Dollar sign is extremely unreliable, may need to remove it or move dollar sign to the end of a number in training data
- `...` ellipsis becomes `..`, need to make `...` its own token
- `1999` becomes `199`, may need to make `99` and other double numbers their own tokens
- Model can lose count of how many `,000` it has added, may need to make `,000` its own token (see the sketch after this list)
- Numbers are not very robust, probably need to train on a lot of number data
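A minimal sketch of what registering those strings as single tokens could look like, assuming a Hugging Face-style tokenizer; the base checkpoint and the exact token list are placeholders, and the actual punctuation model's tokenizer may work differently:

```python
# Sketch: register the problem strings as single tokens so BPE can't split
# them mid-sequence. Assumes a Hugging Face-style tokenizer; the checkpoint
# below is only a placeholder, not the model discussed in this issue.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Candidate single tokens taken from the issues above.
extra_tokens = ["...", ",000", "99"]
num_added = tokenizer.add_tokens(extra_tokens)
print(f"Added {num_added} tokens; new vocab size: {len(tokenizer)}")

# Before fine-tuning, the embedding matrix would need resizing to match:
# model.resize_token_embeddings(len(tokenizer))
```

Whether that actually fixes the counting behavior would still need checking against the number-heavy training data mentioned above.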
I don't know why, but this seems to want to add the word "dude" at the end of some of my sentences... I am in the US Southeast and do have somewhat of an accent, but I don't say "dude" at the end of my sentences (although it is kinda funny).
Would it make sense to re-do byte-pair encoding for the model on new data to figure out what tokens to add?
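For what it's worth, a rough sketch of what that could look like with the `tokenizers` library; `corpus.txt` and the vocab size are placeholders, and this assumes retraining BPE from scratch rather than whatever pipeline the model actually uses:

```python
# Sketch: retrain BPE on new data so the merge table reflects that corpus,
# then inspect which merged tokens it produces (e.g. "...", ",000").
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)  # corpus.txt is a placeholder

vocab = tokenizer.get_vocab()
print(sorted(vocab, key=vocab.get)[:50])  # lowest-id tokens first
```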
> Post here any feedback about the new punctuation model, experiences using it, comparisons to the old v0 model, etc
Note: I have not used the old model, only the new one.
Absolutely fantastic. I am now routinely using this model on calls and meetings. It's not quite as good as Google's captioning model for YouTube (though I think my comparison is biased in that YouTubers are using professional mics and speak clearly--the ones I watch, anyway--and people in meetings are using garbage-quality laptop mics and talk while vaping, etc.), but it's considerably better than Google Meet's auto-captioning.
The model and the application could be obviously improved (I'll file some issues for the latter), but this is a major lifesaver for someone who occasionally has severe sensory processing issues due to fibromyalgia.
Repeated words usually end up with the wrong count, e.g. 4 `blah`s spoken appearing as 5 `blah`s in the transcript.