Is there any way to yield from MMS asynchronously?
collinarnett opened this issue · 1 comments
collinarnett commented
I'm currently using a language model with MMS, and generations are slow on the instance we're running. To alleviate this on the front end, we need to return each token as soon as it is generated instead of returning the full sequence at once. This way the user gets immediate feedback on their generation rather than waiting for the full sequence to be returned.
Is there any way to accomplish this natively in MMS?
collinarnett commented
I think the solution we'll go for is to use AWS Kinesis during inference to stream tokens as they are generated, until MMS supports this natively.
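A rough sketch of what that workaround could look like, not a tested MMS integration. It assumes the model exposes tokens through a Python generator (`fake_generate` below is a hypothetical stand-in), and `stream_tokens`, the stream name, and the session id are made-up names for illustration. The sink is passed in as a callable with the same keyword arguments as boto3's Kinesis `put_record` (`StreamName`, `Data`, `PartitionKey`), so the same function can be exercised locally without AWS.

```python
from typing import Callable, Iterable, Iterator


def fake_generate(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a model that produces tokens one at a time."""
    for token in prompt.split():
        yield token + " "


def stream_tokens(
    tokens: Iterable[str],
    put_record: Callable[..., object],
    stream_name: str,
    session_id: str,
) -> str:
    """Forward each token to the sink as soon as it is produced.

    Returns the full text so the handler can still send a normal
    response once generation finishes.
    """
    pieces = []
    for token in tokens:
        # One record per token; using the session id as the partition key
        # keeps a single generation's tokens ordered within one shard.
        put_record(
            StreamName=stream_name,
            Data=token.encode("utf-8"),
            PartitionKey=session_id,
        )
        pieces.append(token)
    return "".join(pieces)


if __name__ == "__main__":
    # Local demo: print records instead of sending them to Kinesis.
    # With AWS credentials configured, the sink would instead be
    # boto3.client("kinesis").put_record (assumption: the stream exists).
    def print_record(**record):
        print(record)

    full_text = stream_tokens(
        fake_generate("tokens arrive one by one"),
        print_record,
        stream_name="generation-tokens",  # hypothetical stream name
        session_id="session-123",  # hypothetical session id
    )
    print(full_text)
```

The front end would then read from the stream on its own connection, which is what lets the user see tokens before the inference request itself returns.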