Better inference for M1 (text-to-semantic) model
rasenganai opened this issue · 0 comments
M1 is a decoder-only model built on GPT, so we can leverage existing work on LLM inference to speed up its outputs.
KV caching and an efficient attention implementation would be a good starting point.