When to stop in the LLMEval?
MatthewWaller opened this issue · 12 comments
In the LLMEval project, the generation stops after reaching a limit on tokens. Is there a way to configure stopping when it finds a special token? I tried to look for the Phi 3's end token but it seems to go off the rails earlier than when <|end|> or <|endoftext|> appear. Thoughts?
It should stop at the end-of-sequence (EOS) token id: https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Evaluate.swift#L199
The fact that it's not stopping likely means it doesn't have the right EOS token ID set. Which model did you try?
@awni I was working with the Phi-3 4-bit model.
Looks like this is the eos token for that model: https://huggingface.co/mlx-community/Phi-3-mini-4k-instruct-4bit-no-q-embed/blob/main/tokenizer_config.json#L340. We'll need to check to make sure the IDs match / the tokenizer is reading it correctly.
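A quick way to check (a sketch, assuming the tokenizer is the swift-transformers one loaded for that model) is to print the special-token ids that the stopping logic compares against:

// Sketch: print the ids the stopping check uses, so they can be compared against
// tokenizer_config.json (eos_token "<|endoftext|>" should map to 32000 for this model).
print("eosToken:", tokenizer.eosToken ?? "nil", "id:", tokenizer.eosTokenId ?? -1)
print("unknownToken:", tokenizer.unknownToken ?? "nil", "id:", tokenizer.unknownTokenId ?? -1)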
Specifically the code is looking for either the unknown token or the eos token:
if t == tokenizer.unknownTokenId || t == tokenizer.eosTokenId {
https://github.com/ml-explore/mlx-swift-examples/blob/main/Libraries/LLM/Evaluate.swift#L199
The didGenerate block that is passed in can also return .stop if you are implementing this yourself.
Alright, well unknownTokenId is 0 and eosTokenId is 32000, which I believe is correct, and it matches "eos_token": "<|endoftext|>", from HuggingFace. I can see in the debugger that the eosToken is <|endoftext|>. The model just never seems to produce that token. Hmmm. For instance, I can tell phi3 to "Write 3 words" and on HuggingFace chat, it appropriately stops. So I'm guessing it's producing that token for them. It just never shows up in the output I'm getting.
It may be related to this: huggingface/swift-transformers#92 -- we are not passing in a proper prompt and the generation may be impacted.
That issue is a bit terse but basically the extra tokens are not being honored when tokenizing.
Oh dang, yeah, I see that now, I pass in "<|user|>\nWrite 2 words<|end|>\n<|assistant|>\n" after preparePrompt, and that should be 9 tokens or so. But it's encoded as 24 tokens!
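A rough sketch of how to see this, assuming the swift-transformers encode(text:) and convertIdToToken APIs:

// Sketch: encode the chat prompt and inspect the pieces. If <|user|>, <|end|> and
// <|assistant|> come back split across several ordinary tokens instead of single
// special-token ids, the added tokens are not being honored by the tokenizer.
let prompt = "<|user|>\nWrite 2 words<|end|>\n<|assistant|>\n"
let promptIds = tokenizer.encode(text: prompt)
print(promptIds.count)                                  // ~9 expected, 24 observed here
print(promptIds.map { tokenizer.convertIdToToken($0) }) // shows how the prompt was split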
I saw that huggingface/swift-transformers#92 has been closed and special tokens should now be accounted for. I'm still running into issues with the model returning the '<|end|>' token when the assistant is done. Has anyone found a more manual solution for getting the correct phi-3 response?
I made a little project where I directly looked for that token (32001) and returned .stop if I found it, in the LLMEvaluator. Once I did that, and got the correct tokens in preparePrompt, everything worked correctly.
Gotcha, so something similar to:
let result = await MLXLLM.generate(
    promptTokens: promptTokens, parameters: generateParameters, model: model,
    tokenizer: tokenizer
) { tokens in
    let endGen = tokens.contains(32001)

    // update the output -- this will make the view show the text as it generates
    if tokens.count % displayEveryNTokens == 0 {
        let text = tokenizer.decode(tokens: tokens)
        await MainActor.run {
            self.output = text
        }
    }

    if tokens.count >= maxTokens || endGen {
        return .stop
    } else {
        return .more
    }
}
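And rather than hard-coding 32001, the id could presumably be looked up from the tokenizer first (a sketch, assuming the swift-transformers convertTokenToId API) and then checked with tokens.contains(endTokenId) in the callback:

// Sketch: resolve the <|end|> id from the tokenizer instead of hard-coding 32001.
let endTokenId = tokenizer.convertTokenToId("<|end|>") ?? 32001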
Exactly, and heads up that there is a little bug you may run into at the end, below that bit. I had to change it to
// update the text if needed, e.g. we haven't displayed because of displayEveryNTokens
var validTokens = Array(result.tokens.prefix(while: { $0 != 32001 }))
validTokens.removeLast()
let text = tokenizer.decode(tokens: validTokens)
await MainActor.run {
    if result.output != self.output {
        self.output = text
    }
    running = false
    self.stat = " Tokens/second: \(String(format: "%.3f", result.tokensPerSecond))"
}
Because you can still get the <|end|> token and more in there when it does that final bit of output.
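For that final trim, a slightly more general sketch that cuts at whichever stop token shows up first (assuming 32000 and 32001 cover phi-3's stop tokens):

// Sketch: cut the final output at the first stop token, covering both
// <|endoftext|> (32000) and <|end|> (32001).
let stopIds: Set<Int> = [32000, 32001]
let visibleTokens = Array(result.tokens.prefix(while: { !stopIds.contains($0) }))
let finalText = tokenizer.decode(tokens: visibleTokens)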
Closing now that the main issue has been resolved in swift-transformers.