Strange warnings with tokenizer for some models
EricLBuehler opened this issue · 3 comments
Hello all,
Thank you for your excellent work here! We are using Tokenizer::from_file to load the tokenizer.json file from the HF Hub. However, it produces many warnings when loading the Phi3 tokenizer:
2024-05-09T12:11:56.647710Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|endoftext|>' was expected to have ID '32000' but was given ID 'None'
2024-05-09T12:11:56.647734Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|assistant|>' was expected to have ID '32001' but was given ID 'None'
2024-05-09T12:11:56.647737Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder1|>' was expected to have ID '32002' but was given ID 'None'
2024-05-09T12:11:56.647739Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder2|>' was expected to have ID '32003' but was given ID 'None'
2024-05-09T12:11:56.647742Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder3|>' was expected to have ID '32004' but was given ID 'None'
2024-05-09T12:11:56.647744Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder4|>' was expected to have ID '32005' but was given ID 'None'
2024-05-09T12:11:56.647746Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|system|>' was expected to have ID '32006' but was given ID 'None'
2024-05-09T12:11:56.647748Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|end|>' was expected to have ID '32007' but was given ID 'None'
2024-05-09T12:11:56.647750Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder5|>' was expected to have ID '32008' but was given ID 'None'
2024-05-09T12:11:56.647752Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder6|>' was expected to have ID '32009' but was given ID 'None'
2024-05-09T12:11:56.647760Z WARN tokenizers::tokenizer::serialization: Warning: Token '<|user|>' was expected to have ID '32010' but was given ID 'None'
I have also noticed this for Phi2 and Llama3, although I see no tokenization errors in the encoded or decoded output.
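For reference, a minimal sketch of how we hit this (assumption: the hf-hub crate is used here for the download; the repo name and prompt are only examples):

use hf_hub::api::sync::Api;
use tokenizers::Tokenizer;

fn main() -> anyhow::Result<()> {
    // Example repo; any local copy of the Phi3 tokenizer.json shows the same thing.
    let path = Api::new()?
        .model("microsoft/Phi-3-mini-4k-instruct".to_string())
        .get("tokenizer.json")?;
    // Loading alone is enough to emit the warnings; encoding still looks fine.
    let tokenizer = Tokenizer::from_file(&path).map_err(anyhow::Error::msg)?;
    let enc = tokenizer
        .encode("<|user|>Hello<|end|>", true)
        .map_err(anyhow::Error::msg)?;
    println!("{:?}", enc.get_ids());
    Ok(())
}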
Is there a way to disable this warning, or am I misconfiguring something? Thank you!
Fixed by this gist: https://gist.github.com/jneuff/682d47b786329f19291d166957b3274a
Seems to be an issue with the tokenizer.json file.
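For what it's worth, the mismatch is easy to see by comparing added_tokens against model.vocab inside the file; here is a quick diagnostic sketch (the path is just a placeholder):

use std::collections::HashMap;
use serde_json::Value;

fn main() -> anyhow::Result<()> {
    // Placeholder path; point this at the tokenizer.json that triggers the warnings.
    let raw = std::fs::read("tokenizer.json")?;
    let json: Value = serde_json::from_slice(&raw)?;
    let vocab: HashMap<String, u64> =
        serde_json::from_value(json["model"]["vocab"].clone())?;
    // Every added_tokens entry that is absent from model.vocab is one of the
    // tokens the loader warns about ("... was given ID 'None'").
    for tok in json["added_tokens"].as_array().into_iter().flatten() {
        let content = tok["content"].as_str().unwrap_or_default();
        if !vocab.contains_key(content) {
            println!("missing from model.vocab: {content} (id {})", tok["id"]);
        }
    }
    Ok(())
}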
Which files on the hub are you using? And which tokenizers version?
It's a bit weird and should not be happening.
@ArthurZucker, I am using this tokenizer file with tokenizers 0.19.1, pinned in Cargo.toml as:
tokenizers = "0.19.1"
Edit:
Loading with this function demonstrates the issue:
pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    Tokenizer::from_file(p).map_err(anyhow::Error::msg)
}
But this fixes it:
use std::{collections::HashMap, path::Path};

use anyhow::Result;
use serde_json::Value;
use tokenizers::Tokenizer;

// Local helper matching the added_tokens entries in tokenizer.json; only the
// fields used below are declared.
#[derive(serde::Deserialize)]
struct AddedToken {
    id: u64,
    content: String,
}

pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    // Write the patched file alongside the original so the fix only runs once.
    let fixed_path = format!("{}_mistralrs_fixed", p.as_ref().display());
    let fixed_path = Path::new(&fixed_path);
    if !fixed_path.exists() {
        let raw = std::fs::read(p.clone()).map_err(anyhow::Error::msg)?;
        let mut tokenizer: Value = serde_json::from_slice(&raw).unwrap();
        let added_tokens: Vec<AddedToken> =
            serde_json::from_value(tokenizer["added_tokens"].clone()).unwrap();
        let vocab: HashMap<String, usize> =
            serde_json::from_value(tokenizer["model"]["vocab"].clone()).unwrap();
        // Copy every added token that is missing from the model vocab into it,
        // so the ID lookup during deserialization no longer comes back empty.
        for token in added_tokens {
            if !vocab.contains_key(&token.content) {
                let prev = tokenizer["model"]["vocab"]
                    .as_object_mut()
                    .unwrap()
                    .insert(token.content, token.id.into());
                assert!(prev.is_none());
            }
        }
        let raw_fixed = serde_json::to_vec_pretty(&tokenizer).unwrap();
        std::fs::write(fixed_path, raw_fixed).unwrap();
    }
    Tokenizer::from_file(fixed_path).map_err(anyhow::Error::msg)
}
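A quick usage sketch of the patched loader (path and prompt are placeholders); the patched copy is written with a _mistralrs_fixed suffix so the original file stays untouched and the rewrite only happens on the first load:

fn main() -> anyhow::Result<()> {
    // With the patched file, the "was given ID 'None'" warnings no longer
    // appear at load time.
    let tokenizer = get_tokenizer("tokenizer.json")?;
    let enc = tokenizer
        .encode("<|user|>Hello<|end|>", true)
        .map_err(anyhow::Error::msg)?;
    println!("{:?}", enc.get_ids());
    Ok(())
}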