huggingface/tokenizers

Strange warnings with tokenizer for some models

EricLBuehler opened this issue · 3 comments

Hello all,

Thank you for your excellent work here! We are using Tokenizer::from_file to load the tokenizer.json file from the HF Hub. However, it produces many warnings when loading the Phi3 tokenizer:

2024-05-09T12:11:56.647710Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|endoftext|>' was expected to have ID '32000' but was given ID 'None'    
2024-05-09T12:11:56.647734Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|assistant|>' was expected to have ID '32001' but was given ID 'None'    
2024-05-09T12:11:56.647737Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder1|>' was expected to have ID '32002' but was given ID 'None'    
2024-05-09T12:11:56.647739Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder2|>' was expected to have ID '32003' but was given ID 'None'    
2024-05-09T12:11:56.647742Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder3|>' was expected to have ID '32004' but was given ID 'None'    
2024-05-09T12:11:56.647744Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder4|>' was expected to have ID '32005' but was given ID 'None'    
2024-05-09T12:11:56.647746Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|system|>' was expected to have ID '32006' but was given ID 'None'    
2024-05-09T12:11:56.647748Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|end|>' was expected to have ID '32007' but was given ID 'None'    
2024-05-09T12:11:56.647750Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder5|>' was expected to have ID '32008' but was given ID 'None'    
2024-05-09T12:11:56.647752Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|placeholder6|>' was expected to have ID '32009' but was given ID 'None'    
2024-05-09T12:11:56.647760Z  WARN tokenizers::tokenizer::serialization: Warning: Token '<|user|>' was expected to have ID '32010' but was given ID 'None'    

I have also noticed this for Phi2 and Llama3, although I see no tokenization errors in the encoded or decoded output.

Is there a way to disable this warning, or am I misconfiguring something? Thank you!

Fixed by this gist: https://gist.github.com/jneuff/682d47b786329f19291d166957b3274a

Seems to be an issue with the tokenizer.json file.
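
For reference, the mismatch inside tokenizer.json looks roughly like this (heavily abridged, the real file has the full vocab): the added_tokens array declares the special tokens with explicit IDs, but model.vocab never lists them, so looking up the expected ID yields None at load time, which is exactly what the warnings report:

{
  "added_tokens": [
    { "id": 32000, "content": "<|endoftext|>", "special": true },
    { "id": 32001, "content": "<|assistant|>", "special": true }
  ],
  "model": {
    "vocab": {
      "<unk>": 0,
      "<s>": 1,
      "</s>": 2
    }
  }
}

The gist patches the file by inserting each missing added token into model.vocab under its declared ID.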

Which files on the hub are you using? And which tokenizers version?
It's a bit weird and should not be happening.

@ArthurZucker, I am using tokenizers version 0.19.1 and this tokenizer file:

tokenizers = "0.19.1"

Edit:
Loading with this function demonstrates the issue:

use std::path::Path;

use anyhow::Result;
use tokenizers::Tokenizer;

// Load the tokenizer.json directly; this is what triggers the warnings.
pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    Tokenizer::from_file(p).map_err(anyhow::Error::msg)
}

But this fixes it:

use std::{collections::HashMap, path::Path};

use anyhow::Result;
use serde_json::Value;
use tokenizers::Tokenizer;

/// Minimal local mirror of an `added_tokens` entry in tokenizer.json;
/// only the two fields the fix needs are deserialized (serde ignores the rest).
#[derive(serde::Deserialize)]
struct AddedToken {
    id: u64,
    content: String,
}

pub(crate) fn get_tokenizer<P: AsRef<Path> + Clone>(p: P) -> Result<Tokenizer> {
    // Write the patched tokenizer.json next to the original and cache it.
    let fixed_path = format!("{}_mistralrs_fixed", p.as_ref().display());
    let fixed_path = Path::new(&fixed_path);

    if !fixed_path.exists() {
        let raw = std::fs::read(p.clone()).map_err(anyhow::Error::msg)?;
        let mut tokenizer: Value = serde_json::from_slice(&raw).unwrap();
        let added_tokens: Vec<AddedToken> =
            serde_json::from_value(tokenizer["added_tokens"].clone()).unwrap();
        let vocab: HashMap<String, usize> =
            serde_json::from_value(tokenizer["model"]["vocab"].clone()).unwrap();
        // Insert every added token missing from the model vocab,
        // using the ID declared in `added_tokens`.
        for token in added_tokens {
            if !vocab.contains_key(&token.content) {
                let old = tokenizer["model"]["vocab"]
                    .as_object_mut()
                    .unwrap()
                    .insert(token.content, token.id.into());
                // The key was absent, so nothing may have been overwritten.
                assert!(old.is_none());
            }
        }
        let raw_fixed = serde_json::to_vec_pretty(&tokenizer).unwrap();
        std::fs::write(fixed_path, raw_fixed).unwrap();
    }

    Tokenizer::from_file(fixed_path).map_err(anyhow::Error::msg)
}
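
With the patched loader the warnings are gone. A minimal sanity check, assuming a locally downloaded tokenizer.json (the path is just an example):

fn main() -> anyhow::Result<()> {
    // Example path; point this at the downloaded Phi3 tokenizer.json.
    let tokenizer = get_tokenizer("tokenizer.json")?;
    let encoding = tokenizer
        .encode("<|user|> Hello! <|end|>", true)
        .map_err(anyhow::Error::msg)?;
    println!("ids: {:?}", encoding.get_ids());
    Ok(())
}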