huggingface/tokenizers

Python Binding: Tokenizer.from_file() cannot parse JSON file of tokens

dwash96 opened this issue · 1 comment

I don't understand why the following token file:
https://github.com/Plachtaa/VALL-E-X/blob/master/utils/g2p/bpe_69.json

would throw this very unspecific error when the project definitely uses this library as a dependency and the file itself was presumably generated with Tokenizer.save at some point.

Traceback (most recent call last):
  File "/app/tts.py", line 335, in <module>
    cli()
  File "/app/tts.py", line 331, in cli
    convert_text_valle()
  File "/app/tts.py", line 294, in convert_text_valle
    from vallex.utils.generation import SAMPLE_RATE, generate_audio, preload_models
  File "/app/.venv/lib/python3.10/site-packages/vallex/utils/generation.py", line 63, in <module>
    text_tokenizer = PhonemeBpeTokenizer(tokenizer_path="./utils/g2p/bpe_69.json")
  File "/app/.venv/lib/python3.10/site-packages/vallex/utils/g2p/__init__.py", line 13, in __init__
    self.tokenizer = Tokenizer.from_file(tokenizer_path)
Exception: invalid type: integer `404`, expected struct Tokenizer at line 1 column 3

Has anyone seen this before and can you point me in the right direction? Is it somehow trying to load the file as a URL and getting a literal 404 not found error? I'm trying to read through the Rust code but so far, I have no intuition about what I am looking at.

I currently have version 0.19.1 installed, and I tried downgrading to 0.13.1 in case the token file itself is ill-formatted, but the same error gets thrown. Any help would be appreciated, thanks!

Well, never mind, I figured out that the VALL-E-X library was trying to download the token file from the wrong URL. Closing.
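
For anyone who lands here with the same message: a quick way to confirm this kind of failure is to look at what is actually on disk before the path reaches Tokenizer.from_file. The sketch below uses only the Python standard library. The helper name check_tokenizer_json is hypothetical, and the idea that a failed download leaves a body such as "404: Not Found" in the file is an inference from the error text above, not something confirmed in this thread.

```python
import json
from pathlib import Path


def check_tokenizer_json(path):
    """Return the parsed JSON object, or raise ValueError showing what the file holds.

    Hypothetical helper for illustration; it is not part of the tokenizers library.
    """
    raw = Path(path).read_text(encoding="utf-8", errors="replace")
    preview = raw[:60]  # enough to spot an HTTP error body or an HTML page
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(
            f"{path} is not valid JSON; it starts with {preview!r}"
        ) from exc
    # A tokenizer file is a JSON object at the top level, so a bare
    # number or string means the file is something else entirely.
    if not isinstance(data, dict):
        raise ValueError(
            f"{path} holds a JSON {type(data).__name__}, not an object; "
            f"it starts with {preview!r}"
        )
    return data
```

If the check passes, the same path can be handed to Tokenizer.from_file as in the traceback. If it raises, the preview in the message shows whether the file is a saved error response rather than a tokenizer definition.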