Question: Shrinking the Tokenizer Vocabulary to Reduce Memory Consumption When Fine-Tuning a Pre-Trained Model (LLaMA)
Hi,
I am working on fine-tuning a LLaMA model and want to reduce the tokenizer vocabulary size to optimize memory consumption. Specifically, I would like to:
- Retain special tokens, English characters, symbols, and numbers.
- Remove tokens related to other languages, which I don't need. A rough sketch of the filtering I have in mind follows this list.
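For reference, this is roughly the filtering I was imagining. It is only an untested sketch: it assumes the LLaMA tokenizer ships as a SentencePiece `tokenizer.model`, and that "keep every non-normal piece (special and byte-fallback tokens) plus normal pieces made purely of ASCII" is the right criterion; the file names are placeholders.

```python
# Untested sketch: prune a SentencePiece-based LLaMA tokenizer down to
# special/byte-fallback pieces plus ASCII-only pieces.
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:   # original LLaMA tokenizer (placeholder path)
    m.ParseFromString(f.read())

def keep(piece):
    # Keep everything that is not a NORMAL piece (type 1), i.e. control tokens
    # like <s>/</s>/<unk> and the <0xNN> byte-fallback pieces.
    if piece.type != 1:
        return True
    # For normal pieces, keep only ASCII text; "▁" is SentencePiece's
    # whitespace marker, so strip it before the check.
    text = piece.piece.replace("▁", "")
    return all(ord(c) < 128 for c in text)

pruned = sp_pb2.ModelProto()
pruned.CopyFrom(m)
del pruned.pieces[:]

kept_ids = []                              # ids in the ORIGINAL vocab, needed later
for i, p in enumerate(m.pieces):
    if keep(p):
        pruned.pieces.add().CopyFrom(p)
        kept_ids.append(i)

with open("tokenizer_pruned.model", "wb") as f:
    f.write(pruned.SerializeToString())
```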
My questions are:
- Is it feasible to shrink the tokenizer vocabulary in this way and still fine-tune the pre-trained model without significantly affecting its performance?
- What are the recommended approaches or tools for modifying the tokenizer vocabulary in such cases?
- Are there any caveats I should be aware of when making this change (e.g., issues with the token embeddings or their alignment with the pre-trained model)? A sketch of the embedding realignment I am picturing is included after this list.
- Is reducing the vocabulary size a good idea at all? Can it meaningfully reduce memory consumption and speed up generation?
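For context on the embedding-alignment question, this is the kind of realignment I was picturing (again only an untested sketch; `kept_ids` is the list of original-vocabulary ids produced by the pruning sketch above, and the checkpoint name is a placeholder):

```python
# Untested sketch: slice the input embeddings and LM head so that row j of the
# new matrices corresponds to original-vocab id kept_ids[j].
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder
idx = torch.tensor(kept_ids)

old_emb = model.get_input_embeddings()                 # nn.Embedding [old_vocab, hidden]
new_emb = torch.nn.Embedding(len(idx), old_emb.embedding_dim)
new_emb.weight.data = old_emb.weight.data[idx].clone()
model.set_input_embeddings(new_emb)

old_head = model.get_output_embeddings()               # nn.Linear: hidden -> old_vocab
new_head = torch.nn.Linear(old_head.in_features, len(idx), bias=False)
new_head.weight.data = old_head.weight.data[idx].clone()
model.set_output_embeddings(new_head)

model.config.vocab_size = len(idx)                     # keep the config consistent
```

Is something along these lines the usual way to do it, or is there a more standard tool for this?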
Any guidance or references to similar implementations would be greatly appreciated.
Thank you!