eval_loss shows NaN, and train_loss decreases then goes to NaN after a couple of steps, while fine-tuning a Gemma model with additional vocab
sidtandon2014 opened this issue · 2 comments
System Info
I am trying to fine-tune the Gemma 7B model in 4-bit with an extended vocabulary, using the configuration below, but I am getting NaN in both train and eval loss. The train loss first decreases for a couple of steps and then turns to NaN.
import os

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id + MODEL_NAME,
    quantization_config=bnb_config,
    token=os.environ["HF_TOKEN"],
    device_map={"": device_string},
    use_cache=False,
)
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
)
model = get_peft_model(model, lora_config)
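As a quick sanity check that the intended modules became trainable, the PEFT wrapper can report the parameter counts:

model.print_trainable_parameters()  # prints trainable vs. total parameter counts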
In order to update the vocab, I extended the SentencePiece model instead of using the add_tokens method (FYI: add_tokens degrades token quality):
huggingface/tokenizers#627 (comment)
https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb
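For reference, a minimal sketch of that extension step in the spirit of the linked notebook; the file paths and token strings below are placeholders, not the ones from my actual run:

from sentencepiece import sentencepiece_model_pb2 as sp_pb2

m = sp_pb2.ModelProto()
with open("tokenizer.model", "rb") as f:  # placeholder path to the base tokenizer
    m.ParseFromString(f.read())

for tok in ["<new_token_0>", "<new_token_1>"]:  # placeholder new tokens
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = tok
    piece.score = 0.0
    m.pieces.append(piece)

with open("tokenizer_extended.model", "wb") as f:
    f.write(m.SerializeToString())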
Along with this, before training I am setting the embedding values to 0 for all new tokens:
# zero-initialize the embedding rows for the newly added tokens
emb_dim = model.model.embed_tokens.weight.shape
with torch.no_grad():
    model.model.embed_tokens.weight[-NEW_TOKENS:] = torch.zeros((NEW_TOKENS, emb_dim[1]))
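Note that this slicing assumes the embedding matrix has already been resized to the extended vocab; with the extended tokenizer loaded, that resize is:

model.resize_token_embeddings(len(tokenizer))  # also covers lm_head when embeddings are tied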
Training arguments:
from transformers import TrainingArguments

args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    save_steps=200,
    save_total_limit=20,
    save_strategy="steps",
    evaluation_strategy="steps",
    eval_steps=200,
    logging_steps=200,
    warmup_steps=2,
    num_train_epochs=EPOCHS,
    # max_steps=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.001,
    max_grad_norm=1.0,
    fp16=False,
    bf16=True,
    logging_strategy="steps",
    output_dir=output_dir,
    optim="paged_adamw_8bit",
    seed=42,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    # accelerator_config={"split_batches": True},
    report_to=None,
)
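For completeness, a minimal sketch of how these arguments are wired into a Trainer; train_dataset, eval_dataset, and tokenizer stand in for my tokenized splits and the extended tokenizer:

from transformers import DataCollatorForLanguageModeling, Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # tokenized train split (placeholder)
    eval_dataset=eval_dataset,    # tokenized validation split (placeholder)
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()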
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the
examples
folder - My own task or dataset (give details below)
Reproduction
Task: Translate Sanskrit to English
Dataset: "rahular/itihasa" (loaded as shown below)
Loss snapshot: {'eval_loss': nan, 'eval_runtime': 708.8687, 'eval_samples_per_second': 13.125, 'eval_steps_per_second': 1.641, 'epoch': 0.15}
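The dataset itself loads in one line with the datasets library:

from datasets import load_dataset

dataset = load_dataset("rahular/itihasa")  # Sanskrit-English translation pairs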
Expected behavior
Validation loss should not be NaN
Can you try running this additional snippet:

model = get_peft_model(...)

# convert all trainable (PEFT) parameters to float32
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.float()
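For context on why this can help: with a 4-bit base model and bf16 compute, keeping the small set of trainable adapter and embedding weights in float32 is a common stability measure, and PEFT's prepare_model_for_kbit_training applies a similar float32 upcast for the same reason.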