redotvideo/haven

Llamatune fails with your example code from its home page

Steps to reproduce

  1. Start a RunPod container with the PyTorch 2.0.1 template and plenty of disk space.
  2. Run your sample command on a properly formatted dataset (the layout I assumed is sketched after the traceback below):
    python -m llamatune.train \
        --model_name meta-llama/Llama-2-13b-chat-hf \
        --data_path master_qa.json \
        --training_recipe lora \
        --batch_size 8 \
        --gradient_accumulation_steps 4 \
        --learning_rate 1e-4 \
        --output_dir chat_llama2_13b \
        --use_auth_token xxxzzz
  3. The result is:
    Model ready for training!
    trainable params: 250347520 || all params: 6922337280 || trainable: 3.616517223500557
    WARNING:root:Loading data...
    WARNING:root:Tokenizing inputs... This may take some time...
    config TrainingConfig(model_name='meta-llama/Llama-2-13b-chat-hf', data_path='master_qa.json', output_dir='chat_llama2_13b', training_recipe='lora', optim='paged_adamw_8bit', batch_size=8, gradient_accumulation_steps=4, n_epochs=3, weight_decay=0.0, learning_rate=0.0001, max_grad_norm=0.3, gradient_checkpointing=True, do_train=True, lr_scheduler_type='cosine', warmup_ratio=0.03, logging_steps=1, group_by_length=True, save_strategy='epoch', save_total_limit=3, fp16=True, tokenizer_type='llama', trust_remote_code=False, compute_dtype=torch.float16, max_tokens=4096, do_eval=True, evaluation_strategy='epoch', use_auth_token='xxxzzz', use_fast=False, bits=4, double_quant=True, quant_type='nf4', lora_r=64, lora_alpha=16, lora_dropout=0.0)
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/usr/local/lib/python3.10/dist-packages/llamatune/train.py", line 50, in <module>
        trainer.train()
      File "/usr/local/lib/python3.10/dist-packages/llamatune/trainer.py", line 25, in train
        self.model_engine.train(data_module=self.data_module)
      File "/usr/local/lib/python3.10/dist-packages/llamatune/model_engines/llama_model_engine.py", line 33, in train
        trainer = Trainer(
      File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 405, in __init__
        raise ValueError(
    ValueError: The model you want to train is loaded in 8-bit precision. if you want to fine-tune an 8-bit model, please make sure that you have installed bitsandbytes>=0.41.1.
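
For completeness, here is the shape of master_qa.json I used. The "Loading data... / Tokenizing inputs..." warnings above are the same ones the Stanford Alpaca training script prints, so I assumed llamatune wants an Alpaca-style JSON list; the field names below are my assumption, not documented llamatune behavior.

    # Sketch of the dataset layout assumed for master_qa.json.
    # Assumption: Alpaca-style instruction/input/output records; check
    # llamatune's docs for the authoritative schema before training.
    import json

    examples = [
        {
            "instruction": "What is the capital of France?",
            "input": "",
            "output": "Paris.",
        },
        {
            "instruction": "Summarize the text.",
            "input": "Llama 2 is a family of openly released language models.",
            "output": "Llama 2 is a set of open-weight LLMs.",
        },
    ]

    with open("master_qa.json", "w") as f:
        json.dump(examples, f, indent=2)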

Hello @IridiumMaster
I encountered a similar problem and managed to resolve it by executing:

pip install bitsandbytes==0.41.1
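
If you want to confirm the upgrade actually took effect in the container before relaunching a long run, here is a minimal check (standard library only, nothing llamatune-specific):

    # Minimal sanity check, assuming only that bitsandbytes is pip-installed.
    from importlib.metadata import version

    bnb = version("bitsandbytes")
    print("bitsandbytes", bnb)
    # The ValueError above is raised when this is older than 0.41.1.
    assert tuple(map(int, bnb.split(".")[:3])) >= (0, 41, 1), "still too old"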

Aye, that is the correct fix. Hoping the maintainers update their requirements.txt.
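
Something like this in requirements.txt should do it; I'm guessing at the file's exact contents, so treat it as a sketch:

    # Hypothetical requirements.txt pin; transformers' 8-bit fine-tuning
    # path refuses to run with anything older.
    bitsandbytes>=0.41.1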