Llamatune fails with your example code from its home page
Opened this issue · 2 comments
IridiumMaster commented
steps to reproduce
- start a RunPod container with the PyTorch 2.0.1 template and lots of disk space
- run your sample command on a properly formatted dataset:
python -m llamatune.train \
  --model_name meta-llama/Llama-2-13b-chat-hf \
  --data_path master_qa.json \
  --training_recipe lora \
  --batch_size 8 \
  --gradient_accumulation_steps 4 \
  --learning_rate 1e-4 \
  --output_dir chat_llama2_13b \
  --use_auth_token xxxzzz
- result is:
Model ready for training!
trainable params: 250347520 || all params: 6922337280 || trainable: 3.616517223500557
WARNING:root:Loading data...
WARNING:root:Tokenizing inputs... This may take some time...
config TrainingConfig(model_name='meta-llama/Llama-2-13b-chat-hf', data_path='master_qa.json', output_dir='chat_llama2_13b', training_recipe='lora', optim='paged_adamw_8bit', batch_size=8, gradient_accumulation_steps=4, n_epochs=3, weight_decay=0.0, learning_rate=0.0001, max_grad_norm=0.3, gradient_checkpointing=True, do_train=True, lr_scheduler_type='cosine', warmup_ratio=0.03, logging_steps=1, group_by_length=True, save_strategy='epoch', save_total_limit=3, fp16=True, tokenizer_type='llama', trust_remote_code=False, compute_dtype=torch.float16, max_tokens=4096, do_eval=True, evaluation_strategy='epoch', use_auth_token='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', use_fast=False, bits=4, double_quant=True, quant_type='nf4', lora_r=64, lora_alpha=16, lora_dropout=0.0)
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.10/dist-packages/llamatune/train.py", line 50, in <module>
trainer.train()
File "/usr/local/lib/python3.10/dist-packages/llamatune/trainer.py", line 25, in train
self.model_engine.train(data_module=self.data_module)
File "/usr/local/lib/python3.10/dist-packages/llamatune/model_engines/llama_model_engine.py", line 33, in train
trainer = Trainer(
File "/usr/local/lib/python3.10/dist-packages/transformers/trainer.py", line 405, in __init__
raise ValueError(
ValueError: The model you want to train is loaded in 8-bit precision. if you want to fine-tune an 8-bit model, please make sure that you have installed bitsandbytes>=0.41.1.
jayantkhannadocplix1 commented
Hello @IridiumMaster
I encountered a similar problem and resolved it by running:
pip install bitsandbytes==0.41.1
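To confirm whether an environment is affected before starting a long run, something like the following sketch may help. It reads the installed bitsandbytes version with the standard-library `importlib.metadata` and compares it to the 0.41.1 minimum quoted in the error above. The helper names (`parse_version`, `meets_minimum`, `installed_bitsandbytes`) are made up for illustration, and the version parsing is deliberately simple: it only looks at the leading numeric components, so pre-release tags are not handled.

```python
from importlib import metadata

# Minimum version taken from the ValueError quoted in this thread.
MIN_VERSION = "0.41.1"


def parse_version(text):
    """Turn '0.41.1' into (0, 41, 1), stopping at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum=MIN_VERSION):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_bitsandbytes():
    """Return the installed bitsandbytes version string, or None if absent."""
    try:
        return metadata.version("bitsandbytes")
    except metadata.PackageNotFoundError:
        return None


version = installed_bitsandbytes()
if version is None:
    print("bitsandbytes is not installed")
elif meets_minimum(version):
    print(f"bitsandbytes {version} satisfies >= {MIN_VERSION}")
else:
    print(f"bitsandbytes {version} is too old; need >= {MIN_VERSION}")
```

If the last branch fires, the pip command above should bring the environment in line.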
IridiumMaster commented
Aye, that is the correct fix. Hoping the maintainers update their requirements.txt.