sshh12/multi_token

wait, what is that in my training?

guilh00009 opened this issue · 1 comment

training code:

!!cd multi_token && deepspeed scripts/train_model.py \
    --model_name_or_path Guilherme34/Samantha-v2 \
    --model_cls MistralLMMForCausalLM \
    --modality_builder vision_clip \
    --dataset_path /content/conversation_58k.json \
    --output_dir /data/output/my_lmm_pretrain \
    --pretrain_projectors \
    --lora_enable True \
    --fp16 True \
    --tf32 False \
    --num_train_epochs 1 \
    --gradient_checkpointing True \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 32 \
    --model_max_length 2048 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --save_total_limit 1 \
    --learning_rate 1e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --dataloader_num_workers 2 \
    --logging_steps 1 \
    --report_to wandb \
    --deepspeed ./configs/zero2.json

training output (the `!!` prefix makes IPython capture the command's output as a list of strings, which is why each line below is quoted):
['[2024-01-06 17:01:39,013] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)',
'[2024-01-06 17:01:41,195] [WARNING] [runner.py:196:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.',
'[2024-01-06 17:01:41,195] [INFO] [runner.py:555:main] cmd = /usr/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None scripts/train_model.py --model_name_or_path Guilherme34/Samantha-v2 --model_cls MistralLMMForCausalLM --modality_builder vision_clip --dataset_path /content/conversation_58k.json --output_dir /data/output/my_lmm_pretrain --pretrain_projectors --lora_enable True --fp16 True --tf32 False --num_train_epochs 1 --gradient_checkpointing True --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 32 --model_max_length 2048 --evaluation_strategy no --save_strategy steps --save_steps 1000 --save_total_limit 1 --learning_rate 1e-5 --weight_decay 0. --warmup_ratio 0.03 --lr_scheduler_type cosine --dataloader_num_workers 2 --logging_steps 1 --report_to wandb --deepspeed ./configs/zero2.json',
'[2024-01-06 17:01:43,655] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)',
'[2024-01-06 17:01:46,134] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE=libnccl-dev=2.19.3-1+cuda12.2',
'[2024-01-06 17:01:46,134] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_VERSION=2.19.3-1',
'[2024-01-06 17:01:46,134] [INFO] [launch.py:138:main] 0 NCCL_VERSION=2.19.3-1',
'[2024-01-06 17:01:46,134] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_DEV_PACKAGE_NAME=libnccl-dev',
'[2024-01-06 17:01:46,134] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE=libnccl2=2.19.3-1+cuda12.2',
'[2024-01-06 17:01:46,134] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_NAME=libnccl2',
'[2024-01-06 17:01:46,134] [INFO] [launch.py:138:main] 0 NV_LIBNCCL_PACKAGE_VERSION=2.19.3-1',
"[2024-01-06 17:01:46,135] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0]}",
'[2024-01-06 17:01:46,135] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=1, node_rank=0',
"[2024-01-06 17:01:46,135] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})",
'[2024-01-06 17:01:46,135] [INFO] [launch.py:163:main] dist_world_size=1',
'[2024-01-06 17:01:46,135] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0',
'2024-01-06 17:01:49.779286: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered',
'2024-01-06 17:01:49.779332: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered',
'2024-01-06 17:01:49.780699: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered',
'2024-01-06 17:01:50.922045: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT',
'[2024-01-06 17:01:52,253] [INFO] [real_accelerator.py:110:get_accelerator] Setting ds_accelerator to cuda (auto detect)',
'[2024-01-06 17:01:52,387] [WARNING] [comm.py:152:init_deepspeed_backend] NCCL backend in DeepSpeed not yet implemented',
'[2024-01-06 17:01:52,387] [INFO] [comm.py:594:init_distributed] cdb=None',
'[2024-01-06 17:01:52,387] [INFO] [comm.py:625:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl',
'Traceback (most recent call last):',
' File "/content/multi_token/scripts/train_model.py", line 29, in ',
' train_for_modalities(model_cls, training_args, model_args, data_args, modalities)',
' File "/content/multi_token/multi_token/training.py", line 162, in train_for_modalities',
' tokenizer = transformers.AutoTokenizer.from_pretrained(',
' File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/tokenization_auto.py", line 768, in from_pretrained',
' return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)',
' File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained',
' return cls._from_pretrained(',
' File "/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained',
' tokenizer = cls(*init_inputs, **init_kwargs)',
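
The crash happens inside AutoTokenizer.from_pretrained, before any training starts, so it can be reproduced without DeepSpeed at all. A minimal check, assuming only the repo id from the command above:

```python
# Minimal repro of the failing step, outside the training script.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Guilherme34/Samantha-v2",
    use_fast=False,  # the slow Llama/Mistral tokenizer needs tokenizer.model (SentencePiece)
)
```

If the checkpoint repo is missing its tokenizer files, this raises in the same `cls(*init_inputs, **init_kwargs)` call shown in the traceback.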

Oh, it's nothing; the model repo was just missing the tokenizer.model file.
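
For anyone hitting the same error: a minimal sketch of one way to fill the gap, assuming Samantha-v2 is a Mistral fine-tune so the base model's tokenizer is compatible (the base repo id and local path below are assumptions, not confirmed by this issue):

```python
# Fetch the SentencePiece tokenizer from the (assumed) base model and save
# it next to the fine-tuned checkpoint so AutoTokenizer.from_pretrained
# can find tokenizer.model there.
from transformers import AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # assumed base model
base_tok.save_pretrained("./Samantha-v2-local")  # hypothetical local checkpoint dir
```

After that, pointing --model_name_or_path at the local directory (or uploading the tokenizer files to the hub repo) lets the training command load the tokenizer normally.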