RL training RuntimeError: CUDA error: device-side assert triggered
balcklive opened this issue · 2 comments
I ran my reward training and actor training successfully. When I run the RL training, the program stops as shown below:
(py38) [xx@xx~]$ accelerate launch artifacts/main.py artifacts/config/config.yaml --type RL
Current device used :cuda
No previous model found at /home/zhoux/models/actor_model/actor for model gpt2-large.pt
No previous model found at /home/zhoux/models/critic_model/critic for model gpt2-large.pt
Initializing Critic from Reward model...
No previous model found at /home/zhoux/models/critic_model/reward for model gpt2-large.pt
Critic Model remains uninitialized
No previous model found at /home/zhoux/models/critic_model/critic for model gpt2-large.pt
No previous model found at /home/zhoux/models/reward_model/reward for model gpt2-large.pt
Start RL Training
Looking for checkpoints...
No previous checkpoint found at /home/zhoux/models/critic_model/checkpoints/critic for gpt2-large.pt
Looking for checkpoints...
No previous checkpoint found at /home/zhoux/models/actor_model/checkpoints/actor_rl for gpt2-large.pt
Clearing conversations log
Episode: 1/100, Timestep: 1/1 Learning Cnt: 1/1
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
and then I got this error:
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [24,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [25,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [26,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
File "artifacts/main.py", line 48, in
rlhf_trainer.train()
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 1092, in train
) = self.actorcritic.generate(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "<@beartype(chatllama.rlhf.trainer.ActorCritic.generate) at 0x7f45ae974790>", line 70, in generate
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 310, in generate
actions, sequences_actor = self.actor.generate(
File "<@beartype(chatllama.rlhf.actor.ActorModel.generate) at 0x7f45ae952a60>", line 51, in generate
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 222, in generate
sequences = self.model.generate(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 1406, in generate
return self.greedy_search(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 2201, in greedy_search
outputs = self(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1075, in forward
transformer_outputs = self.transformer(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
outputs = block(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 389, in forward
attn_outputs = self.attn(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 330, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 200, in _attn
mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
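For what it's worth, these device-side asserts are reported asynchronously, so the Python traceback above may not point at the operation that actually failed. Setting CUDA_LAUNCH_BLOCKING=1 makes the traceback land on the failing kernel, and running a single step on CPU turns the same failure into a plain "IndexError: index out of range in self". The assert itself just means an embedding / index_select lookup received an index outside its table, which the minimal sketch below (illustration only, not taken from chatllama) reproduces:

```python
# Minimal sketch: the indexSelectSmallIndex assert fires when an embedding
# lookup receives an index >= the table size. Illustration only.
import torch

table = torch.nn.Embedding(num_embeddings=1024, embedding_dim=8).cuda()
table(torch.tensor([0, 1023], device="cuda"))   # in range: fine
table(torch.tensor([1024], device="cuda"))      # out of range: device-side assert
torch.cuda.synchronize()                        # the asynchronous error surfaces at (or before) this sync

# On CPU the same lookup raises "IndexError: index out of range in self",
# which is much easier to trace back to the offending tensor.
```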
My GPU is an A100 with 80 GB of memory. My config.yaml is as below:
trainer_config:
  # learning rates
  actor_lr: 0.00001
  critic_lr: 0.00001
  # PPO hyperparameters
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.02
  # path to the examples to be sampled (training dataset), see rlhf_dataset.json
  examples_path: "./datasets/rlhf_training_data.json"
  # number of episodes, and generations performed for each episode,
  # in the train() method
  num_episodes: 100
  max_timesteps: 1
  # number of timesteps after which the learn() method is called
  # (to update the weights)
  update_timesteps: 1
  # number of examples sampled at each timestep
  num_examples: 8
  # batch size and epochs for the training
  batch_size: 1
  epochs: 1
  # number of learning steps (i.e. learn() calls) after which a checkpoint is saved
  checkpoint_steps: 5
  checkpoint_name: null

actor_config:
  model: "gpt2-large"
  model_folder: "/home/zhoux/models/actor_model"
  tokenizer_path: "/home/zhoux/models/actor_tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  # freeze the model embeddings during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  use_fairscale: False
  # max sequence length for the actor (i.e. prompt + completion); it depends on
  # the model used
  max_sequence_length: 2048
  # max tokens generated by the actor (completion only)
  max_tokens: 2048
  # minimum number of tokens generated by the actor
  min_tokens: 100
  # additional prompt tokens to be used for templates or as a safety margin
  additonal_prompt_tokens: 20
  # temperature for the actor
  temperature: 0.5
  batch_size: 1
  # number of iterations between prints
  iteration_per_print: 1
  lr: 0.000009
  epochs: 2
  # number of backpropagation steps after which a checkpoint is saved
  checkpoint_steps: 10000000000
  # number of checkpoints to keep while removing the older ones
  # (keeps memory consumption of checkpoints reasonable)
  n_checkpoints_to_keep: 5
  # name of the actor checkpoint from which to resume during actor training;
  # if null, load the last one
  checkpoint_name: null
  # deepspeed settings
  deepspeed_enable: False
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False

reward_config:
  # models that can be chosen: gpt2-large, bart-base, longformer-base-4096;
  # more can simply be added in reward.py's init()
  model: "gpt2-large"
  model_folder: "/home/zhoux/models/reward_model"
  # hidden size of the additional ffw head that produces the scores
  model_head_hidden_size: 2048
  max_sequence_length: 1024
  train_dataset_path: "./datasets/reward_training_data.json"
  validation_dataset_path: null
  batch_size: 1
  epochs: 2
  iteration_per_print: 1
  # steps after which a checkpoint is saved
  checkpoint_steps: 10000000000
  # name of the reward checkpoint from which to resume during reward training;
  # if null, load the last one
  checkpoint_name: null
  lr: 0.000009
  # deepspeed settings
  deepspeed_enable: False
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False

critic_config:
  # models that can be chosen: gpt2-large, bart-base, longformer-base-4096;
  # more can simply be added in reward.py's init()
  model: "gpt2-large"
  # hidden size of the additional ffw head that produces the scores
  model_head_hidden_size: 2048
  max_sequence_length: 1024
  model_folder: "/home/zhoux/models/critic_model"
  # name of the critic checkpoint from which to resume during critic training;
  # if null, load the last one
  checkpoint_name: null
What is wrong with it? I can't figure out why...
Hi @balcklive, thanks for reaching out.
The problem can probably be solved by slightly lowering the max_sequence_length parameter in config.yaml.
Let me know if this solves your problem.
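Most likely this is because gpt2-large only has 1024 learned position embeddings, so once prompt + completion grows past 1024 tokens the position-id lookup indexes beyond that table, which matches the `srcIndex < srcSelectDimSize` assert above. A quick way to check the limit of the base model (sketch):

```python
# Sketch: the context window of the actor's base model.
from transformers import AutoConfig

print(AutoConfig.from_pretrained("gpt2-large").n_positions)  # 1024
# Both max_sequence_length (prompt + completion) and max_tokens (completion
# only) in actor_config need to fit inside this window.
```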
Yes, I lowered the actor model's max_sequence_length from 2048 to 1024, and now it works! Thank you!
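For anyone hitting the same assert: a small pre-flight check of config.yaml against each base model's context window would catch this before training starts. The snippet below is a hypothetical sketch (not part of chatllama) and assumes PyYAML and transformers are installed:

```python
# Hypothetical pre-flight check: warn when a configured max_sequence_length
# exceeds the context window of its base model.
import yaml
from transformers import AutoConfig

with open("artifacts/config/config.yaml") as f:
    cfg = yaml.safe_load(f)

for section in ("actor_config", "reward_config", "critic_config"):
    params = cfg[section]
    model_cfg = AutoConfig.from_pretrained(params["model"])
    # GPT-2 exposes its context window as n_positions; most other models
    # expose max_position_embeddings.
    window = getattr(model_cfg, "n_positions", None) or model_cfg.max_position_embeddings
    if params["max_sequence_length"] > window:
        print(f"{section}: max_sequence_length={params['max_sequence_length']} "
              f"exceeds the {window}-token context window of {params['model']}")
```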