nebuly-ai/optimate

RL training RuntimeError: CUDA error: device-side assert triggered

balcklive opened this issue · 2 comments

My reward training and actor training both ran successfully. When I run RL training, the program pauses at the output below:

(py38) [xx@xx~]$ accelerate launch artifacts/main.py artifacts/config/config.yaml --type RL
Current device used :cuda
No previous model found at /home/zhoux/models/actor_model/actor for model gpt2-large.pt
No previous model found at /home/zhoux/models/critic_model/critic for model gpt2-large.pt
Initializing Critic from Reward model...
No previous model found at /home/zhoux/models/critic_model/reward for model gpt2-large.pt
Critic Model remains uninitialized
No previous model found at /home/zhoux/models/critic_model/critic for model gpt2-large.pt
No previous model found at /home/zhoux/models/reward_model/reward for model gpt2-large.pt
Start RL Training
Looking for checkpoints...
No previous checkpoint found at /home/zhoux/models/critic_model/checkpoints/critic for gpt2-large.pt
Looking for checkpoints...
No previous checkpoint found at /home/zhoux/models/actor_model/checkpoints/actor_rl for gpt2-large.pt
Clearing conversations log
Episode: 1/100, Timestep: 1/1 Learning Cnt: 1/1
Setting pad_token_id to eos_token_id:50256 for open-end generation.

Then I got this error:
ad: [24,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [25,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [26,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [27,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [28,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [29,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [30,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1093: indexSelectSmallIndex: block: [7,0,0], thread: [31,0,0] Assertion srcIndex < srcSelectDimSize failed.
Traceback (most recent call last):
File "artifacts/main.py", line 48, in <module>
rlhf_trainer.train()
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 1092, in train
) = self.actorcritic.generate(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "<@beartype(chatllama.rlhf.trainer.ActorCritic.generate) at 0x7f45ae974790>", line 70, in generate
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/chatllama/rlhf/trainer.py", line 310, in generate
actions, sequences_actor = self.actor.generate(
File "<@beartype(chatllama.rlhf.actor.ActorModel.generate) at 0x7f45ae952a60>", line 51, in generate
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/chatllama/rlhf/actor.py", line 222, in generate
sequences = self.model.generate(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 1406, in generate
return self.greedy_search(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/generation/utils.py", line 2201, in greedy_search
outputs = self(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1075, in forward
transformer_outputs = self.transformer(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 899, in forward
outputs = block(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 389, in forward
attn_outputs = self.attn(
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 330, in forward
attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
File "/home/zhoux/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 200, in _attn
mask_value = torch.full([], mask_value, dtype=attn_weights.dtype).to(attn_weights.device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
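
As the message above says, CUDA kernel errors can be reported asynchronously, so the frame shown in the traceback (the `torch.full` call in `_attn`) is not necessarily where the failing index lookup happened. A minimal sketch of re-running the same command with `CUDA_LAUNCH_BLOCKING=1` follows. The helper name `blocking_launch_env` is hypothetical and not part of chatllama; the command is the one used in this report.

```python
import os


def blocking_launch_env(base_env=None):
    """Return a copy of the environment with CUDA_LAUNCH_BLOCKING=1 set.

    blocking_launch_env is a hypothetical helper name, not part of chatllama.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Ask CUDA to run kernel launches synchronously, so the error should be
    # reported at the call that triggered it rather than at a later one.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# The RL command from this report. It can be re-run with the modified
# environment, for example:
#   subprocess.run(RL_COMMAND, env=blocking_launch_env(), check=False)
RL_COMMAND = [
    "accelerate", "launch", "artifacts/main.py",
    "artifacts/config/config.yaml", "--type", "RL",
]
```

Building a copy of the environment keeps the setting local to the debugging run, since synchronous launches slow training down.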

My GPU is an A100 with 80 GB of memory. My config.yaml is below:

trainer_config:
  # learning rates
  actor_lr: 0.00001
  critic_lr: 0.00001
  # PPO Hyperparameters
  actor_eps_clip: 0.2
  critic_eps_clip: 0.2
  beta_s: 0.02
  # path to examples to be sampled (training dataset) see rlhf_dataset.json
  examples_path: "./datasets/rlhf_training_data.json"
  # number of episodes and generation performed for each episode
  # in the train() method
  num_episodes: 100
  max_timesteps: 1
  # number of timesteps after which the learn() method is called
  # (to update the weights)
  update_timesteps: 1
  # number of example sampled at each timestep
  num_examples: 8
  # batch and epochs for the training
  batch_size: 1
  epochs: 1
  # number of learning steps (i.e. learn()) after which a checkpoint is saved
  checkpoint_steps: 5
  checkpoint_name: null

actor_config:
  model: "gpt2-large"
  model_folder: "/home/zhoux/models/actor_model"
  tokenizer_path: "/home/zhoux/models/actor_tokenizer"
  train_dataset_path: "./datasets/actor_training_data.json"
  validation_dataset_path: null
  # froze model embedding during training
  froze_embeddings: True
  # use fairscale layers to build the model instead of vanilla pytorch
  use_fairscale: False
  # max sequence length for the actor (i.e. prompt + completion) it depends on
  # the model used.
  max_sequence_length: 2048
  # max tokens generated by the actor (completion only)
  max_tokens: 2048
  # minimum number of tokens generated by the actor
  min_tokens: 100
  # additional prompt tokens to be used for template or as safety
  additonal_prompt_tokens: 20
  # temperature for the actor
  temperature: 0.5
  batch_size: 1
  # number iteration after print
  iteration_per_print: 1
  lr: 0.000009
  epochs: 2
  # number of backpropagation after saving the checkpoints
  checkpoint_steps: 10000000000
  # number of checkpoints to keep while removing the older
  # (keep memory consumption of checkpoints reasonable)
  n_checkpoints_to_keep: 5
  # here specify the name of the actor checkpoint from which resume
  # during actor training. If null load the last one.
  checkpoint_name: null
  # deepspeed settings
  deepspeed_enable: False
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False

reward_config:
  # model to be chosen are gp2-large, bart-base, longformer-base-4096
  # more can be simply added in the reward.py init()
  model: "gpt2-large"
  model_folder: "/home/zhoux/models/reward_model"
  # hidden size of the additional ffw head to produce the scores
  model_head_hidden_size: 2048
  max_sequence_length: 1024
  train_dataset_path: "./datasets/reward_training_data.json"
  validation_dataset_path: null
  batch_size: 1
  epochs: 2
  iteration_per_print: 1
  # steps after which the checkpoint are saved
  checkpoint_steps: 10000000000
  # here specify the name of the reward checkpoint from which resume
  # during reward training. If null load the last one.
  checkpoint_name: null
  lr: 0.000009
  # deepspeed settings
  deepspeed_enable: False
  deepspeed_config_path: "./artifacts/config/ds_config.json"
  # accelerate settings
  accelerate_enable: False

critic_config:
  # model to be chosen are gp2-large, bart-base, longformer-base-4096
  # more can be simply added in the reward.py init()
  model: "gpt2-large"
  # hidden size of the additional ffw head to produce the scores
  model_head_hidden_size: 2048
  max_sequence_length: 1024
  model_folder: "/home/zhoux/models/critic_model"
  # here specify the name of the critic checkpoint from which resume
  # during critic training. If null load the last one.
  checkpoint_name: null

What is wrong with it? I can't figure out why this happens.

Hi @balcklive, thanks for reaching out.
The problem can probably be solved by slightly lowering the max_sequence_length parameter in config.yaml.
Let me know if this solves your problem.

Yes, I lowered the actor's max_sequence_length from 2048 to 1024, and now it works. Thank you!
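
The fix is consistent with the GPT-2 checkpoints having 1024 positions (`n_positions`): sequences longer than that index past the position-embedding table, which would explain the `srcIndex < srcSelectDimSize` assertion. The thread does not state this cause explicitly, so treat it as an assumption. Under that assumption, a small check run before training could flag the mismatch. The function `find_length_violations` and the table `MODEL_MAX_POSITIONS` are hypothetical names for illustration, not part of chatllama, and the config dict mirrors only the relevant keys from the config.yaml above.

```python
# Positional limit (n_positions) of the GPT-2 checkpoints: 1024 for every size.
MODEL_MAX_POSITIONS = {
    "gpt2": 1024,
    "gpt2-medium": 1024,
    "gpt2-large": 1024,
    "gpt2-xl": 1024,
}


def find_length_violations(config, limits=MODEL_MAX_POSITIONS):
    """Return one message per section whose max_sequence_length is too large.

    Hypothetical helper: sections without a known model or without a
    max_sequence_length entry are skipped.
    """
    problems = []
    for section, settings in config.items():
        if not isinstance(settings, dict):
            continue
        model = settings.get("model")
        limit = limits.get(model)
        value = settings.get("max_sequence_length")
        if limit is None or value is None:
            continue
        if value > limit:
            problems.append(
                f"{section}.max_sequence_length={value} "
                f"exceeds {model} limit {limit}"
            )
    return problems


# Only the relevant keys of the config posted in this issue.
config = {
    "actor_config": {"model": "gpt2-large", "max_sequence_length": 2048},
    "reward_config": {"model": "gpt2-large", "max_sequence_length": 1024},
    "critic_config": {"model": "gpt2-large", "max_sequence_length": 1024},
}

print(find_length_violations(config))
# ['actor_config.max_sequence_length=2048 exceeds gpt2-large limit 1024']
```

With the actor's value lowered to 1024, as in the reply above, the function returns an empty list.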