

zhuweigang opened this issue · 10 comments


  • OS: ubuntu 20.04
  • Python Version 3.9
  • NVIDIA RTX 3090

[2024-06-21 08:33:07,466][run][INFO] - Saving features into cached file /root/DeepKE/example/ee/standard/./data/DuEE/trigger/cached_dev_bert-base-chinese_256
[2024-06-21 08:33:08,084][run][INFO] - ***** Running evaluation *****
[2024-06-21 08:33:08,084][run][INFO] - Num examples = 1498
[2024-06-21 08:33:08,085][run][INFO] - Batch size = 16
[2024-06-21 08:33:08,085][run][INFO] - Mode = dev
Evaluating: 0%| | 0/94 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [2,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [3,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [4,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [5,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [6,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [7,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [8,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [9,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [10,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [11,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [13,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [14,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Evaluating: 0%| | 0/94 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/DeepKE/example/ee/standard/predict.py", line 115, in main
result, eval_pred_list = evaluate(args, model, eval_dataset, tokenizer, labels, pad_token_label_id, mode="dev", device=device)
File "/root/DeepKE/example/ee/standard/run.py", line 219, in evaluate
outputs = model(pad_token_label_id=pad_token_label_id, **inputs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/bert_crf.py", line 89, in forward
loss = self.crf.neg_log_likelihood(crf_logits, crf_mask, crf_labels)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 273, in neg_log_likelihood
gold_score = self._score_sentence(scores, mask, tags)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 258, in _score_sentence
tg_energy = tg_energy.masked_select(mask.transpose(1, 0))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.


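Hello. A RuntimeError: CUDA error: device-side assert triggered usually means that the output dimension of the final classification head does not match the number of labels in the test dataset.
Please check whether the task type set in predict.yaml (i.e. whether the task_name parameter is trigger or role) matches the model you trained. The default task type in predict.yaml is role (evaluation for trigger already runs during training), whereas the default task type in train.yaml is trigger. If you need to evaluate role, change task_name to role in train.yaml, train a separate model, and then run the evaluation.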
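One way to confirm the mismatch described above is to scan the label ids of a batch before they reach the CRF layer. The snippet below is only a minimal sketch: the function name find_out_of_range_labels is hypothetical and not part of DeepKE, and the padding id of -100 is an assumption (it is the usual ignore index of PyTorch's cross-entropy loss, but the value used in your run may differ).

```python
from typing import Iterable, List, Tuple


def find_out_of_range_labels(
    batch_label_ids: Iterable[Iterable[int]],
    num_labels: int,
    pad_token_label_id: int = -100,  # assumption: padding positions carry this id
) -> List[Tuple[int, int, int]]:
    """Return (row, column, label_id) for every label id outside [0, num_labels)."""
    bad = []
    for row, sequence in enumerate(batch_label_ids):
        for col, label_id in enumerate(sequence):
            if label_id == pad_token_label_id:
                continue  # padding is masked out, so it is not a real label
            if not 0 <= label_id < num_labels:
                bad.append((row, col, label_id))
    return bad


# Hypothetical batch: the model was built with 4 labels, but one row contains label id 5.
print(find_out_of_range_labels([[0, 3, -100], [5, 1, 2]], num_labels=4))
# prints [(1, 0, 5)]
```

If the returned list is not empty, the label file used to build the features probably belongs to a different task than the checkpoint that was loaded.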



===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /root/anaconda3/envs/deepke did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda-11.8/lib64/libcudart.so.11.0
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /root/anaconda3/envs/deepke/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
/root/anaconda3/envs/deepke/lib/python3.9/site-packages/hydra/plugins/config_source.py:190: UserWarning:
Missing @Package directive train.yaml in file:///root/DeepKE/example/ee/standard/conf.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/adding_a_package_directive
warnings.warn(message=msg, category=UserWarning)
Traceback (most recent call last):
File "/root/DeepKE/example/ee/standard/predict.py", line 49, in main
args.dev_trigger_pred_file = os.path.join(args.cwd, args.dev_trigger_pred_file) if args.do_pipeline_predict and args.task_name=="role" else None
File "/root/anaconda3/envs/deepke/lib/python3.9/posixpath.py", line 90, in join
genericpath._check_arg_types('join', a, *p)
File "/root/anaconda3/envs/deepke/lib/python3.9/genericpath.py", line 152, in _check_arg_types
raise TypeError(f'{funcname}() argument must be str, bytes, or '
TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType'

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.




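Hello. Judging from the error message, did you run predict.py directly after changing the parameter? After changing task_name to role in train.yaml, you need to run python run.py again to train an event argument extraction model. As noted in the README, the event extraction task requires training models for two stages.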




  1. I changed task_name to role in train.yaml, ran run.py, and then ran predict.py; that is how I got this error.
  2. I did not touch the dev_trigger_pred_file parameter in predict.yaml; it is still ./exp/DuEE/trigger/bert-base-chinese/eval_pred.json, and that file exists.


************** train.yaml ******************

data_name: DuEE # [ACE, DuEE]
model_name_or_path: bert-base-chinese # [bert-base-uncased, bert-base-chinese] english for ace, chinese for duee
#task_name: trigger # [trigger, role]
task_name: role
model_type: bertcrf
do_train: True
do_eval: True
do_predict: False # True for ACE, False for DuEE
labels: ""
config_name: ""
tokenizer_name: ""
cache_dir: ""
evaluate_during_training: True
do_lower_case: True
weight_decay: 0.0
learning_rate: 5e-5
adam_epsilon: 1e-8
per_gpu_train_batch_size: 16
per_gpu_eval_batch_size: 16
gradient_accumulation_steps: 1
max_seq_length: 256
max_grad_norm: 1.0
num_train_epochs: 5
max_steps: 500
warmup_steps: 0
logging_steps: 500
save_steps: 500
eval_all_checkpoints: False
no_cuda: False
n_gpu: 0
overwrite_output_dir: True
overwrite_cache: True
seed: 42
fp16: False
fp16_opt_level: "01"
local_rank: -1
data_dir: "" # parsing in run.py
tag_path: "" # parsing in run.py
output_dir: "" # parsing in run.py
dev_trigger_pred_file: null
test_trigger_pred_file: null

*************** predict.yaml ***************


  • train

data_name: DuEE # [ACE, DuEE]
model_name_or_path: ./exp/DuEE/role/bert-base-chinese
task_name: role # the trigger prediction is done during the training process.
do_train: False
do_eval: True
do_predict: False # True for ACE, False for DuEE

do_pipeline_predict: True
overwrite_cache: True

dev_trigger_pred_file: ./exp/DuEE/trigger/bert-base-chinese/eval_pred.json # change to your pred file of trigger classification
test_trigger_pred_file: ./exp/DuEE/trigger/bert-base-chinese/test_pred.json

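Hello. The error here is TypeError: join() argument must be str, bytes, or os.PathLike object, not 'NoneType', which means that either args.cwd or args.dev_trigger_pred_file is None. Try setting a breakpoint or printing the relevant variables. I reran this in my own environment, and with the default parameters predict.py L49 does not raise an error, so check which variable has the wrong value in your environment.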
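To find out which of the two values is None, a small guard placed in front of the os.path.join call can help. This is just a sketch under assumptions: resolve_pred_file is a hypothetical helper, not a DeepKE function, and the paths in the usage line are taken from the configuration posted above purely for illustration.

```python
import os
from typing import Optional


def resolve_pred_file(cwd: Optional[str], pred_file: Optional[str]) -> str:
    """Join cwd and pred_file, naming any value that is None instead of failing inside join()."""
    candidates = (("cwd", cwd), ("dev_trigger_pred_file", pred_file))
    missing = [name for name, value in candidates if value is None]
    if missing:
        raise ValueError("config value(s) not set: " + ", ".join(missing))
    return os.path.join(cwd, pred_file)


# Both values set: behaves like a plain os.path.join.
print(resolve_pred_file("/root/DeepKE/example/ee/standard",
                        "./exp/DuEE/trigger/bert-base-chinese/eval_pred.json"))

# One value missing: the message says which one.
try:
    resolve_pred_file("/root/DeepKE/example/ee/standard", None)
except ValueError as err:
    print(err)
```

If dev_trigger_pred_file turns out to be the None value, it may be worth checking which YAML file the value is actually read from at run time.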


[2024-06-24 17:34:19,321][deepke.event_extraction.standard.bertcrf.processor_ee][INFO] - LOOKING AT /root/DeepKE/example/ee/standard/./data/DuEE/role/dev_with_pred_trigger.tsv train
[2024-06-24 17:34:19,345][run][INFO] - Creating features from dataset file at /root/DeepKE/example/ee/standard/./data/DuEE/role
[2024-06-24 17:34:19,345][deepke.event_extraction.standard.bertcrf.processor_ee][INFO] - Writing example 0 of 2015
[2024-06-24 17:34:23,558][run][INFO] - Saving features into cached file /root/DeepKE/example/ee/standard/./data/DuEE/role/cached_dev_bert-base-chinese_256
[2024-06-24 17:34:24,415][run][INFO] - ***** Running evaluation *****
[2024-06-24 17:34:24,416][run][INFO] - Num examples = 2015
[2024-06-24 17:34:24,416][run][INFO] - Batch size = 16
[2024-06-24 17:34:24,416][run][INFO] - Mode = dev
Evaluating: 0%| | 0/126 [00:00<?, ?it/s]../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [2,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [3,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [4,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [5,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [6,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [7,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [8,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [9,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [10,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [11,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [12,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [13,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [14,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [15,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Evaluating: 0%| | 0/126 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/DeepKE/example/ee/standard/predict.py", line 124, in main
result, eval_pred_list = evaluate(args, model, eval_dataset, tokenizer, labels, pad_token_label_id, mode="dev", device=device)
File "/root/DeepKE/example/ee/standard/run.py", line 219, in evaluate
outputs = model(pad_token_label_id=pad_token_label_id, **inputs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/root/anaconda3/envs/deepke/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
return forward_call(*args, **kwargs)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/bert_crf.py", line 89, in forward
loss = self.crf.neg_log_likelihood(crf_logits, crf_mask, crf_labels)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 273, in neg_log_likelihood
gold_score = self._score_sentence(scores, mask, tags)
File "/root/DeepKE/src/deepke/event_extraction/standard/bertcrf/crf.py", line 258, in _score_sentence
tg_energy = tg_energy.masked_select(mask.transpose(1, 0))
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
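
As the message above says, CUDA errors are reported asynchronously, so the Python line shown in the traceback is not necessarily the one that failed. A minimal sketch of enabling synchronous launches from Python follows; where exactly to put it (for example at the very top of predict.py) is an assumption, and the only requirement is that it runs before the first CUDA call.

```python
import os

# Make every CUDA kernel launch synchronous so the traceback points at the failing op.
# This has to happen before the first CUDA call; doing it before importing torch is the safest order.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable has been set
```

Setting the variable in the shell for a single run has the same effect; it slows execution down, so it is meant for debugging only.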
