latex-ocr: after switching datasets, training fails with `index_value >= 0 && index_value < input_dims[j]` failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [505] and greater

Question

lm91888 opened this issue a month ago · 12 comments

🔎 Search before asking

I have searched the PaddleOCR Docs and found no similar bug report.
I have searched the PaddleOCR Issues and found no similar bug report.
I have searched the PaddleOCR Discussions and found no similar bug report.

🐛 Bug (problem description)

After switching to a new dataset, training fails with: Error: ../paddle/phi/kernels/funcs/gather.cu.h:60 Assertion index_value >= 0 && index_value < input_dims[j] failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [505] and greater than or equal to 0, but received [0]
max_seq_len is still 512.
errmix.log

🏃‍♂️ Environment

paddlepaddle 2.6.2
paddlepaddle-gpu 3.0.0b1
pandas 2.0.3
python 3.8.18

🌰 Minimal Reproducible Example

W1130 10:25:37.558496 4138159 gpu_resources.cc:119] Please NOTE: device: 0, GPU Compute Capability: 8.6, Driver API Version: 12.4, Runtime API Version: 12.3
W1130 10:25:37.572280 4138159 gpu_resources.cc:164] device: 0, cuDNN Version: 9.0.
Traceback (most recent call last):
  File "tools/train.py", line 269, in <module>
    main(config, device, logger, vdl_writer, seed)
  File "tools/train.py", line 222, in main
    program.train(
  File "/home/xk/zlm/PaddleOCR/tools/program.py", line 354, in train
    preds = model(batch)
  File "/usr/local/anaconda3/envs/zlm/lib/python3.8/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/xk/zlm/PaddleOCR/ppocr/modeling/architectures/base_model.py", line 99, in forward
    x = self.head(x, targets=data)
  File "/usr/local/anaconda3/envs/zlm/lib/python3.8/site-packages/paddle/nn/layer/layers.py", line 1426, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/xk/zlm/PaddleOCR/ppocr/modeling/heads/rec_latexocr_head.py", line 1017, in forward
    kwargs = {"context": encoded_feat, "mask": mask.cast(paddle.bool)}
  File "/usr/local/anaconda3/envs/zlm/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 207, in cast
    return _C_ops.cast(x, dtype)
OSError: (External) CUDA error(719), unspecified launch failure.
[Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointerand accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases canbe found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work willreturn the same error. To continue using CUDA, the process must be terminated and relaunched.] (at ../paddle/phi/backends/gpu/cuda/cuda_info.cc:272)

Answer 1 · 2024-11-30T02:35:14.000Z

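Based on your description and the log, the index_value >= 0 && index_value < input_dims[j] assertion most likely fails because, after the dataset change, an index passed to the model falls outside the range the model supports. This is usually caused by one of the following:

1. Dataset processing issues

Problem: The new dataset may contain labels or inputs that do not match the format the model expects.
Solution:
1. Check the dataset's label files and make sure every sample's dimensions, index values, and sequence length are within the range configured for the model.
  - For example, the error reports that index_value fell outside the [505] bound, possibly because an index in the data (such as a character-to-index mapping) is greater than or equal to the model's vocab_size.
2. If you changed max_seq_len, make sure every sample's sequence length is below that value.
3. Verify that the character dictionary (char_dict_path) matches the new dataset's character set. A mismatch can produce out-of-range indices.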
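
The index check in step 1 can be sketched offline, before any GPU code runs. This is only an illustration under stated assumptions: `find_out_of_range` is a hypothetical helper (not a PaddleOCR function), the labels are assumed to be already converted to integer token ids, and 505 is taken from the bound in the error message and treated here as the size of the table being indexed.

```python
from typing import Iterable, List, Tuple


def find_out_of_range(
    samples: Iterable[List[int]], table_size: int
) -> List[Tuple[int, int, int]]:
    """Return (sample_index, position, token_id) for every id outside [0, table_size)."""
    bad = []
    for sample_idx, token_ids in enumerate(samples):
        for pos, token_id in enumerate(token_ids):
            if not 0 <= token_id < table_size:
                bad.append((sample_idx, pos, token_id))
    return bad


# Toy data (made up): the second sample contains id 505, which is one past
# the last valid row of a 505-row table (valid ids are 0..504).
toy_samples = [[1, 17, 504, 2], [1, 505, 2], [1, 3, 2]]
problems = find_out_of_range(toy_samples, table_size=505)
for sample_idx, pos, token_id in problems:
    print(f"sample {sample_idx}, position {pos}: id {token_id} is out of range")
```

Running the same kind of check over the real encoded labels should show whether any id reaches or exceeds the bound, and in which sample.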

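2. Character dictionary (vocabulary) does not match the dataset

Problem: The new dataset may contain characters that are not defined in the character dictionary, which causes out-of-range indices during parsing.
Solution:
1. Check and update the rec_char_dict_path setting so the dictionary covers every character in the new dataset.
2. If you are unsure what the character set contains, extract all characters from the dataset and generate a new character dictionary.
3. Update character_dict_path in model_config.yml and retrain the model.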
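
Steps 1 and 2 might look roughly like the sketch below. It assumes labels are whitespace-separated token strings and that the dictionary can be loaded into a plain Python set; the actual LaTeX-OCR tokenizer may split text differently, so treat `tokens_missing_from_vocab` and `build_vocab` as hypothetical helpers rather than PaddleOCR APIs.

```python
from collections import Counter
from typing import Iterable, List, Set


def tokens_missing_from_vocab(labels: Iterable[str], vocab: Set[str]) -> Counter:
    """Count label tokens that do not appear in the vocabulary."""
    missing = Counter()
    for label in labels:
        for token in label.split():
            if token not in vocab:
                missing[token] += 1
    return missing


def build_vocab(labels: Iterable[str]) -> List[str]:
    """Collect every distinct token in the labels, sorted for a stable order."""
    return sorted({token for label in labels for token in label.split()})


# Toy vocabulary and labels (made up for illustration).
toy_vocab = {"\\frac", "{", "}", "x", "y"}
toy_labels = ["\\frac { x } { y }", "\\sqrt { x }", "\\sqrt { z }"]
missing = tokens_missing_from_vocab(toy_labels, toy_vocab)
print(missing.most_common())   # tokens the dictionary does not cover
print(build_vocab(toy_labels))  # candidate replacement vocabulary
```

A non-empty result from the first function suggests the dictionary and the dataset disagree, which is the situation described above.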

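3. Configuration file (config) does not match the dataset

Problem: Some properties of the new dataset (such as image size or sequence length) may be incompatible with the configuration.
Solution:
1. Check the data-related parameters in the config file (image_shape, max_seq_len, vocab_size, and so on) and make sure they agree with the new dataset.
2. With max_seq_len set to 512, confirm that no sequence in the dataset is longer than 512.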
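
A length check along the lines of step 2 could be sketched as follows. `find_too_long` is a hypothetical helper, and `extra_tokens=2` is an assumption standing in for special tokens (for example start and end markers) that the training pipeline may add on top of the raw label length; set it to whatever your pipeline actually adds.

```python
from typing import List, Sequence


def find_too_long(
    samples: Sequence[Sequence[int]], max_seq_len: int, extra_tokens: int = 2
) -> List[int]:
    """Return indices of samples whose length plus special tokens exceeds max_seq_len."""
    return [
        idx
        for idx, token_ids in enumerate(samples)
        if len(token_ids) + extra_tokens > max_seq_len
    ]


# Toy data (made up): four sequences of 100, 510, 511 and 600 tokens.
toy_lengths = [[7] * 100, [7] * 510, [7] * 511, [7] * 600]
too_long = find_too_long(toy_lengths, max_seq_len=512)
print(too_long)
```

With the assumed two extra tokens, a 511-token label already exceeds a limit of 512, which is why counting only the raw label length can hide the problem.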

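4. Debugging suggestions

To locate the problem quickly, try the following debugging steps:

Print data samples and index information:
- Before the data is fed to the model, print each sample's shape and index values to confirm whether any index exceeds the configured range.
- Example code: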
```
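def debug_input(data):
    # Assumes `data` is a dict with 'image' and 'label_indices' entries; adapt the keys to your batch layout.
    print(f"Input shape: {data['image'].shape}")
    print(f"Label indices: {data['label_indices']}")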
```
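Check the data loader (DataLoader):
- Review the preprocessing steps and confirm that the data is correctly normalized, cropped, and consistent with the model's input requirements.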
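Validate the dataset with a checking script:
- If your PaddleOCR version provides a dataset validation tool, a command like the following can check the dataset for problems (confirm that the script and option exist in your checkout first):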
```
python tools/rec_tools.py -c configs/rec/rec_latex_config.yml -o Global.check_dataset=True
```
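
To find the specific sample that triggers the failure, one option is to run a check on each sample individually and record failures instead of stopping at the first one. The sketch below assumes nothing about PaddleOCR internals: `locate_bad_samples` and `check_ids` are hypothetical names, and `check_ids` stands in for whatever validation or encoding step applies to your data.

```python
from typing import Callable, Iterable, List, Tuple


def locate_bad_samples(
    samples: Iterable[object], check_sample: Callable[[object], None]
) -> List[Tuple[int, str]]:
    """Run check_sample on each sample and collect (index, error message) for failures."""
    failures = []
    for idx, sample in enumerate(samples):
        try:
            check_sample(sample)
        except (ValueError, IndexError, KeyError) as exc:
            failures.append((idx, f"{type(exc).__name__}: {exc}"))
    return failures


def check_ids(token_ids, table_size=505):
    """Hypothetical check: raise if any id falls outside [0, table_size)."""
    for token_id in token_ids:
        if not 0 <= token_id < table_size:
            raise IndexError(f"id {token_id} not in [0, {table_size})")


# Toy data (made up): the second and third samples contain invalid ids.
failures = locate_bad_samples([[1, 2], [1, 900], [3, -1]], check_ids)
for idx, message in failures:
    print(idx, message)
```

Because CUDA kernels run asynchronously, the Python traceback can point at a later operation (here `cast`) than the one that actually failed, so checking the data on the host side like this tends to be more informative than reading the GPU stack trace.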

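5. Related discussions

The PaddleOCR Issues and Discussions contain some similar error reports that may help:

Related discussion 1: training fails with IndexError: list index out of range
- Problem: Some samples fail during training because box coordinates or the label file format are wrong.
- Solution: Check the format of the training label files and make sure there are no out-of-range or missing values.
Related discussion 2: table recognition raises list index out of range
- Problem: Some data goes out of bounds when boundaries are processed.
- Solution: Check whether the preprocessing code performs sufficient bounds checks on the input data.
Related discussion 3: table recognition fails with IndexError: list index out of range
- Problem: An index goes out of range while labels are parsed.
- Solution: Check that the label files and char_dict_path match.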

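Summary

The log suggests that the problem is mainly a mismatch between the dataset and the dimensions and index range the model expects. Work through the following steps in order:

1. Check that the dataset format and label files meet PaddleOCR's requirements.
2. Make sure char_dict_path is consistent with the dataset's character set.
3. Confirm that max_seq_len and vocab_size are set to reasonable values.
4. Use debugging tools to locate the specific failing sample and fix it.

If the problem persists, please share more of your data processing code or configuration file for further analysis.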

Response generated by 🤖 feifei-bot | chatgpt-4o-latest

Answer 2 · 2024-11-30T03:18:59.000Z

How about trying a different computer?