opendatalab/MinerU

Repeated escape of '<' '>' symbols in html table

Closed this issue · 2 comments

🔎 Search before asking

  • I have searched the MinerU Readme and found no similar bug report.
  • I have searched the MinerU Issues and found no similar bug report.
  • I have searched the MinerU Discussions and found no similar bug report.

Description of the bug

When a parsed table contains '<' or '>' symbols, VLM mode escapes them repeatedly as '&lt;' and '&gt;'.

How to reproduce the bug

The bad-case file is: badcase.pdf

Image

MinerU result:

Image

Operating System Mode

Linux

Operating System Version

Ubuntu 22.04

Python version

3.10

Software version (mineru --version)

2.0.x

Device mode

cuda

Hi @Doge2077! I'm Dosu and I’m helping the MinerU team.

This repeated escaping of < and > in HTML tables in MinerU VLM mode was a known issue in version 2.0.x. The root cause is that the escaping function was applied multiple times to already-escaped content, resulting in &amp;lt; and &amp;gt; instead of the correct &lt; and &gt;. This was fixed in MinerU 2.1.6 by refactoring the VLM backend to use a new function (block_content_to_html) for table HTML conversion, which prevents repeated escaping. The patch also improved table HTML extraction logic.
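To illustrate the root cause described above, here is a minimal sketch using Python's standard `html.escape`. It is not MinerU's code; the helper name `escape_once` and the sample cell text are assumptions chosen only to show what happens when an escaping step runs twice on the same string.

```python
import html

def escape_once(text: str) -> str:
    # Hypothetical helper: escape &, < and > exactly once.
    return html.escape(text, quote=False)

cell = "x < 5 and y > 3"

once = escape_once(cell)    # correct: a single escaping pass
twice = escape_once(once)   # bug pattern: escaping already-escaped text

print(once)   # x &lt; 5 and y &gt; 3
print(twice)  # x &amp;lt; 5 and y &amp;gt; 3
```

The second pass escapes the `&` that the first pass introduced, which is why a browser then renders the literal text `&lt;` instead of `<`.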

To resolve this, upgrade MinerU to at least version 2.1.6. If upgrading is not possible, you can manually apply the patch by updating the VLM backend code to use block_content_to_html for table content conversion and ensuring that only the <table>...</table> portion is extracted for table results. Details of the fix and code changes are in the release PR.

MinerU 2.0.x does not contain this fix, so the issue will persist until you upgrade or patch. If you need help with the manual patch, let me know!
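If neither upgrading nor patching is an option, one possible stopgap is to post-process the table HTML that 2.0.x produces. The sketch below is an assumption, not part of MinerU: `collapse_double_escape` is a hypothetical helper that removes one extra level of escaping from a few common entities.

```python
import re

# Matches entities that were escaped a second time, e.g. "&amp;lt;".
_DOUBLE_ESCAPED = re.compile(r"&amp;(lt|gt|amp|quot);")

def collapse_double_escape(table_html: str) -> str:
    # Hypothetical post-processing step: turn "&amp;lt;" back into "&lt;".
    # Tags and singly escaped entities are left untouched.
    return _DOUBLE_ESCAPED.sub(r"&\1;", table_html)

broken = "<table><tr><td>x &amp;lt; 5</td><td>y &amp;gt; 3</td></tr></table>"
fixed = collapse_double_escape(broken)
print(fixed)
```

Note the trade-off: a cell whose source text literally contains `&lt;` is legitimately stored as `&amp;lt;`, and this helper would alter it too, so it is only reasonable on output known to be affected by the bug.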


It has been resolved in 2.5.