opendatalab/MinerU
A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and e-book extraction.
Python, AGPL-3.0
Issues
- 3
layout.pdf results are inconsistent with the layout model results
#590 opened by Schumpeterx - 0
magic-pdf -p demo1.pdf Illegal instruction
#591 opened by Joyouspeng - 12
Page-level parallelism and modular processing
#570 opened by QIN2DIM - 5
Could a Docker image for the corresponding version be published alongside each release?
#583 opened by DreamTeamWangbowen - 1
[Update request] UniMERNet has been updated; please update accordingly.
#589 opened by CocoaML - 2
Headings in the parsed Markdown are not differentiated into level 1, level 2, and level 3; they all come out the same
#506 opened by wzhiqing - 2
Could you provide a way to use it through a remote IP address plus an API key?
#585 opened by llity - 3
Table recognition in PDFs: how can tables be parsed as plain text rather than stored as images in the md file generated by PDF parsing?
#531 opened by SuperDZ - 3
Unable to handle large files
#576 opened by Sg4Dylan - 3
- 2
Help needed with Docker
#575 opened by meng0423 - 33
Installed version is 0.6.1 instead of 0.7.1
#556 opened by James-Dao - 3
Model preloading parameter error, please help 🆘🆘🆘
#571 opened by FHhui - 2
Show overall recognition progress: how many pages have been converted.
#541 opened by TastSong - 3
Paragraphs in the body text are sometimes dropped.
#569 opened by wooemans - 4
When parsing PDFs with OCR, some PDFs produce garbled text
#558 opened by stormchen-cell - 1
How to output a PDF with bounding boxes drawn around the different elements
#566 opened by jujulovesstudying - 0
Running offline on an intranet always reports that sending a request failed; are there dependencies that must be downloaded manually?
#565 opened by DreamTeamWangbowen - 2
Error: torch version and CUDA version do not match
#563 opened by ynzm233 - 1
Headings in the content_list produced by PDF parsing have only one level
#562 opened by littlexiaoyou - 6
PDF pages that are mostly blank, with only a few characters, cannot be recognized
#559 opened by coocoocooee - 5
Comparison of MinerU and marker PDF parsing capabilities
#551 opened by Sakura4036 - 0
Inference GPU memory usage is very high
#561 opened by pandaominggz - 2
Table caption recognition error
#539 opened by Schumpeterx - 0
Could all models be converted to ONNX? This would greatly reduce engineering dependencies and lower system complexity
#557 opened by ConleyKong - 9
Bus error (core dumped)
#553 opened by CloudAndMist - 0
The WeChat group link can no longer be used to join
#549 opened by duanyu - 6
Most tables are still not parsed after enabling table parsing
#548 opened by mingyonga8 - 1
- 0
Is there documentation on how to use it after deploying with Docker?
#546 opened by singeleaf - 3
Segmentation fault after upgrading to the latest version of magic pdf
#543 opened by randydl - 0
Problem with PDF recognition results in the online demo
#545 opened by X17exe - 2
English portions are detected as garbled text
#538 opened by clareliu1234 - 5
- 1
- 4
Module 'fairscale' does not exist
#535 opened by yang123456he - 0
Formula detection and formula recognition are inaccurate
#537 opened by zhangmain666 - 3
Model preloading
#517 opened by BronyaKaslana06 - 2
Help needed: "kill" is printed during recognition, and then recognition is interrupted
#525 opened by meng0423 - 2
LaTeX recognition errors
#515 opened by Barmaid1076 - 1
Best configuration for magic_pdf_parse_main.py
#516 opened by HaoRenkk123 - 1
The model repeatedly initializes when processing multiple PDFs in a single process, and it does not implement a singleton pattern.
#502 opened by drunkpig - 0
Problems with headings and images
#520 opened by luocongqiu - 1
How to batch-process PDF files in the magic_pdf_parse_main demo
#513 opened by chenliutiao - 0
- 2
Testing with small_ocr.pdf returns an empty parsing result
#509 opened by huhk-sysu - 0
PDF parsing fails with a segmentation fault
#505 opened by audio-github-2020 - 2
- 0
A page header appearing above a table is recognized as body text
#498 opened by kakaxisisan - 0
Images in PDFs are not recognized
#497 opened by Ceceliachenen