ispras/dedoc

Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser)

Python · Apache-2.0

Issues

TabbyPdfError (Exception in thread "main" java.lang.OutOfMemoryError: Java heap space) when parsing documents
#489 opened 8 days ago by FatherOctber
4
Numbering problem when reading a docx file
#494 opened 8 days ago by ValiullinAlbert
2
Practical use cases for example notebooks
#484 opened 8 days ago by nikitaCodeSave
2
Single-threaded dedoc container when parsing many documents
#488 opened a month ago by FatherOctber
1
Error reading a file
#478 opened 2 months ago by ValiullinAlbert
5
Error in bold detection
#479 opened 2 months ago by ValiullinAlbert
1
Tables cells colors
#447 opened 4 months ago by Scoutink
4
LLM compatible json
#429 opened 4 months ago by arslan1510
4
Images
#437 opened 5 months ago by Scoutink
6
Error in page_id detection
#410 opened 6 months ago by ValiullinAlbert
2
Training custom models
#404 opened 7 months ago by ValiullinAlbert
3
Incorrect font size detection
#378 opened 9 months ago by ValiullinAlbert
2
Error while reading a file
#379 opened 9 months ago by ValiullinAlbert
2
Errors in text extraction
#381 opened 9 months ago by ValiullinAlbert
2
Cannot extract tables using PdfTxtlayerReader
#373 opened 10 months ago by ValiullinAlbert
10
Complex html list extraction to xpath
#291 opened a year ago by trompx
2
Information from readers
#240 opened a year ago by NastyBoget
0
Add logical structure extractor from content information
#201 opened 2 years ago by oksidgy
0
add content extractor from PDFs with text layer
#202 opened 2 years ago by oksidgy
0
Add scanned reader for document images
#200 opened 2 years ago by oksidgy
0