ispras/dedoc
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser)
Python · Apache-2.0
Issues
- 4
TabbyPdfError problem (Exception in thread "main" java.lang.OutOfMemoryError: Java heap space) when parsing documents
#489 opened by FatherOctber - 2
Numbering problem when reading a docx file
#494 opened by ValiullinAlbert - 2
Practical use cases for example notebooks.
#484 opened by nikitaCodeSave - 1
- 5
Error reading a file
#478 opened by ValiullinAlbert - 1
Error in bold detection
#479 opened by ValiullinAlbert - 4
Tables cells colors
#447 opened by Scoutink - 4
LLM compatible json
#429 opened by arslan1510 - 6
- 2
Error in page_id detection
#410 opened by ValiullinAlbert - 3
Training custom models
#404 opened by ValiullinAlbert - 2
Incorrect font size detection
#378 opened by ValiullinAlbert - 2
Error while reading a file
#379 opened by ValiullinAlbert - 2
Errors in text extraction
#381 opened by ValiullinAlbert - 10
Cannot extract tables using PdfTxtlayerReader
#373 opened by ValiullinAlbert - 2
Complex html list extraction to xpath
#291 opened by trompx - 0
Information from readers
#240 opened by NastyBoget - 0
- 0
add content extractor from PDFs with text layer
#202 opened by oksidgy - 0
Add scanned reader for document images
#200 opened by oksidgy