document-analysis
There are 89 repositories under the document-analysis topic.
opendatalab/MinerU
A one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction.
UglyToad/PdfPig
Read and extract text and other content from PDFs in C# (port of PDFBox)
AlibabaResearch/AdvancedLiterateMachinery
A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.
tstanislawek/awesome-document-understanding
A curated list of resources for the Document Understanding (DU) topic
Yuliang-Liu/Curve-Text-Detector
This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.
wenwenyu/PICK-pytorch
Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)
jpWang/LiLT
Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)
pandora-analysis/pandora
Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results
CybercentreCanada/assemblyline
AssemblyLine 4: File triage and malware analysis
lazyFrogLOL/llmdocparser
A package for parsing PDFs and analyzing their content using LLMs.
ispras/dedoc
Dedoc is a library (service) for automated document parsing and conversion to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser
masyagin1998/robin
RObust document image BINarization
chriswolfvision/local_adaptive_binarization
Local adaptive image binarization
anisha2102/docvqa
Document Visual Question Answering
mirabdullahyaser/Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit
Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.
aws-samples/amazon-textract-transformer-pipeline
Post-process Amazon Textract results with Hugging Face transformer models for document understanding
monniert/docExtractor
(ICFHR 2020 oral) Code for "docExtractor: An off-the-shelf historical document element extraction" paper
Xyntopia/pydoxtools
Effortlessly extract information from unstructured data with this library, utilizing advanced AI techniques. Compose AI in customizable pipelines and diverse sources for your projects.
ppaanngggg/yolo-doclaynet
YOLO models trained on DocLayNet: power your document intelligence with layout analysis
ZeningLin/ViBERTgrid-PyTorch
An unofficial PyTorch implementation of "Lin et al. ViBERTgrid: A Jointly Trained Multi-Modal 2D Document Representation for Key Information Extraction from Documents. ICDAR, 2021"
JPLeoRX/detectron2-publaynet
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
abdur75648/UTRNet-High-Resolution-Urdu-Text-Recognition
UTRNet: High-Resolution Urdu Text Recognition In Printed Documents (ICDAR'23)
ankanbhunia/AdverseBiNet
Improving Document Binarization via Adversarial Noise-Texture Augmentation (ICIP 2019)
AILab-UniFI/GNN-TableExtraction
Code for ICPR2022 paper: "Graph Neural Networks and Representation Embedding for table extraction in PDF Documents"
aws-solutions/enhanced-document-understanding-on-aws
Enhanced Document Understanding on AWS delivers an easy-to-use web application that ingests and analyzes documents, extracts content, identifies and redacts sensitive customer information, and creates search indexes from the analyzed data.
swapnil-ahlawat/Document_Layout_Analysis-MonkAI
DL models that take a document image file as input and locate the positions of paragraphs, lines, images, etc., with their labels and confidence scores.
muhd-umer/pyramidtabnet
Official PyTorch implementation of PyramidTabNet: Transformer-based Table Recognition in Image-based Documents
CaseDrive/publaynet-models
Trained Detectron2 object detection models for document layout analysis based on PubLayNet dataset
microsoft/synthetic-rag-index
Service to import data from various sources and index it in AI Search. Increases data relevance and reduces final size by 90%+. Useful for RAG scenarios with LLM. Hosted in Azure with serverless architecture.
ihdia/docvisor
An open-source tool for visualisation of outputs of deep-learning models for document analysis tasks such as fully automatic, bounding box and OCR.
huyhoang17/kuzushiji_recognition
[Late Submission] Solution for Kuzushiji recognition (Kaggle competition)
bookalope/InDesign-CEP
Adobe CEP extension for InDesign to use the Bookalope cloud services. You can download the extension from Adobe Exchange.
ad-freiburg/pdftotext-plus-plus
A fast and accurate command line tool for extracting text from PDF files.
bookalope/Bookalope
Everything related to Bookalope and its REST API.
TUWien/ReadModules
CVL/READ Modules including Basic Layout Analysis and Writer Identification/Retrieval
pleb631/PdfDet
PdfDet aims to simplify PDF layout detection tasks for users.