Pinned Repositories
community
Open-source libraries and APIs for building custom preprocessing pipelines for labeling, training, or production machine learning workflows.
pipeline-sec-filings
Preprocessing pipeline notebooks and API supporting text extraction from SEC documents
UNS-MCP
unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
unstructured-api
unstructured-inference
unstructured-ingest
unstructured-js-client
A JavaScript/TypeScript client for the Unstructured Platform API
unstructured-python-client
A Python client for the Unstructured Platform API
unstructured.PaddleOCR
Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices)
Unstructured's Repositories
Unstructured-IO/unstructured
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Unstructured-IO/unstructured-api
Unstructured-IO/unstructured-inference
Unstructured-IO/pipeline-sec-filings
Preprocessing pipeline notebooks and API supporting text extraction from SEC documents
Unstructured-IO/unstructured-python-client
A Python client for the Unstructured Platform API
Unstructured-IO/unstructured-ingest
Unstructured-IO/unstructured-js-client
A JavaScript/TypeScript client for the Unstructured Platform API
Unstructured-IO/unstructured.PaddleOCR
Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and deployment on server, mobile, embedded, and IoT devices)
Unstructured-IO/UNS-MCP
Unstructured-IO/unstructured-api-tools
Unstructured-IO/pipeline-paddleocr
Pipeline for converting PDFs to raw text with PaddleOCR
Unstructured-IO/danswer
Gen-AI Chat for Teams - Think ChatGPT if it had access to your team's unique knowledge.
Unstructured-IO/langchain
⚡ Building applications with LLMs through composability ⚡
Unstructured-IO/pipeline-oer
Pipeline for extracting information from Army OERs
Unstructured-IO/pipeline-template
Unstructured-IO/docs
Documentation for all Unstructured products and libraries
Unstructured-IO/unstructured-mlk-archive-public
Unstructured-IO/unstructured-platform-plugins
Unstructured-IO/base-images
Store Dockerfiles and Packer configs for images to use as a base to build upon
Unstructured-IO/unstructured.pytesseract
A Python wrapper for Google Tesseract
Unstructured-IO/notebooks
Unstructured-IO/azure-ai-hub-gateway-solution-accelerator
Reference architecture that provides a set of guidelines and best practices for implementing a central AI API gateway to empower various line-of-business units in an organization to leverage Azure AI services
Unstructured-IO/model-cards
FedRAMP-formatted model cards
Unstructured-IO/rag-over-hybrid-data-sources
Pipeline from two sources (S3, Elasticsearch) to a RAG database.
Unstructured-IO/.github
Unstructured-IO/js-client-batch
JS Client Batch Processing
Unstructured-IO/aws-blog-post-example
Script to accompany the AWS blog post on unstructured data ETL with the Unstructured Ingest library
Unstructured-IO/pairing-technical-challenge
Pairing Technical Challenge
Unstructured-IO/rag-over-evolving-enterprise-knowledge
Unstructured-IO/wolfi-dev-os
Main package repository for production Wolfi images