text-extraction

There are 357 repositories under the text-extraction topic.

  • adbar/trafilatura

    Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

    Language: Python · Stars: 4.7k
  • miso-belica/sumy

    Module for automatic summarization of text documents and HTML pages.

    Language: Python · Stars: 3.6k
  • unidoc/unipdf

    Golang PDF library for creating and processing PDF files (pure Go)

    Language: Go · Stars: 2.9k
  • Goldziher/kreuzberg

    Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

    Language: Python · Stars: 2.4k
  • chrismattmann/tika-python

    Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

    Language: Python · Stars: 1.6k
  • whitelok/image-text-localization-recognition

    A general list of resources for image text localization and recognition: a collection of papers and implementations on scene text localization and recognition.

  • miso-belica/jusText

    Heuristic-based boilerplate removal tool

    Language: Python · Stars: 795
  • unidoc/unidoc

    This repository has moved! https://github.com/unidoc/unipdf

    Language: Go · Stars: 709
  • ICIJ/datashare

    A self‑hosted search engine for documents. Help us improve Datashare by answering a survey on structured content: https://forms.gle/PYgusFsoBaMyzUec9

    Language: Java · Stars: 657
  • ropensci/pdftools

    Text Extraction, Rendering and Converting of PDF Documents

    Language: C++ · Stars: 538
  • cdown/srt

    A simple library and set of tools for parsing, modifying, and composing SRT files.

    Language: Python · Stars: 522
  • iamarunbrahma/vision-parse

    Parse PDFs into markdown using Vision LLMs

    Language: Python · Stars: 428
  • flairNLP/fundus

    A very simple news crawler with a funny name

    Language: Python · Stars: 401
  • shixzie/nlp

    [UNMAINTAINED] Extract values from strings and fill your structs with nlp.

    Language: Go · Stars: 389
  • pd3f/pd3f

    🏭 PDF text extraction pipeline: self-hosted, local-first, Docker-based

    Language: HTML · Stars: 327
  • py-pdf/benchmarks

    Benchmarking PDF libraries

    Language: Python · Stars: 310
  • Goldziher/html-to-markdown

    HTML to markdown converter

    Language: Python · Stars: 234
  • bookieio/breadability

    Reworked https://www.readability.com/ parsing library (https://mercury.postlight.com/ is now the living alternative)

    Language: HTML · Stars: 205
  • weareprestatech/hotpdf

    hotpdf is a fast PDF parsing library, built on top of pdfminer.six, for extracting text and finding text within PDF documents

    Language: Python · Stars: 196
  • SapienzaNLP/extend

    Entity Disambiguation as text extraction (ACL 2022)

    Language: Python · Stars: 182
  • skylander86/lambda-text-extractor

    AWS Lambda functions to extract text from various binary formats.

    Language: Python · Stars: 177
  • vsymbol/CUTIE

    CUTIE (TensorFlow implementation of Convolutional Universal Text Information Extractor)

    Language: Python · Stars: 157
  • archivesunleashed/aut

    The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

    Language: Scala · Stars: 147
  • sambitdash/PDFIO.jl

    PDF Reader Library for Native Julia.

    Language: Julia · Stars: 134
  • vaites/php-apache-tika

    Apache Tika bindings for PHP: extract text and metadata from documents, images and other formats

    Language: PHP · Stars: 116
  • victorqribeiro/ocr

    Simple app to extract text from pictures using Tesseract

    Language: HTML · Stars: 106
  • lu4p/cat

    Extract text from plaintext, .docx, .odt and .rtf files. Pure Go.

    Language: Go · Stars: 104
  • iamarunbrahma/pdf-to-markdown

    Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

    Language: Python · Stars: 94
  • jmriebold/BoilerPy3

    Python port of the Boilerpipe library

    Language: Python · Stars: 93
  • docwire/docwire

    DocWire SDK: Award-winning modern data processing in C++20. SourceForge Community Choice & Microsoft support. AI-driven processing. Supports nearly 100 data formats, including email boxes and OCR. Boost efficiency in text extraction, web data extraction, data mining, document analysis. Offline processing is possible for security and confidentiality

    Language: C++ · Stars: 92
  • nainiayoub/pdf-text-data-extractor

    PDF text data extraction web app with OCR for scanned documents

    Language: Python · Stars: 88
  • gamemaker1/office-text-extractor

    Yet another library to extract text from MS Office and PDF files

    Language: TypeScript · Stars: 81
  • ckorzen/pdf-text-extraction-benchmark

    A project for benchmarking and evaluating existing PDF extraction tools on their ability to extract the body text from PDF documents, especially scientific articles.

    Language: TeX · Stars: 69
  • JonathanRaiman/wikipedia_ner

    Labeled examples from wiki dumps in Python

    Language: Jupyter Notebook · Stars: 67
  • iscc/mobi

    Python-based software to unpack KindleGen-generated ebooks

    Language: Python · Stars: 66
  • abhinaba-ghosh/any-text

    Get text content from any file

    Language: JavaScript · Stars: 64