pdf-extraction

There are 50 repositories under the pdf-extraction topic.

  • Goldziher/kreuzberg

    Document intelligence framework for Python - Extract text, metadata, and structured data from PDFs, images, Office documents, and more. Built on Pandoc, PDFium, and Tesseract.

    Language:Python2.4k125398
  • 24eme/signaturepdf

    Free open-source web software for signing PDFs (alone or with others), organizing pages, editing metadata, and compressing PDFs.

    Language:JavaScript6341710872
  • pytr-org/pytr

    Use TradeRepublic in the terminal and mass-download all documents.

    Language:Python58427110115
  • ArtifexSoftware/mupdf.js

    JavaScript bindings for MuPDF

  • mateogon/pdf-narrator

    Convert your PDFs and EPUBs into audiobooks effortlessly. Features intelligent text extraction, customizable text-to-speech settings, and efficient processing for low-resource systems.

    Language:Python11821219
  • iamarunbrahma/pdf-to-markdown

    Conversion of PDF documents to structured Markdown, optimized for Retrieval Augmented Generation (RAG) and other NLP tasks. Extract text, tables, and images with preserved formatting for enhanced information retrieval and processing.

    Language:Python94328
  • pcschreiber1/PDF_Extraction-Translation

    Translate many large PDF reports for free using Python.

    Language:Jupyter Notebook332110
  • adobe/pdftools-extract-java-sdk-samples

    This sample project provides a preview of the PDF Extract API. Using the sample project and this documentation, you will easily be able to integrate the PDF Extract API in your own server-side code.

    Language:Java61104
  • aidalinfo/extract-kit

    Powerful PDF data extraction library powered by AI vision models. Transform PDFs into structured, validated data using TypeScript, Zod, and AI providers like Scaleway and Ollama.

    Language:TypeScript6
  • MarkShawn2020/video2ppt

    Extract presentation slides from videos with accurate timestamps

    Language:Shell6
  • anyparser/anyparserjs

    Anyparser Typescript SDK for RAG/ETL Pipelines - File Content Extraction. Supports extraction from various file formats including PDF, Microsoft Office documents, OCR/Image to Text, Audio to Text, and Website to Text.

    Language:TypeScript2100
  • heshiming/paddlefish

    A Python + C implementation for image-based PDF page layout analysis and content extraction.

    Language:C++2100
  • Amartya-007/Pdf-Reader

    An app for easily reading and extracting information from PDFs, or chatting with your PDFs.

    Language:Python10
  • arv-fazriansyah/ekstrak-pdf-kartu-keluarga

    Ekstrak PDF Kartu Keluarga is a React + Vite web application that uses the Google Gemini API to automatically extract data from KK (Kartu Keluarga, the Indonesian family card) documents (PDF or ZIP), display it in an interactive table, and export the results to Excel.

    Language:TypeScript1
  • Aumlo123/pdfdoom

    DOOM in a PDF (as ascii art)

  • billy-enrizky/pdf-extraction

    Scalable PDF Extraction using Multimodal GPT 4o

    Language:Python1
  • bylickilabs/pdfAnalyzer

    PDF Analyzer is an efficient Python tool for the automatic analysis of PDF documents.

    Language:Python1
  • heijul/pdf2gtfs

    A Python tool to extract schedule data from PDF timetables and output it in GTFS.

    Language:Python1190
  • LorysHamadache/pdf2txt-multipage-extractor

    Fast batch tool to extract first-page text from all PDFs in a folder using Python. Optimized with multiprocessing to handle thousands of PDFs efficiently.

    Language:Python1
  • nickchristopherson/duluth-tourism-analysis

    End-to-End Data Pipeline for Tourism Industry Analysis

    Language:HTML1
  • RaghuSharma14/PDF-Reader

    A PDF Reader application powered by AI, allowing users to upload PDF documents and extract meaningful information using advanced NLP models. Built with Streamlit, Transformers, and Langchain, this app provides a seamless interface for interacting with and analyzing PDF content.

    Language:Python1
  • rrayhka/GRI-Extractor

    A tool to automatically extract GRI disclosure codes from corporate sustainability reports, enabling efficient analysis of environmental, social, and governance (ESG) data. Supports English and Indonesian reports.

    Language:Python1
  • souvik03-136/TenderBot

    Task

    Language:Python1
  • tracywong117/extract-info-from-pdf-paper

    This Python script uses pdfminer.six, PyPDF2, and pdf2image to extract information (text, images) from PDF papers.

    Language:Python1111
  • vatsalmehta2001/MLPapers_scraper-summarizer

    A web application that scrapes ML research papers from arXiv and generates summaries using either OpenAI or Claude API.

    Language:Python1
  • cam-rodrigues/fydsync

    FidSync is a professional-grade web tool that helps financial teams extract fund statuses from PDF scorecards and update Excel templates accurately, without manual matching or formatting headaches. Built with Streamlit · PDF + Excel automation · Fuzzy matching · Secure and client-ready

    Language:Python
  • gazelle93/Various-Web-Text-Extraction-Methods

    This project is a command-line tool that extracts text from web pages and PDF files, including scanned documents. It supports various extraction methods. This tool is ideal for data scraping, NLP preprocessing, and content analysis.

    Language:Python
  • iodize6399/wwmai-copper-data

    Historical copper price data from WWMAI circulars. Raw PDFs and structured CSV data tracking electrolytic copper wire rod prices and calculation components.

  • Khanna-Aman/tesseract-invoice-ocr

    Python CLI tool for extracting structured data from scanned invoices using Tesseract OCR. Converts PDF/image invoices to CSV/JSON with vendor details, line items, and totals. Features robust error handling, batch processing, and professional-grade code quality.

    Language:Python
  • matheus-rech/systematic-review-extractor

    AI-powered systematic review data extraction system with zero hallucination guarantee

    Language:Python00
  • MohamedAziz15/MLOps-pipeline

    End-to-End LLMOps Pipeline

    Language:Jupyter Notebook
  • olympus-terminal/data-processing

    Data analysis and processing tools

    Language:Python
  • ozcanmiraay/opsbot

    AI-powered PDF extraction suite for structured insights from contracts, forms, and documents. Built with Streamlit, LangChain, GPT-4o, and PDFPlumber.

    Language:Python
  • RayenMalouche/MCP-PDF-Extractor-server

    A Java-based server leveraging Apache Tika to extract content and metadata from files (PDF, DOCX, TXT, etc.) in a local files-to-extract directory. Supports HTML (with CSS styling) and text extraction, file listing, and metadata retrieval via MCP-compliant tools and REST APIs. Built with Spring Boot, Jetty, and MCP SDK.

    Language:Java
  • sgrimee/waste-calendar-extractor

    Extract waste collection dates for the Luxembourgish commune of Niederanven from PDF calendars and generate iCal files.

    Language:Python
  • Vejandlachakrish/PersonaPrep-Persona-Aligned-Educational-PDF-Extractor

    Extracts and ranks example problems, derivations, and formulas from physics PDFs using PyMuPDF. Fully containerized with Docker, no network access required.

    Language:Python