content-extraction

There are 44 repositories under the content-extraction topic.

  • mendableai/firecrawl-mcp-server

    Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.

    Language: JavaScript
  • graphlit/graphlit-mcp-server

    Model Context Protocol (MCP) Server for Graphlit Platform

    Language: TypeScript
  • currentslab/extractnet

    A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package

    Language: HTML
  • mvasilkov/readability2

    Readability2 converts HTML to plain text.

    Language: TypeScript
  • tuffstuff9/nextjs-pdf-parser

    Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.

    Language: TypeScript
  • gregors/boilerpipe-ruby

    Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles

    Language: Ruby
  • nikitautiu/learnhtml

    Web content extraction using machine learning

    Language: HTML
  • spences10/mcp-jinaai-reader

    🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader

    Language: JavaScript
  • oiwn/dom-content-extraction

    DOM Based Content Extraction via Text Density

    Language: Rust
  • gdamdam/sumo

    Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries and more

    Language: Python
  • pdfix/pdfix_sdk_example_cpp

    Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

    Language: C++
  • bencmc/youtube_video_summarizer

    This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.

    Language: Python
  • timoteostewart/benson

    Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!

    Language: Python
  • LandWhale2/TD-Spider

    Simple web crawler in Go, based on text density

    Language: Go
  • peremenov/seize

    Seize is a light Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader

    Language: HTML
  • zeoagency/mobile-first-indexing-tool

    Mobile First Indexing Tool

    Language: Python
  • helioLJ/youtube-transcript-copier

    Chrome extension to copy YouTube transcripts with AI-friendly features

    Language: JavaScript
  • leroyanders/acrticle-scrapper

    This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…

    Language: Python
  • minarc/godensity

    This repository is an implementation of 📄 DOM-based content extraction via text density. Tested on Korean web pages.

    Language: Go
  • Solrikk/DataDigger

    DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.

    Language: Go
  • arman-bd/www2any

    A web application that scrapes web pages, extracts main content, and uses OpenLLaMA to convert the content into specified formats.

    Language: HTML
  • baughmann/tikara

    The metadata and text content extractor for almost every file type.

    Language: Python
  • amirthfultehrani/Youtube-Transcript-Copier

    A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.

    Language: JavaScript
  • newben420/gdelt_utility

    A web-based utility for fetching, categorizing, summarizing and managing global news and articles using the GDELT 2.0 API. Designed for content creators, news aggregators, and researchers, this tool simplifies access to up-to-date articles with an intuitive UI and customizable configurations.

    Language: JavaScript
  • pdfix/pdfix_sdk_example_node_js

    Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...

    Language: JavaScript
  • rmwkwok/crawler

    Multi-process crawler that extracts main content and sustains itself by extracting more links to crawl.

    Language: Python
  • SbstnErhrdt/node-readability

    Simple node server to extract relevant content from website source code using Mozilla's Readability.js

    Language: JavaScript
  • SvenEichelsheimer/filegazer

    FileGazer - deep file analysis and categorisation

  • TypesetIO/jsuite

    Tools for parsing and manipulating JATS XML documents.

    Language: Python
  • dust-ai-mr/dust-html

    Dust library for HTML processing

    Language: Java
  • rithulkamesh/docproc

    Opinionated and Sophisticated Document Region Analyzer.

    Language: Python
  • Aish-p/WebScraperAPI

    WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.

    Language: Python
  • thorkill/dbce

    Diff Based Content Extraction is a part of my Bachelor Thesis: Joint Approach to Boilerplate Detection in Web Archives

    Language: HTML
  • mrinshad/ChatPDF

    Document processing and querying system built with FastAPI and React. Upload documents and interact with their content using natural language queries powered by Gemini API and Unstructured.io

    Language: JavaScript
  • simonpierreboucher/Crawler

    A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.

    Language: Python
  • mlibre/Deep-Truth

    DeepTruth is your ultimate research buddy 🤖 that uses next-gen AI (via Ollama and Google Generative AI) to dig deep, extract exact quotes, and stitch them into a narrative. No fluff, just facts! 🔍🚀

    Language: JavaScript