content-extraction
There are 44 repositories under the content-extraction topic.
mendableai/firecrawl-mcp-server
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.
graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
currentslab/extractnet
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
mvasilkov/readability2
Readability2 converts HTML to plain text.
tuffstuff9/nextjs-pdf-parser
Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.
gregors/boilerpipe-ruby
Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles
nikitautiu/learnhtml
Web content extraction using machine learning
spences10/mcp-jinaai-reader
🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader
oiwn/dom-content-extraction
DOM Based Content Extraction via Text Density
gdamdam/sumo
Tool that extracts the text from web article URLs and provides word frequencies, entity recognition, automatic summaries, and more
pdfix/pdfix_sdk_example_cpp
Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
bencmc/youtube_video_summarizer
This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.
timoteostewart/benson
Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!
LandWhale2/TD-Spider
Simple web crawler in Go based on text density
peremenov/seize
Seize is a lightweight Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader
zeoagency/mobile-first-indexing-tool
Mobile First Indexing Tool
helioLJ/youtube-transcript-copier
Chrome extension to copy YouTube transcripts with AI-friendly features
leroyanders/acrticle-scrapper
This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…
minarc/godensity
This repository is an implementation of 📄 DOM-based content extraction via text density. Tested on Korean web pages.
Solrikk/DataDigger
DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.
arman-bd/www2any
A web application that scrapes web pages, extracts main content, and uses OpenLLaMA to convert the content into specified formats.
baughmann/tikara
The metadata and text content extractor for almost every file type.
amirthfultehrani/Youtube-Transcript-Copier
A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.
newben420/gdelt_utility
A web-based utility for fetching, categorizing, summarizing and managing global news and articles using the GDELT 2.0 API. Designed for content creators, news aggregators, and researchers, this tool simplifies access to up-to-date articles with an intuitive UI and customizable configurations.
pdfix/pdfix_sdk_example_node_js
Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
rmwkwok/crawler
Multi-process crawler that extracts main content and sustains itself by extracting more links to crawl.
SbstnErhrdt/node-readability
Simple Node server to extract relevant content from website source code using Mozilla's Readability.js
SvenEichelsheimer/filegazer
FileGazer - deep file analysis and categorisation
TypesetIO/jsuite
Tools for parsing and manipulating JATS XML documents.
dust-ai-mr/dust-html
Dust library for HTML processing
rithulkamesh/docproc
Opinionated and Sophisticated Document Region Analyzer.
Aish-p/WebScraperAPI
WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.
thorkill/dbce
Diff Based Content Extraction is a part of my Bachelor Thesis: Joint Approach to Boilerplate Detection in Web Archives
mrinshad/ChatPDF
Document processing and querying system built with FastAPI and React. Upload documents and interact with their content using natural language queries powered by Gemini API and Unstructured.io
simonpierreboucher/Crawler
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
mlibre/Deep-Truth
DeepTruth is your ultimate research buddy 🤖 that uses next-gen AI (via Ollama and Google Generative AI) to dig deep, extract exact quotes, and stitch them into a narrative. No fluff, just facts! 🔍🚀