content-extraction

There are 44 repositories under the content-extraction topic.

mendableai/firecrawl-mcp-server
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.
Language: JavaScript 2.4k 19 20211
graphlit/graphlit-mcp-server
Model Context Protocol (MCP) Server for Graphlit Platform
Language: TypeScript 359 1 021
currentslab/extractnet
A fork of Dragnet that also extracts author, headline, date, and keywords from context, with built-in metadata extraction, all in one package
Language: HTML 293 5 1524
mvasilkov/readability2
Readability2 converts HTML to plain text.
Language: TypeScript 108 9 315
tuffstuff9/nextjs-pdf-parser
Next.js template for seamless PDF parsing using pdf2json and FilePond. Ideal for developers seeking a ready-to-use solution for PDF content extraction in Next.js projects.
Language: TypeScript 63 1 26
gregors/boilerpipe-ruby
Pure Ruby implementation of the Boilerpipe content extraction algorithm, tuned for online articles
Language: Ruby 43 2 15
nikitautiu/learnhtml
Web content extraction using machine learning
Language: HTML 33 5 19
spences10/mcp-jinaai-reader
🔍 Model Context Protocol (MCP) tool for parsing websites using the Jina.ai Reader
Language: JavaScript 30 1 33
oiwn/dom-content-extraction
DOM Based Content Extraction via Text Density
Language: Rust 25 1 62
gdamdam/sumo
Tool that extracts the text from web article URLs and produces word frequencies, entity recognition, automatic summaries, and more
Language: Python 20 2 15
pdfix/pdfix_sdk_example_cpp
Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
Language: C++ 20 4 14
bencmc/youtube_video_summarizer
This repository houses a Python application for extracting YouTube video transcripts and summarizing their content.
Language: Python 14 1 25
timoteostewart/benson
Benson turns a list of URLs into mp3s of the contents of each web page - take control over your reading backlog!
Language: Python 14 2 01
LandWhale2/TD-Spider
Simple web crawler in Go based on text density
Language: Go 13 2 00
peremenov/seize
Seize is a light Node or browser web-page content extractor inspired by Arc90 Readability and Safari Reader
Language: HTML 12 3 01
zeoagency/mobile-first-indexing-tool
Mobile First Indexing Tool
Language: Python 12 2 03
helioLJ/youtube-transcript-copier
Chrome extension to copy YouTube transcripts with AI-friendly features
Language: JavaScript 8 2 00
leroyanders/acrticle-scrapper
This Python-based repository hosts a sophisticated service designed for scraping web articles and converting them into Markdown format. The core functionality of this service includes extracting the main content of articles, such as headlines, key paragraphs, and associated images, and then seamlessly transforming this content into well-structured…
Language: Python 5 2 01
minarc/godensity
This repository is an implementation of 📄 DOM-based content extraction via text density. Tested on Korean web pages.
Language: Go 5 0 00
Solrikk/DataDigger
DataDigger is a powerful and intuitive web application designed to extract and analyze data from web pages.
Language: Go 5 1 00
arman-bd/www2any
A web application that scrapes web pages, extracts main content, and uses OpenLLaMA to convert the content into specified formats.
Language: HTML 4 1 0
baughmann/tikara
The metadata and text content extractor for almost every file type.
Language: Python 4 1 90
amirthfultehrani/Youtube-Transcript-Copier
A userscript that adds a button to YouTube video pages for copying the transcript with or without timestamps.
Language: JavaScript 3 1 00
newben420/gdelt_utility
A web-based utility for fetching, categorizing, summarizing and managing global news and articles using the GDELT 2.0 API. Designed for content creators, news aggregators, and researchers, this tool simplifies access to up-to-date articles with an intuitive UI and customizable configurations.
Language: JavaScript 3 1 00
pdfix/pdfix_sdk_example_node_js
Example project demonstrating how to use PDFix SDK WebAssembly build in Node.js. Make PDF Files Accessible, Extract Data from PDF, Convert PDF to HTML, Fill-in PDF Form, Stamp PDF and more...
Language: JavaScript 3 2 00
rmwkwok/crawler
Multi-process crawler that extracts main content and sustains itself by extracting more links to crawl.
Language: Python 3 3 80
SbstnErhrdt/node-readability
Simple node server to extract relevant content from website source code using Mozilla's Readability.js
Language: JavaScript 3 1 00
SvenEichelsheimer/filegazer
FileGazer - deep file analysing and categorisation
3 1 00
TypesetIO/jsuite
Tools for parsing and manipulating JATS XML documents.
Language: Python 3 12 12
dust-ai-mr/dust-html
Dust library for HTML processing
Language: Java 2 1 00
rithulkamesh/docproc
Opinionated and Sophisticated Document Region Analyzer.
Language: Python 2 1 90
Aish-p/WebScraperAPI
WebScraperAPI is a powerful web application that transforms any website into structured data using the Firecrawl API. It provides an intuitive interface for extracting specific information from websites and converting it into structured formats like JSON and CSV.
Language: Python 10
thorkill/dbce
Diff Based Content Extraction is a part of my Bachelor Thesis: Joint Approach to Boilerplate Detection in Web Archives
Language: HTML 1 1 01
mrinshad/ChatPDF
Document processing and querying system built with FastAPI and React. Upload documents and interact with their content using natural language queries powered by Gemini API and Unstructured.io
Language: JavaScript 00
simonpierreboucher/Crawler
A robust, modular web crawler built in Python for extracting and saving content from websites. This crawler is specifically designed to extract text content from both HTML and PDF files, saving them in a structured format with metadata.
Language: Python 00
mlibre/Deep-Truth
DeepTruth is your ultimate research buddy 🤖 that uses next-gen AI (via Ollama and Google Generative AI) to dig deep, extract exact quotes, and stitch them into a narrative. No fluff, just facts! 🔍🚀
Language: JavaScript
