web-crawler
There are 965 repositories under the web-crawler topic.
mendableai/firecrawl
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
crawlab-team/crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
ssssssss-team/spider-flow
A next-generation crawler platform that defines crawl flows graphically, so crawlers can be built without writing code.
BruceDone/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
adithya-s-k/omniparse
Ingest, parse, and optimize any data format ➡️ from documents to multimedia ➡️ for enhanced compatibility with GenAI frameworks
apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
apache/nutch
Apache Nutch is an extensible and scalable web crawler
sjdirect/abot
Cross-platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
mendableai/firecrawl-mcp-server
Official Firecrawl MCP Server - Adds powerful web scraping to Cursor, Claude and any other LLM clients.
jasonxtn/Argus
The Ultimate Information Gathering Toolkit
xianhu/PSpider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
MarginaliaSearch/MarginaliaSearch
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
Algebra-FUN/WeReadScan
A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs
apache/incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
platonai/PulsarRPA
Automate webpages at scale, scrape web data completely and accurately with high performance, distributed AI-RPA.
postmodern/spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
gildas-lormeau/single-file-cli
CLI tool for saving a faithful copy of a complete web page in a single HTML file (based on SingleFile)
webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
cxcscmu/Craw4LLM
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
VIDA-NYU/ache
ACHE is a web crawler for domain-specific search.
scrapfly/scrapfly-scrapers
Scalable Python web scraping scripts for 40+ popular domains
hyunwoongko/kochat
Open-source Korean chatbot framework
USCDataScience/sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
devflowinc/firecrawl-simple
➖ Stripped down, stable version of firecrawl optimized for self-hosting and ease of contribution. Billing logic and AI features are completely removed. Crawl and convert any website into LLM-ready markdown.
brendonboshell/supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
lefterisloukas/edgar-crawler
The only open-source toolkit that can download SEC EDGAR financial reports and extract textual data from specific item sections into nice & clean structured JSON files.
rivermont/spidy
The simple, easy to use command line web crawler.
commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
infinilabs/crawler
🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)
lewisdonovan/google-news-scraper
Lightweight scraper for Google News
s0rg/crawley
The unix-way web crawler
yields/ant
A web crawler for Go
microfisher/Strong-Web-Crawler
An advanced web crawler based on C#.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page's DOM structure.
duyet/awesome-web-scraper
A collection of awesome web scrapers and crawlers.