web-crawler
There are 880 repositories under the web-crawler topic.
apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
crawlab-team/crawlab
Distributed web crawler admin platform for managing spiders, regardless of language or framework.
ssssssss-team/spider-flow
A next-generation crawler platform that defines crawl flows graphically, so a crawler can be built without writing code.
mendableai/firecrawl
🔥 Turn entire websites into LLM-ready markdown or structured data. Scrape, crawl and extract with a single API.
BruceDone/awesome-crawler
A collection of awesome web crawlers and spiders in different languages
apache/nutch
Apache Nutch is an extensible and scalable web crawler
sjdirect/abot
Cross Platform C# web crawler framework built for speed and flexibility. Please star this project! +1.
xianhu/PSpider
A simple, easy-to-use Python crawler framework. QQ discussion group: 597510560
MarginaliaSearch/MarginaliaSearch
Internet search engine for text-oriented websites. Indexing the small, old and weird web.
apache/incubator-stormcrawler
A scalable, mature and versatile web crawler based on Apache Storm
postmodern/spidr
A versatile Ruby web spidering library that can spider a site, multiple domains, certain links or infinitely. Spidr is designed to be fast and easy to use.
spider-rs/spider
The fastest web crawler written in Rust. Maintained by @a11ywatch.
platonai/PulsarRPA
Automate webpages at scale, scrape web data completely and accurately with high performance, distributed RPA.
webrecorder/browsertrix-crawler
Run a high-fidelity browser-based crawler in a single Docker container
Algebra-FUN/WeReadScan
A crawler that scans purchased books on WeRead (微信读书) and downloads them as local PDFs
hyunwoongko/kochat
Open-source Korean chatbot framework
VIDA-NYU/ache
ACHE is a web crawler for domain-specific search.
USCDataScience/sparkler
Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
brendonboshell/supercrawler
A web crawler. Supercrawler automatically crawls websites. Define custom handlers to parse content. Obeys robots.txt, rate limits and concurrency limits.
rivermont/spidy
The simple, easy to use command line web crawler.
commoncrawl/news-crawl
News crawling with StormCrawler - stores content as WARC
crwlrsoft/crawler
Library for Rapid (Web) Crawler and Scraper Development
infinilabs/crawler
🕷️ An easy-to-use spider written in Golang. (Previously named GOPA.)
microfisher/Strong-Web-Crawler
An advanced web crawler built on C#/.NET, PhantomJS, and Selenium. It can execute JavaScript code, trigger various events, and manipulate the page's DOM structure.
yields/ant
A web crawler for Go
lucasxlu/LagouJob
Data Analysis & Mining for lagou.com
antchfx/antch
Antch, a fast, powerful and extensible web crawling & scraping framework for Go
duyet/awesome-web-scraper
A collection of awesome web scrapers and crawlers.
TurnerSoftware/InfinityCrawler
A simple but powerful web crawler library for .NET
s0rg/crawley
The unix-way web crawler
crawler-commons/crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
xiayouran/Musicer
Aims to bring together music platforms such as NetEase Cloud Music, KuGou, QQ Music, and Kuwo in one place
crawlab-team/crawlab-lite
Lite version of Crawlab, the crawler management platform.
lewisdonovan/google-news-scraper
Lightweight scraper for Google News
Hecate2/Ignareo-ISML-auto-voter
Ignareo the Carillon, a web crawler/spider template of ultimate high concurrency built for leprechauns. Carillons as the best web spiders; Long live the golden years of leprechauns! (ISML=international saimoe; 2022 ISML is last ISML)
elliotxx/zhihu-crawler-people
A simple distributed crawler for Zhihu, with data analysis