crawling

There are 1,167 repositories under the crawling topic.

scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
Language: Python 58.2k 1.8k 3.2k 11k
gocolly/colly
Elegant Scraper and Crawler Framework for Golang
Language: Go 24.6k 327 5581.8k
apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
Language: TypeScript 19.5k 117 1k 1k
codelucas/newspaper
newspaper3k is a news, full-text, and article metadata extraction library for Python 3.
Language: HTML 14.8k 383 6772.1k
lorien/awesome-web-scraping
List of libraries, tools and APIs for web scraping and data processing.
Language: Makefile 7.3k 232 10806
D4Vinci/Scrapling
🕷️ An undetectable, powerful, flexible, high-performance Python library to make Web Scraping Easy and Effortless as it should be!
Language: Python 7.3k 26 21410
apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
Language: Python 6.3k 34 373439
go-rod/rod
A Chrome DevTools Protocol driver for web automation and scraping.
Language: Go 6.2k 49 976411
MontFerret/ferret
Declarative web scraping
Language: Go 5.9k 99 299309
yujiosaka/headless-chrome-crawler
Distributed crawler powered by Headless Chrome
Language: JavaScript 5.6k 115 135409
hakluke/hakrawler
Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
Language: Go 4.8k 62 105532
hardkoded/puppeteer-sharp
Headless Chrome .NET API
Language: C# 3.8k 54 1.6k 461
ai-robots-txt/ai.robots.txt
A list of AI agents and robots to block.
Language: Python 3.1k 39 34126
apache/nutch
Apache Nutch is an extensible and scalable web crawler
Language: Java 3.1k 234 0 1.3k
edoardottt/cariddi
Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more
Language: Go 2.8k 17 69249
transitive-bullshit/awesome-puppeteer
A curated list of awesome puppeteer resources.
2.5k 51 7160
lorien/grab
Web Scraping Framework
Language: Python 2.4k 87 220274
zorlan/skycaiji
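SkyCaiji (蓝天采集器) is a free, open-source crawler system. Data can be collected simply by pointing and clicking to edit rules. It runs locally, on virtual hosts, or on cloud servers, can collect almost every type of web page, integrates seamlessly with all kinds of CMS site builders, publishes data in real time without logging in, and is fully automated with no manual intervention. Among web big-data collection software, it is a fully cross-platform cloud crawler system.
Language: PHP 2k 79 43593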
NateScarlet/holiday-cn
📅🇨🇳 Statutory holiday data, automatically scraped every day from State Council announcements
Language: Python 1.4k 19 24154
roach-php/core
The complete web scraping toolkit for PHP.
Language: PHP 1.4k 18 6977
lorey/mlscraper
🤖 Scrape data from HTML websites automatically by just providing examples
Language: Python 1.4k 16 3291
needleworm/bhban_rpa
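Example code for the book <Work Automation That Finishes Six Months of Work in One Day (생능출판사, 2020)>. The examples are written for people who have never learned Python and cover a range of work-automation areas, from Excel to design, macros, and crawling.
Language: Python 1.1k 6 81.1k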
elixir-crawly/crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
Language: Elixir 1.1k 19 108121
rebrowser/rebrowser-patches
Collection of patches for puppeteer and playwright to avoid automation detection and leaks. Helps to avoid Cloudflare and DataDome CAPTCHA pages. Easy to patch/unpatch, can be enabled/disabled on demand.
Language: JavaScript 1k 23 9457
clemfromspace/scrapy-selenium
Scrapy middleware to handle javascript pages using selenium
Language: Python 940 20 92361
bluet/proxybroker2
The New (auto rotate) Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭
Language: Python 882 14 63118
webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
Language: TypeScript 872 23 372113
scrapinghub/scrapyrt
HTTP API for Scrapy spiders
Language: Python 852 44 95160
iawia002/Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
Language: Python 810 47 53142
MorvanZhou/easy-scraping-tutorial
Simple but useful Python web scraping tutorial code.
Language: Jupyter Notebook 806 41 5545
mishakorzik/AdminHack
today we will hack the admin panel of the site.
Language: Shell 770 23 19131
slotix/dataflowkit
Extract structured data from web sites. Web sites scraping.
Language: Go 690 23 1381
josephlimtech/linkedin-profile-scraper-api
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON.
Language: TypeScript 688 13 38161
cxcscmu/Craw4LLM
Official repository for "Craw4LLM: Efficient Web Crawling for LLM Pretraining"
Language: Python 637 4 956
essandess/isp-data-pollution
ISP Data Pollution to Protect Private Browsing History with Obfuscation
Language: Python 606 39 2952
scrapinghub/spidermon
Scrapy Extension for monitoring spiders execution.
Language: Python 545 72 171100
