crawling
There are 1115 repositories under the crawling topic.
scrapy/scrapy
Scrapy, a fast high-level web crawling & scraping framework for Python.
gocolly/colly
Elegant Scraper and Crawler Framework for Golang
apify/crawlee
Crawlee—A web scraping and browser automation library for Node.js to build reliable crawlers. In JavaScript and TypeScript. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with Puppeteer, Playwright, Cheerio, JSDOM, and raw HTTP. Both headful and headless mode. With proxy rotation.
codelucas/newspaper
newspaper3k: news, full-text, and article metadata extraction in Python 3.
lorien/awesome-web-scraping
List of libraries, tools and APIs for web scraping and data processing.
MontFerret/ferret
Declarative web scraping
yujiosaka/headless-chrome-crawler
Distributed crawler powered by Headless Chrome
go-rod/rod
A Chrome DevTools Protocol driver for web automation and scraping.
apify/crawlee-python
Crawlee—A web scraping and browser automation library for Python to build reliable crawlers. Extract data for AI, LLMs, RAG, or GPTs. Download HTML, PDF, JPG, PNG, and other files from websites. Works with BeautifulSoup, Playwright, and raw HTTP. Both headful and headless mode. With proxy rotation.
hakluke/hakrawler
Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
hardkoded/puppeteer-sharp
Headless Chrome .NET API
apache/nutch
Apache Nutch is an extensible and scalable web crawler
transitive-bullshit/awesome-puppeteer
A curated list of awesome puppeteer resources.
lorien/grab
Web Scraping Framework
zorlan/skycaiji
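SkyCaiji (蓝天采集器) is a free, open-source crawler system. Data is collected by pointing and clicking to edit rules. It runs locally, on a virtual host, or on a cloud server, can collect almost any type of web page, integrates seamlessly with a wide range of CMS site builders, publishes data in real time without logging in, and runs fully automatically with no manual intervention. Among web big-data collection tools, it is a fully cross-platform, cloud-based crawler system.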
edoardottt/cariddi
Take a list of domains, crawl urls and scan for endpoints, secrets, api keys, file extensions, tokens and more
roach-php/core
The complete web scraping toolkit for PHP.
lorey/mlscraper
🤖 Scrape data from HTML websites automatically by just providing examples
NateScarlet/holiday-cn
📅🇨🇳 Statutory holiday data, scraped automatically every day from State Council announcements
ai-robots-txt/ai.robots.txt
A list of AI agents and robots to block.
needleworm/bhban_rpa
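Example code for the book "Work Automation: Finishing Six Months of Work in One Day" (생능출판사, 2020). The examples are written for readers who have never learned Python and cover a range of work-automation areas, from Excel to design, macros, and crawling.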
elixir-crawly/crawly
Crawly, a high-level web crawling & scraping framework for Elixir.
clemfromspace/scrapy-selenium
Scrapy middleware to handle JavaScript pages using Selenium
scrapinghub/scrapyrt
HTTP API for Scrapy spiders
iawia002/Lulu
[Unmaintained] A simple and clean video/music/image downloader 👾
MorvanZhou/easy-scraping-tutorial
Simple but useful Python web scraping tutorial code.
bluet/proxybroker2
The New (auto rotate) Proxy [Finder | Checker | Server]. HTTP(S) & SOCKS 🎭
mishakorzik/AdminHack
Today we will hack the admin panel of the site.
slotix/dataflowkit
Extract structured data from websites. Website scraping.
webrecorder/browsertrix-crawler
Run a high-fidelity browser-based web archiving crawler in a single Docker container
essandess/isp-data-pollution
ISP Data Pollution to Protect Private Browsing History with Obfuscation
josephlimtech/linkedin-profile-scraper-api
🕵️‍♂️ LinkedIn profile scraper returning structured profile data in JSON.
scrapinghub/spidermon
Scrapy extension for monitoring spider execution.
zhuyingda/webster
A reliable high-level web crawling & scraping framework for Node.js.
crawljax/crawljax
Crawljax
Florents-Tselai/WarcDB
WarcDB: Web crawl data as SQLite databases.