Awesome Scrapy

A curated list of awesome packages, articles, and other cool resources from the Scrapy community. Scrapy is a fast, high-level web crawling and scraping framework for Python.


Apps

Visual Web Scraping

  • Portia Visual scraping for Scrapy

Distributed Spider

Scrapy Service

  • scrapyscript Run a Scrapy spider programmatically from a script or a Celery task - no project required.

  • scrapyd A service daemon to run Scrapy spiders

  • scrapyd-client Command line client for Scrapyd server

  • python-scrapyd-api A Python wrapper for working with Scrapyd's API.

  • SpiderKeeper A scalable admin UI for spider services.

  • scrapyrt An HTTP server that provides an API for scheduling Scrapy spiders and making requests with them.
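The tools above share a common deployment workflow: point a project's `scrapy.cfg` at a running Scrapyd instance, then run `scrapyd-deploy` (from scrapyd-client) to package and upload the project. A minimal sketch, assuming Scrapyd listens on its default port and `myproject` is a placeholder project name:

```
[deploy]
url = http://localhost:6800/
project = myproject
```

Once deployed, spiders can be scheduled over Scrapyd's HTTP API (for example its schedule.json endpoint), which is what wrappers like python-scrapyd-api call under the hood.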

Front-End Scrapy Managers

  • Gerapy Distributed Crawler Management Framework Based on Scrapy, Scrapyd, Scrapyd-Client, Scrapyd-API, Django and Vue.js

  • SpiderKeeper Admin UI for Scrapy; an open-source take on Scrapinghub.

  • ScrapydWeb Scrapyd cluster management, Scrapy log analysis & visualization, Basic auth, Auto packaging, Timer Tasks, Email notice, and Mobile UI.

Monitor

  • scrapy-sentry Logs Scrapy exceptions into Sentry

  • scrapy-statsd-middleware Statsd integration middleware for scrapy

  • scrapy-jsonrpc An extension to control a running Scrapy web crawler via JSON-RPC

  • scrapy-fieldstats A Scrapy extension to log items coverage when the spider shuts down

  • spidermon Extension which provides useful tools for data validation, stats monitoring, and notification messages.
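To make the "items coverage" idea behind scrapy-fieldstats concrete, here is a hedged, dependency-free sketch of counting how often each item field is populated. The helper name and sample items are made up; the real extension gathers equivalent statistics from scraped items via Scrapy's signals and logs them when the spider closes.

```python
from collections import Counter

def field_coverage(items):
    """Hypothetical helper: fraction of items in which each field is non-empty.

    Illustrates the kind of report scrapy-fieldstats produces; it is not
    that package's actual code.
    """
    counts = Counter()
    for item in items:
        for field, value in item.items():
            if value not in (None, "", [], {}):
                counts[field] += 1
    total = len(items)
    return {field: n / total for field, n in counts.items()}

items = [
    {"title": "A", "price": 10},
    {"title": "B", "price": None},
    {"title": "", "price": 5},
]
print(field_coverage(items))  # each field is populated in 2 of 3 items
```

A report like this makes silent extraction regressions visible: a selector that breaks mid-crawl shows up as a field whose coverage drops toward zero.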

Avoiding Bans

  • HttpProxyMiddleware A Scrapy middleware that switches the HTTP proxy periodically.

  • scrapy-proxies Processes Scrapy requests using a random proxy from a list to avoid IP bans and improve crawling speed.

  • scrapy-rotating-proxies Use multiple proxies with Scrapy

  • scrapy-random-useragent Scrapy Middleware to set a random User-Agent for every Request.

  • scrapy-fake-useragent Random User-Agent middleware based on fake-useragent

  • scrapy-crawlera Crawlera routes requests through a pool of IPs, throttling access by introducing delays and discarding IPs from the pool when they get banned from certain domains, or have other problems.
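The middlewares above all hook into the same Scrapy extension point: a downloader middleware's `process_request` can rewrite request headers or set `request.meta['proxy']` before the request goes out. A minimal, dependency-free sketch of the idea; the class and the stub request object are illustrative, not any listed package's actual code (real packages such as scrapy-rotating-proxies add ban detection and smarter rotation):

```python
import random

class RotatingRequestMiddleware:
    """Sketch of a downloader middleware that assigns a random
    User-Agent and proxy to every outgoing request."""

    def __init__(self, user_agents, proxies):
        self.user_agents = user_agents
        self.proxies = proxies

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        # Scrapy's built-in HttpProxyMiddleware honours request.meta['proxy'].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # None tells Scrapy to continue processing the request

# Stub standing in for scrapy.Request so the sketch runs on its own.
class FakeRequest:
    def __init__(self):
        self.headers = {}
        self.meta = {}

mw = RotatingRequestMiddleware(["ua-1", "ua-2"], ["http://proxy-a:8080"])
req = FakeRequest()
mw.process_request(req, spider=None)
print(req.headers["User-Agent"], req.meta["proxy"])
```

In a real project the middleware would be enabled via the `DOWNLOADER_MIDDLEWARES` setting and the pools loaded from configuration rather than hard-coded.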

Data Processing

Process Javascript

Other Useful Extensions

  • scrapy-djangoitem Scrapy extension to write scraped items using Django models

  • scrapy-deltafetch Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls

  • scrapy-crawl-once A Scrapy middleware that avoids re-crawling pages already downloaded in previous crawls.

  • scrapy-magicfields Scrapy middleware to add extra fields to items, like timestamp, response fields, spider attributes etc.

  • scrapy-pagestorage A scrapy extension to store requests and responses information in storage service.

  • itemloaders Library to populate items using XPath and CSS with a convenient API.

  • itemadapter Adapter which provides a common interface to handle objects of different types in a uniform manner.

  • scrapy-poet Page Object pattern implementation which enables writing reusable and portable extraction and crawling code.
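scrapy-deltafetch and scrapy-crawl-once rest on the same idea: fingerprint each request and drop it if the fingerprint was recorded by an earlier crawl. A toy, in-memory sketch of that idea; the class is hypothetical, and the real packages persist fingerprints to disk and plug in as Scrapy middlewares rather than as a standalone store:

```python
import hashlib

class CrawlOnceSketch:
    """In-memory stand-in for the persisted fingerprint store that
    scrapy-crawl-once / scrapy-deltafetch keep between crawls."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, url):
        # A stable digest of the URL serves as the request fingerprint here.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def should_crawl(self, url):
        fp = self.fingerprint(url)
        if fp in self.seen:
            return False  # already downloaded in a "previous crawl"
        self.seen.add(fp)
        return True

store = CrawlOnceSketch()
print(store.should_crawl("https://example.com/page/1"))  # True: first visit
print(store.should_crawl("https://example.com/page/1"))  # False: duplicate
```

Persisting the `seen` set between runs is what turns this from ordinary in-run deduplication into incremental crawling.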

Resources

Articles

Exercises

Videos

Books