Awesome Scrapy

A curated list of awesome packages, articles, and other cool resources from the Scrapy community. Scrapy is a fast, high-level web crawling and scraping framework for Python.

Apps

Visual Web Scraping

  • Portia Visual scraping for Scrapy

Distributed Spider

Scrapy Service

  • scrapyscript Run a Scrapy spider programmatically from a script or a Celery task - no project required.

  • scrapyd A service daemon to run Scrapy spiders

  • scrapyd-client Command line client for Scrapyd server

  • python-scrapyd-api A Python wrapper for working with Scrapyd's API.

  • SpiderKeeper A scalable admin UI for spider services

  • scrapyrt An HTTP server that provides an API for scheduling Scrapy spiders and making requests with spiders.
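
As a rough illustration of what the Scrapyd clients above wrap, the sketch below builds (but does not send) a request to Scrapyd's `schedule.json` endpoint. It assumes a Scrapyd instance at `http://localhost:6800`; the project name `myproject`, the spider name `myspider`, and the helper `build_schedule_request` are hypothetical and only used for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(base_url, project, spider):
    """Build a POST request asking Scrapyd to schedule one spider run.

    `base_url`, `project`, and `spider` are placeholders; adjust them to
    match an actual Scrapyd deployment.
    """
    body = urlencode({"project": project, "spider": spider}).encode("utf-8")
    # Supplying `data` makes urllib issue a POST instead of a GET.
    return Request(base_url.rstrip("/") + "/schedule.json", data=body)


# Assumed local Scrapyd address and hypothetical project/spider names.
req = build_schedule_request("http://localhost:6800", "myproject", "myspider")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually submit the job, which only makes sense when a Scrapyd server is running at that address.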

Monitor

Avoid Ban

  • HttpProxyMiddleware A Scrapy middleware that changes the HTTP proxy periodically.

  • scrapy-proxies Processes Scrapy requests using a random proxy from a list to avoid IP bans and improve crawling speed.

  • scrapy-rotating-proxies Use multiple proxies with Scrapy

  • scrapy-random-useragent Scrapy middleware that sets a random User-Agent for every request.

  • scrapy-fake-useragent Random User-Agent middleware based on fake-useragent

  • scrapy-crawlera Crawlera routes requests through a pool of IPs, throttling access by introducing delays and discarding IPs from the pool when they get banned from certain domains, or have other problems.
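
The general idea behind the proxy and User-Agent packages above can be sketched as a small downloader-middleware-style class. This is a minimal sketch, not the code of any listed package: the class name, the constructor arguments, and the `FakeRequest` stand-in are hypothetical. It relies on the Scrapy conventions that a downloader middleware exposes `process_request(request, spider)` and that the proxy for a request is read from `request.meta["proxy"]`. Wiring it into a project's settings is omitted.

```python
import random


class RandomUserAgentAndProxyMiddleware:
    """Hypothetical middleware: picks a random User-Agent and proxy per request."""

    def __init__(self, user_agents, proxies, rng=None):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)
        self.rng = rng or random.Random()

    def process_request(self, request, spider=None):
        # Overwrite the User-Agent header with a randomly chosen value.
        if self.user_agents:
            request.headers["User-Agent"] = self.rng.choice(self.user_agents)
        # Scrapy's built-in proxy handling reads the proxy URL from request.meta.
        if self.proxies:
            request.meta["proxy"] = self.rng.choice(self.proxies)
        # Returning None lets the request continue through the remaining middlewares.
        return None


class FakeRequest:
    """Stand-in for a Scrapy request so the sketch runs without Scrapy installed."""

    def __init__(self):
        self.headers = {}
        self.meta = {}


mw = RandomUserAgentAndProxyMiddleware(
    user_agents=["ua-1", "ua-2"],          # placeholder User-Agent strings
    proxies=["http://127.0.0.1:8080"],     # placeholder proxy URL
)
request = FakeRequest()
result = mw.process_request(request)
```

The listed packages go further than this, for example by detecting bans or retiring dead proxies, so they are usually the better choice for real crawls.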

Data Processing

Process Javascript

Other Useful Extensions

  • scrapy-djangoitem Scrapy extension to write scraped items using Django models

  • scrapy-deltafetch Scrapy spider middleware to ignore requests to pages containing items seen in previous crawls

  • scrapy-crawl-once A Scrapy middleware that avoids re-crawling pages already downloaded in previous crawls.

  • scrapy-magicfields Scrapy middleware to add extra fields to items, such as timestamps, response fields, and spider attributes.

  • scrapy-pagestorage A Scrapy extension that stores request and response information in a storage service.
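
The "skip what was already crawled" idea behind scrapy-deltafetch and scrapy-crawl-once can be sketched with a small persistent set of URL fingerprints. This is only an illustration under simplifying assumptions: the `SeenPages` class is hypothetical, it keys on the raw URL alone, and it uses SQLite from the standard library. The listed packages have their own storage formats and rules for what counts as "seen".

```python
import hashlib
import sqlite3


class SeenPages:
    """Hypothetical store of page fingerprints that survives between crawls."""

    def __init__(self, path=":memory:"):
        # Use a file path instead of ":memory:" to keep state across runs.
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (fp TEXT PRIMARY KEY)")
        self.db.commit()

    @staticmethod
    def fingerprint(url):
        # A fixed-length hash keeps keys compact regardless of URL length.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def should_crawl(self, url):
        row = self.db.execute(
            "SELECT 1 FROM seen WHERE fp = ?", (self.fingerprint(url),)
        ).fetchone()
        return row is None

    def mark_done(self, url):
        self.db.execute(
            "INSERT OR IGNORE INTO seen (fp) VALUES (?)", (self.fingerprint(url),)
        )
        self.db.commit()


seen = SeenPages()
first_visit = seen.should_crawl("https://example.com/page/1")
seen.mark_done("https://example.com/page/1")
second_visit = seen.should_crawl("https://example.com/page/1")
```

In a spider, a check like `should_crawl` would run before a request is scheduled, and `mark_done` after the page has been processed successfully.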

Resources

Articles

Exercises

Video

Book