
The Zipru scraper developed in the Advanced Web Scraping Tutorial.

Primary language: Python

Advanced Web Scraping Tutorial Project

This repository is a companion to the article Advanced Web Scraping: Bypassing captcha, "403 Forbidden," and more. Please refer to the article for further details.

This is a Scrapy web scraper for the fictional Zipru torrent site. It is designed to bypass four distinct anti-scraping mechanisms:

  1. User agent filtering.
  2. Obfuscated JavaScript redirects.
  3. CAPTCHAs.
  4. Header consistency checks.
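
As a rough illustration of how mechanisms 1 and 4 are commonly handled in a Scrapy project, here is a minimal settings sketch. `USER_AGENT`, `DEFAULT_REQUEST_HEADERS`, and `DOWNLOADER_MIDDLEWARES` are standard Scrapy settings, but the values shown are assumptions chosen for illustration and are not copied from this repository. The middleware path in particular is hypothetical. Refer to the article and the code here for the actual approach.

```python
# settings.py (illustrative sketch only; values are assumptions)

# 1. User agent filtering: present a browser-like user agent instead of
#    Scrapy's default one. This particular string is an arbitrary example.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
)

# 4. Header consistency checks: send the other headers that a browser
#    with the user agent above would plausibly send alongside it.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
    "Upgrade-Insecure-Requests": "1",
}

# 2 and 3. Obfuscated JavaScript redirects and CAPTCHAs generally need
#    custom logic, typically in a downloader middleware. The dotted path
#    below is a hypothetical placeholder, not a class from this repository.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.HypotheticalRedirectMiddleware": 600,
}
```

The point of keeping the user agent and the remaining headers together is that a site may compare them: a Chrome user agent paired with headers that Chrome would never send is easy to flag.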

The scraper cannot be run as-is because Zipru is not a real site. The code is otherwise complete, however, and can be adapted to work on other sites.