
scrapy-tutorial + other things

https://docs.scrapy.org/en/latest/intro/tutorial.html

This is my scrapy sandbox. I followed the Scrapy tutorial and then tried other things. I also played with crawling Amazon.com and AllRecipes, but got banned after a few runs.

TODO:

  • Figure out how to get around bot blockers and crawl tudogostoso.com.br
  • Try fake user agents to avoid getting blocked (see the sketch below)
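
A minimal sketch of the fake user-agent idea: a downloader middleware that picks a random agent per request. The agent strings, class name, and module path are illustrative, not taken from this repo.

    import random

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    ]

    class RandomUserAgentMiddleware:
        # Overwrite the User-Agent header on every outgoing request.
        def process_request(self, request, spider):
            request.headers["User-Agent"] = random.choice(USER_AGENTS)

Enable it in settings.py with a priority above Scrapy's built-in UserAgentMiddleware (400), so it runs later and wins, e.g. DOWNLOADER_MIDDLEWARES = {"<project>.middlewares.RandomUserAgentMiddleware": 543}.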

How do I get set up?

Dependencies

  • Python3
  • Scrapy
  • Splash
  • Docker (to run the Splash container)
  • Selenium and geckodriver (firefox webdriver) - for quotes_selenium spider
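
For reference, the pip side of these dependencies in requirements.txt might look like the sketch below (unpinned and assumed; the repo's own file is authoritative):

    scrapy
    scrapy-splash
    selenium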

Set up (Ubuntu)

  1. Install Ubuntu dependencies: sudo ./configure
  2. Create a virtual env (virtualenv -p python3 env) and source it (source env/bin/activate).
  3. Install pip dependencies (pip install -r requirements.txt)
  4. Run Splash on Docker: pull the Splash image (docker pull scrapinghub/splash) and run the container (docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash).
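
Putting the steps together, a typical session looks like this:

    sudo ./configure                        # 1. Ubuntu dependencies
    virtualenv -p python3 env               # 2. create the virtual env...
    source env/bin/activate                 #    ...and source it
    pip install -r requirements.txt         # 3. pip dependencies
    docker pull scrapinghub/splash          # 4. pull the Splash image...
    docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash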

Usage

Implementations of the quotes crawler using Spider and CrawlSpider, with and without Splash.

Run scrapy crawl <spider_name>.

Possible spiders:

  • quotes_crawl
  • quotes_splash
  • quotes_crawl_splash
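
For orientation, a Splash-backed quotes spider typically looks like the minimal sketch below. The class name and selectors are illustrative, and it assumes scrapy-splash is configured in settings.py (SPLASH_URL = "http://localhost:8050" plus its downloader middlewares); the repo's actual spiders may differ.

    import scrapy
    from scrapy_splash import SplashRequest

    class QuotesSplashSpider(scrapy.Spider):
        name = "quotes_splash"
        start_urls = ["https://quotes.toscrape.com/js/"]

        def start_requests(self):
            for url in self.start_urls:
                # Ask Splash to render the page (JavaScript included) before parsing.
                yield SplashRequest(url, self.parse, args={"wait": 1})

        def parse(self, response):
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }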

To save the extracted items to a file, use -o <filename>. Supported extensions include .jl and .json, among others.
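
For example:

    scrapy crawl quotes_crawl -o quotes.jl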

Extra Spiders

  • amazon: crawls products from Amazon.com (searched for laptops)
  • allrecipes: crawls recipes from allrecipes.com.br
  • quotes_selenium: the quotes crawler run through Selenium (see the sketch below)
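
A minimal sketch of what a Selenium-driven quotes spider can look like; the class name and selectors are assumptions, and geckodriver must be on PATH:

    import scrapy
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    class QuotesSeleniumSpider(scrapy.Spider):
        name = "quotes_selenium"
        start_urls = ["https://quotes.toscrape.com/js/"]

        def parse(self, response):
            options = Options()
            options.add_argument("-headless")   # no Firefox window
            driver = webdriver.Firefox(options=options)
            try:
                driver.get(response.url)
                # Feed the JS-rendered HTML back into Scrapy's selectors.
                rendered = scrapy.Selector(text=driver.page_source)
                for quote in rendered.css("div.quote"):
                    yield {
                        "text": quote.css("span.text::text").get(),
                        "author": quote.css("small.author::text").get(),
                    }
            finally:
                driver.quit()

A real spider would reuse one driver across requests (created in __init__ and torn down in closed()) rather than launching Firefox per page, but this keeps the sketch self-contained.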

Note: the amazon and allrecipes crawlers tend to get banned after a few runs.
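
One mitigation worth trying (an assumption, not something the sandbox does yet) is to slow the crawl down in settings.py:

    # Throttling settings that may reduce the chance of a ban.
    DOWNLOAD_DELAY = 2            # seconds between requests to the same domain
    AUTOTHROTTLE_ENABLED = True   # adapt the delay to server response times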

References:

  • Scrapy tutorial: https://docs.scrapy.org/en/latest/intro/tutorial.html