https://docs.scrapy.org/en/latest/intro/tutorial.html
This is my Scrapy sandbox. I followed the Scrapy tutorial and then tried other things: I played with crawling Amazon.com and AllRecipes, but got banned after a few runs.
TODO:
- Discover how to avoid bot-blockers and crawl tudogostoso.com.br
- Try `fake-user-agents` to avoid getting blocked (see the settings sketch below)
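As a sketch of that second TODO item: the `scrapy-fake-useragent` package (an assumption here, not something this repo necessarily uses; any random-User-Agent middleware is wired in similarly) is typically enabled through `settings.py`:

```python
# settings.py -- sketch assuming the scrapy-fake-useragent package
# (pip install scrapy-fake-useragent); untested against this repo.
DOWNLOADER_MIDDLEWARES = {
    # Disable Scrapy's builtin static User-Agent middleware...
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # ...and let the middleware pick a random real-browser
    # User-Agent string for every request instead.
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
}
```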
Requirements:
- Python3
- Scrapy
- Splash
- Docker (to run the Splash container)
- Selenium and geckodriver (Firefox webdriver) - for the `quotes_selenium` spider
Setup:
- Install Ubuntu dependencies: `sudo ./configure`
- Create a virtual env (`virtualenv -p python3 env`) and source it (`source env/bin/activate`).
- Install pip dependencies (`pip install -r requirements.txt`).
- Run Splash on Docker: pull the Splash image (`docker pull scrapinghub/splash`) and run the container (`docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash`).
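Once the container is up, a spider can route requests through it. A minimal sketch, assuming the `scrapy-splash` plugin is installed and its middlewares plus `SPLASH_URL = 'http://localhost:8050'` are configured in `settings.py` (the repo's actual `quotes_splash` spider may differ):

```python
import scrapy
from scrapy_splash import SplashRequest  # assumes: pip install scrapy-splash


class QuotesSplashSpider(scrapy.Spider):
    name = "quotes_splash_sketch"  # hypothetical name; see the spiders listed below

    def start_requests(self):
        # SplashRequest sends the URL to the Splash container (port 8050),
        # which renders the page (including JavaScript) before returning HTML.
        yield SplashRequest(
            "https://quotes.toscrape.com/js/",
            callback=self.parse,
            args={"wait": 0.5},  # give the page time to render
        )

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
```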
Implementations of the quotes crawler using Spider and CrawlSpider, with and without Splash. Run `scrapy crawl <spider_name>`.

Possible spiders (a sketch of the CrawlSpider variant follows the list):
- `quotes_crawl`
- `quotes_splash`
- `quotes_crawl_splash`
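A minimal sketch of the CrawlSpider variant, modeled on the Scrapy tutorial's quotes.toscrape.com site (the selectors come from the tutorial; the repo's actual spider may differ):

```python
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class QuotesCrawlSpider(CrawlSpider):
    name = "quotes_crawl_sketch"  # hypothetical; the real spider is quotes_crawl
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ["https://quotes.toscrape.com/"]

    # Follow every "Next" pagination link and parse each page it reaches.
    rules = (
        Rule(LinkExtractor(restrict_css="li.next"), callback="parse_page", follow=True),
    )

    def parse_start_url(self, response):
        # CrawlSpider rules skip the first page, so parse it explicitly.
        return self.parse_page(response)

    def parse_page(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }
```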
To save the extracted items to a file, use `-o <filename>`, e.g. `scrapy crawl quotes_crawl -o quotes.jl`. Supported extensions include `.jl` and `.json`, among others.
Other spiders:
- `amazon`: crawls products from Amazon.com (searched for laptops)
- `allrecipes`: crawls recipes from allrecipes.com.br
- `quotes_selenium`: runs on Selenium
Note: running the `amazon` and `allrecipes` crawlers can get you banned after a few runs.
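For reference, the core of a Selenium-based crawl: a minimal sketch driving headless Firefox via geckodriver against the JavaScript version of the tutorial site. This is a standalone illustration; the actual `quotes_selenium` spider may integrate Selenium differently.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("-headless")  # run Firefox without opening a window

# geckodriver must be on PATH for this to work.
driver = webdriver.Firefox(options=options)
try:
    # The /js/ variant of the tutorial site renders quotes with JavaScript,
    # which is why a real browser (or Splash) is needed to see them.
    driver.get("https://quotes.toscrape.com/js/")
    for quote in driver.find_elements(By.CSS_SELECTOR, "div.quote"):
        text = quote.find_element(By.CSS_SELECTOR, "span.text").text
        author = quote.find_element(By.CSS_SELECTOR, "small.author").text
        print(f"{text} - {author}")
finally:
    driver.quit()
```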
References:
- Scrapy Tutorial
- Scrapy with Splash on Docker and Splash
- Python Selenium
- Headless Firefox
- Fake-user-agents can help avoid being banned.