docker-scrapy-tor

Scrapy 1.4.0 environment with Tor for anonymous IP routing and Privoxy as the HTTP proxy.

Run:

docker run -it br8kpoint/scrapy-tor

Run a spider:

cd /my/scrapy/project
docker run -it -v "$(pwd)":/opt br8kpoint/scrapy-tor crawl my_spider

Run the Scrapy shell:

cd /my/scrapy/project
docker run -it -v "$(pwd)":/opt br8kpoint/scrapy-tor shell "http://web.to.scrape"

No further Scrapy configuration is needed: the proxy middleware (scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware) is enabled by default and picks up the HTTP proxy address (http://127.0.0.1:8118) that the image sets in the environment.
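To check what the middleware will see, the proxy address can be read from the environment the same way Python's standard library does it. This is a minimal sketch, assuming the image exports the address through the conventional http_proxy variable; the helper name proxy_from_environment is hypothetical and not part of the image or of Scrapy.

```python
import os
from urllib.request import getproxies


def proxy_from_environment(scheme="http"):
    """Return the proxy URL configured for `scheme`, or None if unset.

    getproxies() reads variables such as http_proxy / https_proxy from the
    process environment. The function name here is a hypothetical helper.
    """
    return getproxies().get(scheme)


if __name__ == "__main__":
    # Inside the container this is expected to show the Privoxy address,
    # assuming the image sets http_proxy as described above.
    print("http_proxy in env:", os.environ.get("http_proxy"))
    print("proxy seen by urllib:", proxy_from_environment("http"))
```

Running this inside the container and getting None would suggest the proxy variable is not set, in which case requests would not go through Privoxy and Tor.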

Extending with more requirements:

FROM br8kpoint/scrapy-tor
COPY requirements.txt ./
RUN pip install -r requirements.txt

Extending with MongoDB:

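FROM br8kpoint/scrapy-tor
RUN pip install pymongo==3.2

Build the image, start MongoDB, and link it to the Scrapy container:

docker build -t scrapy-tor-mongo .
docker run -v /path/to/data:/data/db --name mongodb -d mongo
docker run -it --link mongodb:mongodb scrapy-tor-mongo

Read the connection details injected by --link in the Scrapy project settings:

# Scrapy project settings
import os
...
MONGO_HOST = os.environ['MONGODB_PORT_27017_TCP_ADDR']
# Environment values are strings, so convert the port to an integer.
MONGO_PORT = int(os.environ['MONGODB_PORT_27017_TCP_PORT'])
...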
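
The MONGODB_PORT_27017_TCP_* variables exist only when the container is started with --link mongodb:mongodb, so a direct os.environ[...] lookup raises KeyError otherwise. A minimal sketch of a more defensive lookup follows; the helper name mongo_settings and the fallback values localhost and 27017 are assumptions for illustration, not part of the original project.

```python
import os

# Hypothetical fallbacks, used only when the link variables are missing.
DEFAULT_MONGO_HOST = "localhost"
DEFAULT_MONGO_PORT = 27017


def mongo_settings(environ=os.environ):
    """Return (host, port) from the variables that `--link mongodb:mongodb` injects.

    Falls back to the hypothetical defaults above when a variable is absent,
    and converts the port to int because environment values are strings.
    """
    host = environ.get("MONGODB_PORT_27017_TCP_ADDR", DEFAULT_MONGO_HOST)
    port = int(environ.get("MONGODB_PORT_27017_TCP_PORT", DEFAULT_MONGO_PORT))
    return host, port


# Scrapy project settings
MONGO_HOST, MONGO_PORT = mongo_settings()
```

Passing the environment as a parameter keeps the lookup easy to exercise with a plain dictionary, without starting a linked container.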