This is a Scrapy project for scraping R18; it also serves as an example of Scrapy techniques and CI tools from the GitHub Marketplace.
- Python 3.6+
- Scrapy 1.6.0
- Fully tested on Linux; it should also work on Windows, macOS, and BSD
Run docker-compose in the docker folder to start a MongoDB server:

```bash
docker-compose up -d
```

If you want to follow the log messages as well:

```bash
docker-compose up -d && docker-compose logs --follow
```
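For reference, a minimal docker/docker-compose.yml for MongoDB might look like the sketch below; the service name, image tag, and volume path are assumptions, not the project's actual configuration.

```yaml
# Hypothetical minimal compose file for a local MongoDB instance;
# names and paths are illustrative, not this project's real config.
version: "3"
services:
  mongo:
    image: mongo:4
    ports:
      - "27017:27017"        # default MongoDB port
    volumes:
      - ./mongo-data:/data/db  # persist data between restarts
```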
Initialize Postgres for Sentry first:
1. Generate a secret key first:

   ```bash
   docker run --rm sentry config generate-secret-key
   ```
2. Start Redis and Postgres, then use the secret key to initialize the database with `sentry upgrade`:

   ```bash
   docker run --detach \
     --name sentry-redis-init \
     --volume $PWD/redis-data:/data \
     redis

   docker run --detach \
     --name sentry-postgres-init \
     --env POSTGRES_PASSWORD=secret \
     --env POSTGRES_USER=sentry \
     --volume $PWD/postgres-data:/var/lib/postgresql/data \
     postgres

   docker run --interactive --tty --rm \
     --env SENTRY_SECRET_KEY='<secret-key>' \
     --link sentry-postgres-init:postgres \
     --link sentry-redis-init:redis \
     sentry upgrade
   ```
   Then enter the superuser name and password when prompted.
3. Stop and remove the Redis and Postgres containers:

   ```bash
   docker stop sentry-postgres-init sentry-redis-init && docker rm sentry-postgres-init sentry-redis-init
   ```
4. Edit the env files to add the superuser name, password, and database-related information (see the sketch after this list).
5. Start Sentry with docker-compose.yml:

   ```bash
   docker-compose up --detach && docker-compose logs --follow
   ```
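As a sketch, the env files referenced in step 4 might contain entries like the following; the variable names follow the official Sentry Docker image's conventions, but the exact keys this project expects are an assumption.

```
# Illustrative values only; replace with your actual credentials.
SENTRY_SECRET_KEY=<secret-key>     # generated in step 1
SENTRY_DB_USER=sentry
SENTRY_DB_PASSWORD=secret
SENTRY_POSTGRES_HOST=postgres
SENTRY_REDIS_HOST=redis
```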
Pipenv is adopted for virtual environment management. Create the virtual environment and activate it:

```bash
pipenv install && pipenv shell
```
Go to the project root and run:

```bash
cd run && python run.py
```
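For reference, a minimal run/run.py could start the crawl through Scrapy's CrawlerProcess; the spider name below is an assumption, not necessarily the project's actual one.

```python
# Hypothetical sketch of run/run.py: start a spider programmatically.
# The spider name "r18" is an assumption; use the project's real spider name.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def main():
    # Load settings from the project's scrapy.cfg / settings.py
    process = CrawlerProcess(get_project_settings())
    process.crawl("r18")  # spider name registered in the project
    process.start()       # blocks until the crawl finishes


if __name__ == "__main__":
    main()
```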
Run the following command in the docker folder to stop MongoDB and remove its volumes:

```bash
docker-compose down --volumes
```
- SitemapSpider
- Stats Collection
- Requests and Responses
- Item Loader
- Spider Contracts
- Downloading and processing files and images
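As a quick illustration of the SitemapSpider and Item Loader techniques above, a minimal combined sketch might look like this; the sitemap URL, item fields, and selectors are assumptions, not this project's actual spider.

```python
# Minimal sketch combining SitemapSpider and ItemLoader.
# The sitemap URL and fields are illustrative, not the project's real ones.
import scrapy
from scrapy.spiders import SitemapSpider
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class ProductItem(scrapy.Item):
    title = scrapy.Field(output_processor=TakeFirst())
    url = scrapy.Field(output_processor=TakeFirst())


class ExampleSitemapSpider(SitemapSpider):
    name = "example_sitemap"
    sitemap_urls = ["https://example.com/sitemap.xml"]  # assumed URL

    def parse(self, response):
        # ItemLoader collects values, then processors normalize them
        loader = ItemLoader(item=ProductItem(), response=response)
        loader.add_css("title", "h1::text")
        loader.add_value("url", response.url)
        yield loader.load_item()
```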
- [X] Move the zh-to-en page redirection into a downloader middleware (a sketch follows this list)
- [X] Docker configurations for MongoDB
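As a sketch of how the first checked item could work, a downloader middleware might rewrite zh URLs to their en equivalents before the request is sent; the `/zh/` path pattern is an assumption about the target site's URL scheme.

```python
# Hypothetical downloader middleware redirecting zh pages to en.
# The "/zh/" -> "/en/" path pattern is an assumption about the site's URLs.
class ZhToEnRedirectMiddleware:
    def process_request(self, request, spider):
        if "/zh/" in request.url:
            # Returning a new Request makes Scrapy reschedule it
            # in place of the original zh request.
            return request.replace(url=request.url.replace("/zh/", "/en/", 1))
        return None  # continue normal downloader processing
```

Such a middleware would be enabled through the DOWNLOADER_MIDDLEWARES setting in settings.py.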