PidCheck is a generic crawler that extracts data about PIDs (persistent identifiers) from their landing pages and computes basic health metrics for each link. It is built on the Scrapy framework, which does most of the heavy lifting, and it is configured for broad crawling across multiple domains in a polite way by default.
While the project includes a basic non-Redis-backed version, the architecture is designed around a Redis store that both feeds the crawler URLs to check and holds the resulting data for further processing.
To start the crawler together with a Redis instance, run:
docker-compose up
For debugging purposes you can use the separate debug compose file:
docker-compose -f docker-compose.debug.yml up
With this running you can push data into Redis using redis-cli:
src/redis-cli -p 6379 lpush pidcheck:start_urls '{ "pid": "msk0-7250", "url": "https://blog.datacite.org/datacite-hiring-another-application-developer/" }'
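The same push can be done programmatically. The sketch below uses the redis-py client (an assumption; any Redis client will do) with the documented queue key and payload shape:

```python
import json

import redis

# Connect to the Redis started by docker-compose; host and port match
# the REDIS_HOST/REDIS_PORT defaults when run from the host machine.
r = redis.Redis(host="localhost", port=6379)

# Each queue entry is a JSON object carrying the PID and its landing page URL.
seed = {
    "pid": "msk0-7250",
    "url": "https://blog.datacite.org/datacite-hiring-another-application-developer/",
}
r.lpush("pidcheck:start_urls", json.dumps(seed))
```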
The following important settings can be overridden with environment variables; you can also put them in a .env file.
- USER_AGENT - User agent string so sites can identify your bot (default: pidcheck)
- LOG_LEVEL - Any standard Python logging level (default: INFO)
- REDIS_HOST - Host of the Redis instance* (default: redis)
- REDIS_PORT - Port of the Redis instance* (default: 6379)
* Note: when specifying a different Redis, you will want to run only the crawler Docker image and not the bundled Redis one.
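For example, a .env pointing the crawler at an external Redis could look like this (the values are illustrative):

USER_AGENT=pidcheck
LOG_LEVEL=DEBUG
REDIS_HOST=redis.example.org
REDIS_PORT=6379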
The seed queue is the Redis key configured by SEED_URL, in the format pidcheck:start_urls; you can push entries onto it directly with redis-cli as shown above.
For convenience there is also scripts/seed.py, which accepts either a JSON Lines file, with each line being a JSON object:
{ "pid": "msk0-7250", "url": "https://blog.datacite.org/datacite-hiring-another-application-developer/" }
or a CSV file with the columns pid and url.
Example:
python scripts/seed.py myurls.csv
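If you need custom seeding logic, the documented behaviour is easy to replicate. Below is a minimal sketch for the CSV case, not the actual scripts/seed.py, assuming the redis-py client and a pid,url header row:

```python
import csv
import json
import sys

import redis


def seed_from_csv(path, host="localhost", port=6379):
    """Push every pid/url row of a CSV file onto the seed queue."""
    r = redis.Redis(host=host, port=port)
    with open(path, newline="") as fh:
        # Assumes the first row is a header naming the pid and url columns.
        for row in csv.DictReader(fh):
            r.lpush("pidcheck:start_urls",
                    json.dumps({"pid": row["pid"], "url": row["url"]}))


if __name__ == "__main__":
    seed_from_csv(sys.argv[1])
```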
To retrieve the results of the scraping, use the dump.py script to write the data out:
python scripts/dump.py mydata.csv
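The results can also be read straight from Redis. The sketch below assumes redis-py and a results key named pidcheck:items; that key name is a guess, so check scripts/dump.py for the one the project actually uses:

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379)

# Drain the results list; "pidcheck:items" is a hypothetical key name.
while (raw := r.rpop("pidcheck:items")) is not None:
    print(json.loads(raw))
```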
For local development you need Python 3. Install the dependencies with:
pip install -r requirements.txt
It is a regular Scrapy project, so the standard scrapy crawl commands should work.
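For example, assuming the spider is named pidcheck (check the project's spiders module for the actual name):
scrapy crawl pidcheck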