Scrapy Spiderman
A web interface for controlling your Scrapy spiders: start, stop, and monitor your spiders' progress (similar to Scrapinghub).
It uses Django/PostgreSQL for the backend and Celery to run the spiders in the background.
It automatically generates models from your spiders' items, but you'll need to modify your spiders slightly (see below).
Installation
Manual Method
- Install the required Python packages: pip install -r requirements.txt
- Install RabbitMQ Server and start its service.
- Install Celery and start its worker (see the supervisord.conf file for the full command).
- Install PostgreSQL and create a scrapypanel database.
- Start the Django server.
Vagrant
- Install Vagrant.
- cd into the project directory.
- Run: vagrant up

Wait until the command is done. You'll have the web interface accessible at http://localhost:8080
Vagrant Notes
If you use the Vagrant method and want to add one or more directories to the SPIDER_DIRS setting, you'll need to share them from your host machine to the guest VM. Here's an example to add to the Vagrantfile:
# config.vm.synced_folder "C:/Users/Adri/spiders", "/home/vagrant/myspiders"
This will share C:/Users/Adri/spiders into /home/vagrant/myspiders on your VM. So you can now use:
SPIDER_DIRS = ['/home/vagrant/myspiders']
You can then collect spiders as described below. This example is available in the Vagrantfile for you to modify.
Update the Settings
There's currently one required setting: SPIDER_DIRS. It should be a list of directories on your system, each of which can contain one or more Scrapy projects (created with the startproject command).
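For example, a minimal sketch (the paths below are placeholders, and this assumes SPIDER_DIRS is defined in your Django settings module):

# A minimal sketch, assuming SPIDER_DIRS lives in your Django settings;
# the paths are placeholders for directories on your machine.
SPIDER_DIRS = [
    '/home/user/scrapy_projects',  # contains one or more Scrapy projects
    '/srv/crawlers',
]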
Then you'll need to run:
python manage.py collect_spiders
Modify Your Spiders
To create Django models (tables) that store your items, you'll need to add an ITEM_CLASS attribute to your spider class, like so:
class MySpider(CrawlSpider):
    name = ...
    allowed_domains = ...
    ...
    ITEM_CLASS = MyScrapedItem
The ITEM_CLASS is the usual Scrapy Item.
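For reference, MyScrapedItem could be a plain Scrapy Item like the sketch below (the field names are hypothetical; use whatever your spider actually scrapes):

import scrapy

class MyScrapedItem(scrapy.Item):
    # Hypothetical fields; the Django model (table) is generated from these.
    title = scrapy.Field()
    url = scrapy.Field()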