django-spider

a multi-threaded spider with a web interface

Primary language: Python. License: MIT.

screenshots: list of sessions for a site; session detail

dependencies: httplib2, lxml, and django-utils (djutils)

running

first, make sure you pip install the requirements:

pip install httplib2
pip install lxml
pip install -e git+https://github.com/coleifer/django-utils.git#egg=djutils
pip install -e git+https://github.com/coleifer/django-spider.git#egg=spider

add djutils and spider to INSTALLED_APPS in your settings file, then run manage.py syncdb to create their database tables.
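
as a rough sketch, the relevant part of the settings file might look like the excerpt below. the django.contrib entries are assumptions (the admin is listed only because the urlconf in this README mounts it); djutils and spider are the two apps this README asks you to add.

```python
# settings.py (excerpt): a sketch, not the project's actual settings.
# The django.contrib entries are assumed; adjust them to your own project.
INSTALLED_APPS = (
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.admin',  # assumed, since the root urlconf includes admin.site.urls
    'djutils',               # task queue used by the spider
    'spider',                # the spider app itself
)
```

after editing the file, running manage.py syncdb creates the tables for any newly listed apps.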

add spider.urls to your root urlconf:

from django.conf import settings
from django.conf.urls.defaults import *
from django.contrib import admin

admin.autodiscover()

urlpatterns = patterns('',
    url(r'^admin/', include(admin.site.urls)),
    url(r'', include('spider.urls')),
)

make sure the media in the spider app is copied into your static media directory.
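
one way to do the copy, as a minimal sketch: the helper below walks a source directory and copies every file into a destination directory, creating subdirectories as needed. copy_media is a hypothetical helper, and both paths in the commented usage line are placeholders; check where the spider app keeps its media and where your project serves static files from.

```python
import os
import shutil


def copy_media(src_dir, dest_dir):
    """Recursively copy every file under src_dir into dest_dir.

    Returns the sorted list of copied paths, relative to src_dir.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        # Path of the current directory relative to the source root.
        rel = os.path.relpath(root, src_dir)
        target = os.path.normpath(os.path.join(dest_dir, rel))
        if not os.path.isdir(target):
            os.makedirs(target)
        for name in files:
            shutil.copy2(os.path.join(root, name), os.path.join(target, name))
            copied.append(os.path.normpath(os.path.join(rel, name)))
    return sorted(copied)


# Hypothetical usage; both paths are placeholders:
# copy_media('/path/to/spider/media', '/path/to/static/media')
```

existing files with the same name in the destination are overwritten, so point it at a directory reserved for the spider's media.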

start up the task queue:

# assume your cwd is the root dir of virtualenv
export DJANGO_SETTINGS_MODULE=mysite.settings
./bin/python ./src/djutils/djutils/queue/bin/consumer.py start -l ./logs/queue.log -p ./run/queue.pid