Django Scraper was written for Scrapy 0.7. Scrapy 0.8 has since been released with many improvements, several of which make Django Scraper architecturally obsolete. I abandoned this project a while ago.
If you're looking for this kind of functionality, I recommend looking into Celery in combination with the latest version of Scrapy. This would give you a scalable implementation of task-based scraping.
The Django Scraper app integrates the Django web framework with the Scrapy web crawling framework. It was created to simplify
scraping of large websites that contain a variety of data that needs to be extracted in different ways.
As I began working with Scrapy, I found it difficult to manage the complexity of the website that I was trying to scrape,
because the Scrapy architecture requires you to have one spider per domain. This constraint made it difficult for me to structure the
code in a clear and modular way, because the code for all of the scraping tasks had to be in the same spider.
I prefer to think of spiders as having tasks. This makes it easier for me to work on specific spider functionality without
involving all of the other spider tasks.
To work this way, I introduced the concept of a spider task. A spider task is something that a spider has to do, and it produces
either items or other spider tasks.
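The idea can be sketched in plain Python. This is only an illustration of the concept, not the djangoscraper code: the Item and Task classes, the handler functions, the HANDLERS table, and the run loop are all hypothetical names, and nothing here touches Scrapy or the Django database.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Item:
    """A scraped record (hypothetical stand-in for a Scrapy item)."""
    data: dict


@dataclass
class Task:
    """One unit of spider work (hypothetical; not the djangoscraper model)."""
    name: str
    payload: dict = field(default_factory=dict)


def list_categories(task):
    # A task may produce further tasks...
    for category in task.payload["categories"]:
        yield Task("scrape_category", {"category": category})


def scrape_category(task):
    # ...or items.
    yield Item({"category": task.payload["category"]})


# Each task name maps to the function that knows how to carry it out.
HANDLERS = {
    "list_categories": list_categories,
    "scrape_category": scrape_category,
}


def run(first_task):
    """Process tasks until none are left and return the collected items."""
    pending = deque([first_task])
    items = []
    while pending:
        task = pending.popleft()
        for result in HANDLERS[task.name](task):
            if isinstance(result, Task):
                pending.append(result)  # new work for the spider
            else:
                items.append(result)    # finished output
    return items


items = run(Task("list_categories", {"categories": ["books", "music"]}))
print([item.data for item in items])
```

Because each handler only knows about its own task, one piece of scraping logic can be changed without touching the others, which is the modularity the single-spider layout made hard.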
Tasks are stored in the Django database and can be manipulated using the Django admin interface. The Django admin allows you to
add new tasks, view the status of tasks, and filter tasks.
Tasks have properties similar to Scrapy Requests, except that they take multiple URLs through the start_urls property.
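To illustrate the difference, here is a rough sketch of a task that carries several URLs where a Request would carry one. Only start_urls comes from the description above; the TaskSketch class, its other fields, the to_requests helper, and the example URLs are assumptions made for this example and do not reflect the real djangoscraper model.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSketch:
    """Hypothetical task record; only start_urls is taken from the README."""
    name: str
    start_urls: list = field(default_factory=list)
    status: str = "pending"  # assumed field, e.g. for filtering in an admin

    def to_requests(self):
        """Expand the task into one (url, task name) pair per start URL."""
        return [(url, self.name) for url in self.start_urls]


task = TaskSketch(
    name="scrape_category",
    start_urls=["http://example.com/books", "http://example.com/music"],
)
for url, task_name in task.to_requests():
    print(task_name, url)
```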
The Django Scraper app functions like a standard Django application. If you follow a non-Django code organization, you would
install djangoscraper as you would any other Django application.
- Create project structure
    django-admin startproject example
    scrapy-ctl.py startproject scraper
    mv scraper/* example
    rm -R scraper
- Add 'djangoscraper' to INSTALLED_APPS in Django's settings.py
- Set COMMANDS_MODULE to 'djangoscraper.commands' in Scrapy's settings.py
- To access Scrapy from Django, add the following code somewhere in Django's settings.py (it needs import os)
    os.environ.setdefault('SCRAPYSETTINGS_MODULE', 'scraper.settings')
- To access Django from Scrapy, add the following code somewhere in Scrapy's settings.py (it needs import os)
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')
- Create Django project
    django-admin startproject {django_project_name}
- Move scraper into Django project
    mv {scraper_project_dir}/* {django_project_name}
- Add 'djangoscraper' to INSTALLED_APPS in Django's settings.py
- Set COMMANDS_MODULE to 'djangoscraper.commands' in Scrapy's settings.py
- To access Scrapy from Django, add the following code somewhere in Django's settings.py (it needs import os)
    os.environ.setdefault('SCRAPYSETTINGS_MODULE', '{scraper_project_dir}.settings')
- To access Django from Scrapy, add the following code somewhere in Scrapy's settings.py (it needs import os)
    os.environ.setdefault('DJANGO_SETTINGS_MODULE', 'settings')
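One detail of the two settings lines is worth knowing: os.environ.setdefault only fills in a variable that is not set yet, so a value exported in the shell takes precedence over the one in settings.py. The short sketch below shows that behaviour using only the standard library. The module name othersite.settings is made up for the demonstration.

```python
import os

# Simulate a value that was already exported before the settings file ran
# (othersite.settings is a hypothetical module name).
os.environ["DJANGO_SETTINGS_MODULE"] = "othersite.settings"

# setdefault leaves the existing value alone...
os.environ.setdefault("DJANGO_SETTINGS_MODULE", "settings")

# ...and only fills in variables that are still unset.
os.environ.pop("SCRAPYSETTINGS_MODULE", None)
os.environ.setdefault("SCRAPYSETTINGS_MODULE", "scraper.settings")

django_settings = os.environ["DJANGO_SETTINGS_MODULE"]
scrapy_settings = os.environ["SCRAPYSETTINGS_MODULE"]
print(django_settings)
print(scrapy_settings)
```

If the wrong settings module seems to be loaded, checking for a stale environment variable in the shell is a reasonable first step.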