
Innovation Practice project: a Douban crawler

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Douban Crawler


Introduction

Douban Crawler is a Scrapy project that crawls movie and book information from https://douban.com.

Architecture

The architecture of this project is as follows:

* The required data for a movie consists of:

  1. Movie Name
  2. Director
  3. Release Time
  4. Country

* The review data consists of:

  1. Movie Name
  2. Review Title
  3. Review Author
  4. Review Content
  5. Upvote Count
  6. Downvote Count
  7. Rating
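
The two record types listed above could be modelled as plain Python dataclasses, which recent Scrapy versions also accept as items. This is only a minimal sketch: the class names, field names, field types, and sample values are assumptions made for illustration, not the project's actual item definitions.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class MovieItem:
    """One movie record (class and field names are illustrative)."""
    name: str
    director: str
    release_time: str  # kept as raw text; the page format is not specified
    country: str


@dataclass
class ReviewItem:
    """One review record, linked to its movie by the movie name."""
    movie_name: str
    title: str
    author: str
    content: str
    up_count: int = 0               # "up" votes on the review
    down_count: int = 0             # "down" votes on the review
    rating: Optional[float] = None  # None when the reviewer gave no rating


# Placeholder values only; no real Douban data is shown here.
movie = MovieItem(name="Example Movie", director="Example Director",
                  release_time="2000-01-01", country="Example Country")
review = ReviewItem(movie_name=movie.name, title="Example title",
                    author="example_user", content="Example text.",
                    up_count=3, down_count=1, rating=4.0)

# asdict() gives a plain dict that is easy to serialize or store.
print(asdict(movie))
print(asdict(review))
```

Repeating the movie name in each review keeps the two record types joinable without a separate identifier, which matches the field lists above.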

The graph below shows the underlying architecture of scrapy-redis:

Features

  • Distributed Crawling: Because the volume of data is very large, distributed crawling is necessary for our project.

  • Robustness: Douban has an anti-robot scheme (e.g. Crawl-delay: 5), so strategies such as changing the user agent and the proxy IP are essential to our crawling practice.
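
Both features could be configured roughly as sketched below. The scrapy-redis setting names and the `request.meta["proxy"]` key are standard, but everything else is an assumption for illustration: the Redis URL, the user-agent strings, the proxy addresses, and the `RotatingIdentityMiddleware` class are hypothetical and are not taken from this repository.

```python
import random

# --- settings.py fragment (all values are placeholders) ---
# Share the request queue and the duplicate filter through Redis so that
# several crawler processes can cooperate on one crawl.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True              # keep the queue between runs
REDIS_URL = "redis://localhost:6379"  # assumed local Redis instance
DOWNLOAD_DELAY = 5                    # mirrors the Crawl-delay: 5 mentioned above

# Hypothetical pools to rotate through; replace with real values.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]
PROXIES = ["http://127.0.0.1:8001", "http://127.0.0.1:8002"]


# --- middlewares.py fragment ---
class RotatingIdentityMiddleware:
    """Downloader middleware that picks a random user agent and proxy per request."""

    def __init__(self, user_agents=USER_AGENTS, proxies=PROXIES):
        self.user_agents = list(user_agents)
        self.proxies = list(proxies)

    def process_request(self, request, spider):
        # Overwrite the User-Agent header for this request only.
        request.headers["User-Agent"] = random.choice(self.user_agents)
        if self.proxies:
            # Scrapy's built-in HttpProxyMiddleware reads meta["proxy"].
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

To take effect, such a middleware would be registered in the `DOWNLOADER_MIDDLEWARES` setting under the project's own module path. Picking a fresh identity per request is the simplest rotation policy; a production crawler would likely also drop proxies that fail repeatedly.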

License

GNU General Public License v3.0 (GPL-3.0).