A simple web spider framework written in Python. It requires Python 3.5+.
- Supports a multi-threading crawling mode (using threading and requests)
- Supports a distributed crawling mode (using threading, requests and redis)
- Supports crawling through proxies (using threading and queue)
- Defines utility functions and classes, for example UrlFilter and get_string_num
- Fewer lines of code, so it is easier to read, understand and extend
- utilities module: defines utility functions and classes for the spider
- instances module: defines the fetcher, parser and saver classes for the multi-threading spider
- concurrent module: defines the WebSpiderFrame for the multi-threading spider and the distributed spider
Procedure of the multi-threading spider:
①: Fetcher gets a url from UrlQueue and makes a request based on this url
②: The result of ① is put into HtmlQueue so that Parser can get it
③: Parser gets an item from HtmlQueue and parses it to get new urls and items to save
④: The new urls are put into UrlQueue so that Fetcher can get them
⑤: The items to save are put into ItemQueue so that Saver can get them
⑥: Saver gets an item from ItemQueue and saves it to the filesystem or a database
⑦: Proxieser gets proxies from the web or a database and puts them into ProxiesQueue
⑧: Fetcher gets proxies from ProxiesQueue if needed and makes requests through these proxies
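The queue hand-offs in steps ① to ⑥ can be illustrated with a minimal sketch built only on the standard library. This is not the framework's own code. The function name `crawl`, the `FAKE_WEB` dictionary (which stands in for real HTTP requests) and the shutdown sentinel are all assumptions made for illustration, and the proxy steps ⑦ and ⑧ are left out.

```python
import queue
import threading

# Hypothetical in-memory "web": url -> (page text, links found on that page).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def crawl(start_url, web, fetcher_count=2):
    url_queue = queue.Queue()    # UrlQueue: urls waiting to be fetched
    html_queue = queue.Queue()   # HtmlQueue: fetched pages waiting to be parsed
    item_queue = queue.Queue()   # ItemQueue: parsed items waiting to be saved
    saved = []                   # stands in for a filesystem or database
    seen = {start_url}           # urls already queued (only the parser updates it)
    stop = object()              # sentinel that tells a worker to exit

    def fetcher():
        while True:
            url = url_queue.get()                  # step 1
            if url is stop:
                break
            # A real fetcher would issue an HTTP request here.
            html_queue.put((url, web.get(url)))    # step 2

    def parser():
        while True:
            entry = html_queue.get()               # step 3
            if entry is stop:
                break
            url, page = entry
            if page is not None:
                text, links = page
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        url_queue.put(link)        # step 4
                item_queue.put((url, text))        # step 5
            # Mark the url finished only after its children are queued,
            # so url_queue.join() cannot return while work is pending.
            url_queue.task_done()

    def saver():
        while True:
            item = item_queue.get()                # step 6
            if item is stop:
                break
            saved.append(item)
            item_queue.task_done()

    threads = [threading.Thread(target=fetcher) for _ in range(fetcher_count)]
    threads += [threading.Thread(target=parser), threading.Thread(target=saver)]
    for thread in threads:
        thread.start()

    url_queue.put(start_url)
    url_queue.join()     # every url has been fetched and parsed
    item_queue.join()    # every item has been saved

    for _ in range(fetcher_count):
        url_queue.put(stop)
    html_queue.put(stop)
    item_queue.put(stop)
    for thread in threads:
        thread.join()
    return sorted(saved)


if __name__ == "__main__":
    for url, text in crawl("http://example.com/", FAKE_WEB):
        print(url, text)
```

In this sketch a url counts as finished only once the parser has handled its page, which is why the fetcher never calls `task_done()` itself.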
Procedure of the distributed spider:
Similar to the multi-threading spider. The only difference is that urls are fetched from redis instead of an in-process queue.
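As a rough sketch of that difference, the url queue can be backed by a redis list and a redis set shared by all worker machines. The class `RedisUrlQueue` and the key names are hypothetical and are not taken from this project. `FakeRedis` is an in-memory stand-in so the example runs without a server. It mimics only the `lpush`, `rpop` and `sadd` commands, which a real redis-py client also provides (a real client returns bytes unless it is created with `decode_responses=True`).

```python
from collections import deque


class FakeRedis:
    """In-memory stand-in for the three redis commands used below."""

    def __init__(self):
        self._lists = {}
        self._sets = {}

    def lpush(self, key, *values):
        items = self._lists.setdefault(key, deque())
        for value in values:
            items.appendleft(value)
        return len(items)

    def rpop(self, key):
        items = self._lists.get(key)
        return items.pop() if items else None

    def sadd(self, key, *values):
        members = self._sets.setdefault(key, set())
        added = sum(1 for value in set(values) if value not in members)
        members.update(values)
        return added


class RedisUrlQueue:
    """Hypothetical url queue shared between spiders through redis."""

    def __init__(self, client, key="spider:urls", seen_key="spider:seen"):
        self.client = client
        self.key = key
        self.seen_key = seen_key

    def put(self, url):
        # sadd returns 0 when the url is already known, so duplicates are skipped.
        if self.client.sadd(self.seen_key, url) == 0:
            return False
        self.client.lpush(self.key, url)
        return True

    def get(self):
        # lpush + rpop gives first-in, first-out order; None means the list is empty.
        return self.client.rpop(self.key)


if __name__ == "__main__":
    urls = RedisUrlQueue(FakeRedis())
    urls.put("http://example.com/")
    urls.put("http://example.com/a")
    urls.put("http://example.com/")   # duplicate, ignored
    print(urls.get(), urls.get(), urls.get())
```

With a running server, `FakeRedis()` could be swapped for a redis-py client, assuming that package is installed. Each Fetcher would then call `get()` on the shared queue instead of reading from a local UrlQueue.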
Installation: the first method is recommended
(1) Copy the "spider" directory into your project directory, then import spider
(2) Install spider into your Python environment using python3 setup.py install
For usage examples, see test.py