This is a Python package for web crawling.
Last month I posted the first version of the package. I soon found it intolerable to keep waiting between requests just to avoid being redirected by weibo.cn, so I decided to use proxies to disguise the package's HTTP requests.
Somewhat embarrassingly, I then found that the standard-library HTTP module urllib2 is out of date: Python developers now generally use requests, which builds on urllib3, for web crawling. So in this version I rewrote the crawler with requests.
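For illustration, a proxied request made with requests looks roughly like this; the proxy address and User-Agent string below are placeholders, not values used by the package:

```python
# Minimal example of a proxied request with requests.
# The proxy address and User-Agent header are placeholders.
import requests

proxies = {"http": "http://10.0.0.1:8080"}  # replace with a working proxy
headers = {"User-Agent": "Mozilla/5.0"}     # disguise the client

resp = requests.get("http://weibo.cn", headers=headers,
                    proxies=proxies, timeout=10)
print(resp.status_code)
```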
The package has five modules (a sketch of how they fit together follows the list):
- crawler: Functions for crawling web pages, starting from a specified URL.
- weiboParser: Functions for parsing an individual page into weibo data items.
- weiboLogin: Functions and classes for the weibo login procedure (weibo.cn only for now).
- proxies: Functions and data structures for proxies, such as a proxy pool.
- DB: Just two functions, init_DB() and write_DB(data_list).
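To make the division of labor concrete, here is a rough sketch of one crawl cycle. Apart from init_DB() and write_DB(data_list), which are named above, every signature below (login(), get_proxy(), crawl(), parse()) is a hypothetical illustration, not the package's actual API:

```python
# Hypothetical wiring of the five modules; only init_DB() and
# write_DB(data_list) are documented above, the rest is illustrative.
import crawler
import weiboParser
import weiboLogin
import proxies
import DB

DB.init_DB()                                    # prepare the database
session = weiboLogin.login("user", "password")  # hypothetical login helper
proxy = proxies.get_proxy()                     # hypothetical pool lookup
html = crawler.crawl("http://weibo.cn/u/12345", # placeholder start URL
                     session=session, proxy=proxy)
data_list = weiboParser.parse(html)             # extract weibo items
DB.write_DB(data_list)                          # persist them
```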
The package has only a few dependencies.
On Ubuntu you can install them with the following commands:
sudo apt-get install python-bs4
sudo apt-get install python-lxml
sudo apt-get install python-requests
Alternatively, you can find the bs4, lxml, and requests sources on PyPI.
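If you prefer pip, the same dependencies are available there (note that bs4 is published on PyPI as beautifulsoup4):
pip install beautifulsoup4 lxml requests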
- You can run the package by executing the command
python crawler2
in the directory where the package is located.
- You may need to modify some config arguments in __init__.py.
- The parser module (weiboParser) may need to be rewritten for your own use case; one possible approach is sketched below.
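If you do rewrite the parser, a minimal sketch with bs4 could look like the following. The div.c / span.ctt selectors are my assumption about weibo.cn's mobile markup, so verify them against the live pages before relying on this:

```python
# Minimal, hypothetical parser sketch; the class names "c" and "ctt"
# are assumptions about weibo.cn's markup, not part of this package.
from bs4 import BeautifulSoup

def parse_page(html):
    soup = BeautifulSoup(html, "lxml")
    data_list = []
    for post in soup.find_all("div", class_="c"):
        text = post.find("span", class_="ctt")
        if text is not None:
            data_list.append(text.get_text(strip=True))
    return data_list
```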