weibo-crawler2

A Python tool package for crawling Weibo data from weibo.cn.

crawler2

    In general, this is a Python tool package for web crawling.
    Last month I posted the first version of the package. I then found it intolerable to keep waiting between requests to avoid being redirected by weibo.cn, so I decided to route the package's HTTP requests through proxies to disguise their origin.
    To my embarrassment, I also found that the standard-library HTTP module urllib2 is out of date for this job. Python developers tend to use requests, which wraps urllib3, for web crawling, so I rewrote the crawler with requests in this version.
    The package has 5 modules:

  • crawler: Functions for crawling web pages, starting from a specified URL.
  • weiboParser: Functions for parsing an individual page to extract Weibo data items.
  • weiboLogin: Functions and classes for the Weibo login procedure (weibo.cn only for now).
  • proxies: Functions and data structures for proxies, such as a proxy pool.
  • DB: Just 2 functions: init_DB() and write_DB(data_list).
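
To illustrate the idea of a proxy pool mentioned above, here is a minimal sketch. It is not the code in the proxies module: the class name ProxyPool, its methods, and the example addresses are all hypothetical, and it is written for Python 3 using only the standard library. It hands out proxies in round-robin order and returns them in the dictionary shape that requests accepts through its proxies argument.

```python
from collections import deque


class ProxyPool:
    """Hypothetical round-robin pool of proxy addresses (illustration only)."""

    def __init__(self, addresses):
        self._queue = deque(addresses)
        if not self._queue:
            raise ValueError("at least one proxy address is required")

    def __len__(self):
        return len(self._queue)

    def next_proxies(self):
        """Return the next proxy as a dict usable as requests' `proxies` argument."""
        if not self._queue:
            raise LookupError("proxy pool is empty")
        address = self._queue[0]
        # Move the proxy we just picked to the back of the queue.
        self._queue.rotate(-1)
        return {"http": address, "https": address}

    def discard(self, address):
        """Remove a proxy that stopped working; unknown addresses are ignored."""
        try:
            self._queue.remove(address)
        except ValueError:
            pass
```

A crawler could then call something like requests.get(url, proxies=pool.next_proxies(), timeout=10) for each page, and call pool.discard(address) when a proxy keeps failing. The timeout value here is only an example.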

Dependencies

    This package has only a few dependencies. You can install them with the following commands on Ubuntu:
    sudo apt-get install python-bs4
    sudo apt-get install python-lxml
    sudo apt-get install python-requests
    Alternatively, see the bs4, lxml and requests sources on PyPI.

Usage

  • You can run the package by executing the command python crawler2 from the directory that contains the package.
  • You may need to modify some configuration arguments in __init__.py.
  • You may need to rewrite the parser module yourself to suit your needs.
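
If you do rewrite the parser, the sketch below shows one possible shape for a replacement. It is an illustration under several assumptions, not the weiboParser module itself: the names WeiboItemParser and parse_items are hypothetical, the markup it looks for (a span with class "ctt") is made up for the example and may not match the real pages, and it uses Python 3's standard-library html.parser instead of bs4 so that it runs without extra dependencies.

```python
from html.parser import HTMLParser


class WeiboItemParser(HTMLParser):
    """Collect the text of every <span class="ctt"> element (assumed markup)."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._depth = 0    # > 0 while inside a target <span>
        self._buffer = []  # text fragments of the current item

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested <span> tags so we close on the matching end tag.
            if tag == "span":
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "ctt" in classes:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._depth and tag == "span":
            self._depth -= 1
            if self._depth == 0:
                self.items.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def parse_items(html):
    """Return the text of each item found in one page of HTML."""
    parser = WeiboItemParser()
    parser.feed(html)
    parser.close()
    return parser.items
```

The returned list of strings could then be passed on for storage, for example as the data_list argument of write_DB, if that matches the format your version of the DB module expects.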