This is a Python package for web crawling.
Last month I posted the first version of the package. I soon found it intolerable to keep waiting between requests just to avoid being redirected by weibo.cn, so I decided to use proxies to disguise the package's HTTP requests.
Somewhat embarrassingly, I then found that the standard-library HTTP module urllib2 is out of date: Python developers now generally use requests, which builds on urllib3, for web crawling. So in this version I rewrote the crawler with requests.
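For illustration, a proxied request made with requests looks roughly like this; the proxy address and User-Agent string below are placeholders, not values used by the package:

```python
# Minimal example of a proxied request with requests.
# The proxy address and User-Agent header are placeholders.
import requests

proxies = {"http": "http://10.0.0.1:8080"}  # replace with a working proxy
headers = {"User-Agent": "Mozilla/5.0"}     # disguise the client

resp = requests.get("http://weibo.cn", headers=headers,
                    proxies=proxies, timeout=10)
print(resp.status_code)
```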
The package has five modules (a sketch of how they fit together follows the list):
- crawler: Functions for crawling web pages, starting from a specified URL.
- weiboParser: Functions for parsing an individual page into weibo data items.
- weiboLogin: Functions and classes for the weibo login procedure (weibo.cn only for now).
- proxies: Functions and data structures for proxies, such as a proxy pool.
- DB: Just two functions, init_DB() and write_DB(data_list).
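To make the division of labor concrete, here is a rough sketch of one crawl cycle. Apart from init_DB() and write_DB(data_list), which are named above, every signature below (login(), get_proxy(), crawl(), parse()) is a hypothetical illustration, not the package's actual API:

```python
# Hypothetical wiring of the five modules; only init_DB() and
# write_DB(data_list) are documented above, the rest is illustrative.
import crawler
import weiboParser
import weiboLogin
import proxies
import DB

DB.init_DB()                                    # prepare the database
session = weiboLogin.login("user", "password")  # hypothetical login helper
proxy = proxies.get_proxy()                     # hypothetical pool lookup
html = crawler.crawl("http://weibo.cn/u/12345", # placeholder start URL
                     session=session, proxy=proxy)
data_list = weiboParser.parse(html)             # extract weibo items
DB.write_DB(data_list)                          # persist them
```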
The package has only a few dependencies.
On Ubuntu you can install them with the following commands:
sudo apt-get install python-bs4
sudo apt-get install python-lxml
sudo apt-get install python-requests
Alternatively, you can find the bs4, lxml, and requests sources on PyPI.
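If you prefer pip, the same dependencies are available there (note that bs4 is published on PyPI as beautifulsoup4):
pip install beautifulsoup4 lxml requests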
- You can run the package by executing the command
python crawler2
in the directory where the package is located.
- You may need to modify some config arguments in __init__.py.
- The parser module (weiboParser) may need to be rewritten for your own use case; one possible approach is sketched below.
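If you do rewrite the parser, a minimal sketch with bs4 could look like the following. The div.c / span.ctt selectors are my assumption about weibo.cn's mobile markup, so verify them against the live pages before relying on this:

```python
# Minimal, hypothetical parser sketch; the class names "c" and "ctt"
# are assumptions about weibo.cn's markup, not part of this package.
from bs4 import BeautifulSoup

def parse_page(html):
    soup = BeautifulSoup(html, "lxml")
    data_list = []
    for post in soup.find_all("div", class_="c"):
        text = post.find("span", class_="ctt")
        if text is not None:
            data_list.append(text.get_text(strip=True))
    return data_list
```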