/blog-crawler

It is a blog crawler which is specified to CSDN.

Primary LanguagePython

###function Crawl csdn blog of mine from start to end.

###features of ban

  1. rotate useragent --> useragentmiddleware.py
  2. rotate proxy ip --> proxy.py

###Reusable components

  1. useragentmiddleware.py
  2. proxy.py

###TODO When the ip is useless, scrapy will retry with other ip automaticly. However it is a waste of time so much. useless_ip
So it is importent to choice correct proxy ips. But the best method is to capture the err and do sth. It is the things to do.