Crawler_frame_alpha

Uses urllib2 to crawl website data. Both GET and POST requests are supported, and for each method the use of a proxy can be switched on or off (proxy_enable).
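For reference, a minimal sketch of what such a request helper could look like with urllib2; only proxy_enable comes from the description above, while the function name fetch and the proxy_addr parameter are illustrative assumptions, not the project's actual API:

import urllib2

def fetch(url, data=None, proxy_enable=False, proxy_addr=None):
    # Hypothetical helper: data=None issues a GET, a urlencoded string
    # issues a POST. When proxy_enable is True the request is routed
    # through the given HTTP proxy, otherwise a direct connection is used.
    handlers = []
    if proxy_enable and proxy_addr:
        handlers.append(urllib2.ProxyHandler({'http': proxy_addr}))
    opener = urllib2.build_opener(*handlers)
    request = urllib2.Request(url, data)
    return opener.open(request, timeout=10).read()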


/*******************************************************************/
/* Copyright (C) tmxmall, yizhe, 2016-2017                         */
/*                                                                 */
/*  FILE NAME             :  Readme                                */
/*  PRINCIPAL AUTHOR      :  zpGao                                 */
/*  SUBSYSTEM NAME        :  Crawler                               */
/*  MODULE NAME           :  menu                                  */
/*  LANGUAGE              :  Python                                */
/*  TARGET ENVIRONMENT    :  ANY                                   */
/*  DATE OF FIRST RELEASE :  2016/05/31                            */
/*  DESCRIPTION           :  This is a spider program framework    */
/*******************************************************************/

# INSTRUCTIONS :
# 
# 1. Go to the ./Proxy dir and update the proxy entries (IP/Port/Type).
# 2. Format the URLs in the ./Url dir.
# 3. Run: python crawler.py
# 4. Crawled data will be stored in ./Data/data_pools.
# 5. If any error or exception occurs, crawled_urllist / uncrawled_urllist
#    will be written to the files crawled / uncrawled in ./Data; on a
#    successful run the uncrawled list will be empty (see the sketch
#    after these notes).
# 
#
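As a rough illustration of step 5, the crawled / uncrawled lists could be dumped like this; the function name dump_url_lists and the one-URL-per-line format are assumptions, only the list names and the ./Data file names come from the notes above:

import os

def dump_url_lists(crawled_urllist, uncrawled_urllist, data_dir='./Data'):
    # Hypothetical helper: persist progress so an interrupted run can be
    # inspected or resumed later, writing one URL per line.
    with open(os.path.join(data_dir, 'crawled'), 'w') as f:
        f.write('\n'.join(crawled_urllist))
    with open(os.path.join(data_dir, 'uncrawled'), 'w') as f:
        f.write('\n'.join(uncrawled_urllist))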