bug
2016-04-07 19:11:53 [scrapy] INFO: Scrapy 1.0.5 started (bot: abc)
2016-04-07 19:11:53 [scrapy] INFO: Optional features available: ssl, http11
2016-04-07 19:11:53 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'abc.spiders', 'SPIDER_MODULES': ['abc.spiders'], 'COOKIES_ENABLED': False, 'BOT_NAME': 'abc'}
2016-04-07 19:11:53 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2016-04-07 19:11:53 [py.warnings] WARNING: /usr/local/lib/python2.7/dist-packages/scrapy/utils/deprecate.py:155: ScrapyDeprecationWarning: scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware
class is deprecated, use scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
instead
ScrapyDeprecationWarning)
2016-04-07 19:11:53 [scrapy] INFO: Enabled downloader middlewares: RandomUserAgent, ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, ChunkedTransferMiddleware, DownloaderStats
2016-04-07 19:11:53 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2016-04-07 19:11:53 [scrapy] INFO: Enabled item pipelines: abcPipeline
2016-04-07 19:11:53 [scrapy] INFO: Spider opened
2016-04-07 19:11:53 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-04-07 19:11:53 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6026
***********_ProxyMiddleware no pass_*********http://112.124.4.132:80
2016-04-07 19:11:53 [scrapy] ERROR: Error downloading <GET https://www.abc.com/>: Could not open CONNECT tunnel.
2016-04-07 19:11:53 [scrapy] INFO: Closing spider (finished)
2016-04-07 19:11:53 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
'downloader/request_bytes': 347,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 4, 7, 11, 11, 53, 888732),
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 4, 7, 11, 11, 53, 578478)}
2016-04-07 19:11:53 [scrapy] INFO: Spider closed (finished)
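The TunnelError above means the proxy at http://112.124.4.132:80 never completed the HTTPS CONNECT handshake, which is typical of dead or HTTP-only public proxies. A quick way to test a proxy outside Scrapy is a sketch like the one below, using the requests library (not part of the original report; the target URL is the one from the log):

import requests

# standalone check: can this proxy tunnel HTTPS at all?
proxy = 'http://112.124.4.132:80'
try:
    r = requests.get('https://www.abc.com/',
                     proxies={'https': proxy}, timeout=10)
    print r.status_code
except requests.RequestException as e:
    print 'proxy failed:', e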
import random

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # fixed-proxy variant, kept for reference:
        #request.meta['proxy'] = HTTP_PROXY
        # pick a random proxy from the pool for each request
        proxy = random.choice(PROXIES)
        print "****ProxyMiddleware no pass**" + proxy
        request.meta['proxy'] = proxy  # proxy is already a string
DOWNLOADER_MIDDLEWARES = {
    'abc.middlewares.RandomUserAgent': 1,    # random user agent
    'abc.middlewares.ProxyMiddleware': 100,  # required for proxy support
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
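Incidentally, the ScrapyDeprecationWarning in the log comes from the scrapy.contrib path in this dict; the warning itself names the replacement, so on Scrapy 1.0 the settings can be written as:

DOWNLOADER_MIDDLEWARES = {
    'abc.middlewares.RandomUserAgent': 1,
    'abc.middlewares.ProxyMiddleware': 100,
    # new module path, as suggested by the deprecation warning
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}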
PROXIES = [
    'http://111.206.37.87:80',
    'http://112.124.4.132:80',
    'http://61.232.197.13:80',
    'http://121.22.253.39:80',
    'http://61.135.204.187:80',
]
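Free public proxies like these go stale quickly, which is the usual cause of "Could not open CONNECT tunnel". One possible mitigation (my own sketch, not from the thread) is to drop a proxy from the pool as soon as it raises a download exception, so RetryMiddleware can reschedule the request with a different one. The class below assumes the PROXIES list above and a middleware order between RetryMiddleware (500) and the download handler:

import random

class RotatingProxyMiddleware(object):
    # sketch: enable with e.g. {'abc.middlewares.RotatingProxyMiddleware': 610}
    # so that process_exception runs before RetryMiddleware (order 500)

    def __init__(self):
        self.pool = list(PROXIES)  # work on a copy of the pool

    def process_request(self, request, spider):
        if self.pool:
            request.meta['proxy'] = random.choice(self.pool)

    def process_exception(self, request, exception, spider):
        # discard the proxy that just failed (e.g. TunnelError, timeout);
        # returning None lets RetryMiddleware retry with a fresh proxy
        proxy = request.meta.get('proxy')
        if proxy in self.pool:
            self.pool.remove(proxy)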
My program doesn't actually use abc.middlewares anywhere, so the problem is probably there. I'll remove it and run again.
thanks a lot
I set the proxy in the spider as you suggested, and part of the crawl does run quickly, but after a while I still get:
ERROR: Error downloading Could not open CONNECT tunnel.
TCP connection timed out: 110: Connection timed out. ???
The middleware approach is the one described in the docs, so I don't understand why it doesn't work, while setting the proxy in the spider as you suggested does:
http://scrapy-chs.readthedocs.org/zh_CN/1.0/topics/downloader-middleware.html#id2
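For anyone landing here: "setting the proxy in the spider" just means attaching it to meta on each request you yield, roughly like this minimal sketch (the spider name and URL are placeholders; PROXIES is the list shown above):

import random
import scrapy

class AbcSpider(scrapy.Spider):
    name = 'abc'

    def start_requests(self):
        # attach the proxy per request instead of via a middleware
        yield scrapy.Request('https://www.abc.com/',
                             meta={'proxy': random.choice(PROXIES)},
                             callback=self.parse)

    def parse(self, response):
        pass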
One IP can only fetch fewer than 400 pages in half an hour (as far as I remember).
Why do some of mine error out after just a few fetches...
How long does it take you to crawl several hundred thousand users?
About a day.