istresearch/scrapy-cluster

Website didn't respond for a long time, how to solve the problem

Johnson0016 opened this issue · 1 comment

Hi @madisonb,
I started using scrapy-cluster recently. It is really useful for large amounts of data, but I have run into a problem.
The problem appears when the target website does not respond for a long time. I know that if the response status code is 404 or 5xx several times in a row, scrapy-cluster puts the URL back at the end of the Redis queue, but that mechanism does not seem to kick in when the site simply never responds.
I did set DOWNLOAD_TIMEOUT to 30 seconds, but it does not always work. Do you have a good solution for this? Should I use an errback function (a sketch of what I mean is below), or something else?
Thanks for the help!
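
For reference, here is a minimal sketch of the errback idea I was considering. It is plain Scrapy with a placeholder spider name and URL, not something wired into scrapy-cluster:

```python
# Minimal plain-Scrapy sketch of the errback idea; the spider name and URL
# are placeholders, and this is not integrated with scrapy-cluster.
import scrapy
from twisted.internet.error import TimeoutError as DownloadTimeoutError


class TimeoutAwareSpider(scrapy.Spider):
    name = 'timeout_aware'  # placeholder name

    def start_requests(self):
        yield scrapy.Request(
            'https://example.com/',          # placeholder URL
            callback=self.parse,
            errback=self.handle_error,       # invoked when the download fails
            meta={'download_timeout': 30},   # per-request override of DOWNLOAD_TIMEOUT
        )

    def parse(self, response):
        self.logger.info('Got %s with status %s', response.url, response.status)

    def handle_error(self, failure):
        # failure.check() returns the matching exception class, or None
        if failure.check(DownloadTimeoutError):
            request = failure.request
            self.logger.warning('Timed out fetching %s', request.url)
            # one could re-queue the request here instead of just logging it
```

And here is the traceback I see when the timeout does fire: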

```
Traceback (most recent call last):

File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/twisted/internet/defer.py", line 1384, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/twisted/python/failure.py", line 422, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scrapy/core/downloader/handlers/http11.py", line 351, in _cb_timeout
raise TimeoutError("Getting %s took longer than %s seconds." % (url, timeout))
twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting https://xxxxxxxxxxxxxxxxxxxxxxxxx.com/ took longer than 30.0 seconds..

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
result = g.send(result)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scrapy/core/downloader/middleware.py", line 66, in process_exception
spider=spider)
File "/home/kevin/project/scrapy-cluster/crawler/crawling/log_retry_middleware.py", line 93, in process_exception
self._log_retry(request, exception, spider)
File "/home/kevin/project/scrapy-cluster/crawler/crawling/log_retry_middleware.py", line 107, in _log_retry
self.logger.error('Scraper Retry', extra=extras)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scutils/log_factory.py", line 254, in error
extras = self.add_extras(extra, "ERROR")
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/site-packages/scutils/log_factory.py", line 329, in add_extras
my_copy = copy.deepcopy(dict)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 247, in _deepcopy_method
return type(x)(x.func, deepcopy(x.self, memo))
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 220, in _deepcopy_tuple
y = [deepcopy(a, memo) for a in x]
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 220, in <listcomp>
y = [deepcopy(a, memo) for a in x]
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 180, in deepcopy
y = _reconstruct(x, memo, *rv)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 280, in _reconstruct
state = deepcopy(state, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 150, in deepcopy
y = copier(x, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 240, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/home/kevin/.virtualenvs/scrapy_cluster/lib/python3.6/copy.py", line 169, in deepcopy
rv = reductor(4)
TypeError: can't pickle select.epoll objects
```

We are at the mercy of Scrapy's internal download machinery to ensure that the website returns within a reasonable time frame. I agree that DOWNLOAD_TIMEOUT doesn't always work correctly, but my initial suggestion to mitigate this is to run a lot of spiders in your cluster.

For example, if you have 10 spiders and only 1 out of every 10 requests hits a really long download timeout, at least the other 9 spiders would keep working normally.
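
For what it's worth, the timeout behavior itself is governed by standard Scrapy settings rather than anything scrapy-cluster specific; the values below are only illustrative:

```python
# Illustrative values for the standard Scrapy settings that bound a slow
# download; these are plain Scrapy knobs, not scrapy-cluster specific.
DOWNLOAD_TIMEOUT = 30      # seconds before the downloader gives up on a request
RETRY_ENABLED = True
RETRY_TIMES = 2            # extra attempts after the first failure
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
CONCURRENT_REQUESTS_PER_DOMAIN = 8   # keeps one slow domain from tying up a spider
```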

This project mainly focuses on the distributed scheduling mechanism that lets the spiders get their tasking from the Redis server; it does not do much to control how the spider itself downloads the HTML from the website.
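
If you need finer control over that yourself, the usual place is a custom downloader middleware in your own project. The sketch below is hypothetical (the class name and retry cap are made up, and it is not part of scrapy-cluster); it would still need to be enabled in your DOWNLOADER_MIDDLEWARES setting:

```python
# Hypothetical downloader middleware that resubmits a request after a
# download timeout; the class name and retry cap are made up for this sketch.
from twisted.internet.error import TimeoutError as DownloadTimeoutError


class TimeoutRequeueMiddleware(object):
    """Send timed-out requests back through the scheduler a few times."""

    MAX_TIMEOUT_RETRIES = 3

    def process_exception(self, request, exception, spider):
        if isinstance(exception, DownloadTimeoutError):
            retries = request.meta.get('timeout_retries', 0)
            if retries < self.MAX_TIMEOUT_RETRIES:
                spider.logger.warning('Requeueing %s after timeout', request.url)
                retry_request = request.replace(dont_filter=True)
                retry_request.meta['timeout_retries'] = retries + 1
                return retry_request  # returned requests are rescheduled by Scrapy
        # returning None lets the remaining middlewares handle the exception
```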

I am going to close this issue since this seems to be more of a custom use case than a bug in the project; feel free to hop over to Gitter if you would like to chat more.