PeskyPotato/archive-chan

archive-chan hangs after a while when downloading whole boards

cardoso-neto opened this issue · 6 comments

I noticed this yesterday. After a while (say half an hour), archive-chan just hangs for some reason.
I thought it had something to do with the requests.get() calls having no timeout, so I replaced every call with a custom safe_get() function I threw together after skimming some requests tutorials.
However, it still hung even using my function.

So maybe it's something else? I'm not the best at debugging, though.
What I ran was python archiver.py pol -p -r 3 -v --use_db.

Doubling the number of processes in the Pool also had no noticeable effect.

I checked the ./threads/ folder and most of the *.html files had not been written yet, so when I hit Ctrl+C, I lost all of the text it probably still had in memory.
Maybe we should dump it all before exiting when catching a KeyboardInterrupt, so it isn't lost.
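Something like this, maybe (just a sketch of the idea, not archive-chan's actual structure; archive_threads, fetch, and the dump path are all made-up names here):

```python
import json
import sys


def archive_threads(threads, dump_path: str = "threads_dump.json") -> dict:
    """Collect thread text in memory, flushing it to disk on Ctrl+C."""
    collected = {}
    try:
        for thread_id, fetch in threads:
            collected[thread_id] = fetch()
    except KeyboardInterrupt:
        # Dump whatever we already fetched so the text isn't lost.
        with open(dump_path, "w") as f:
            json.dump(collected, f)
        sys.exit(130)  # conventional exit status for SIGINT
    return collected
```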

This is the function I wrote to test the timeout theory:

from typing import Optional

from requests import PreparedRequest, Response, Session
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class TimeoutHTTPAdapter(HTTPAdapter):
    """HTTPAdapter that applies a default timeout to every request."""

    def __init__(self, *args, timeout: Optional[int] = None, **kwargs):
        self.timeout = timeout if timeout is not None else 10
        super().__init__(*args, **kwargs)

    def send(self, request: PreparedRequest, timeout=None, **kwargs) -> Response:
        # Forward an explicit per-request timeout; otherwise fall back to the default.
        kwargs["timeout"] = timeout if timeout is not None else self.timeout
        return super().send(request, **kwargs)


def safe_get(url: str, max_retries: int = 3, timeout: int = 10) -> Response:
    """GET with a timeout on every attempt and retries on transient HTTP errors."""
    retry_strategy = Retry(
        total=max_retries,
        backoff_factor=1,
        status_forcelist=[413, 429, 500, 502, 503, 504],
    )
    adapter = TimeoutHTTPAdapter(timeout=timeout, max_retries=retry_strategy)

    session = Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)

    return session.get(url)

I just realized I forgot to replace the urllib.request.Request on line 18 of extractor.py, as well as the urlretrieve that downloads the media files, with a GET request that supports timeouts. Maybe it could be one of those? I'll try to hack together a solution.
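For the urlretrieve part, a streamed requests download would probably do (download_file and its arguments are just a sketch here, not what the repo actually calls anything):

```python
import requests


def download_file(url: str, dest_path: str, timeout: int = 10) -> None:
    """Stream a file to disk with a timeout, as a urlretrieve replacement."""
    with requests.get(url, stream=True, timeout=timeout) as response:
        response.raise_for_status()
        with open(dest_path, "wb") as f:
            # Write in chunks so large media files don't sit fully in memory.
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
```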

Thank you for taking the time to document this, I really appreciate it. I can't spend much time on my side project anymore but I'm glad you are able to make use of it. I'll take a closer look at this at some point.

I'll probably fix every bug I find, because I've found this incredibly useful, though the HTML template you created for the threads is a bit lacking, imo.
Eventually I'll flood you with pull requests.

Btw, after adding timeouts to every one of those requests, I found the little bugger.
I'll probably raise the timeout wait times a bit and then find a clean way to implement "retry n times on timeout".

multiprocessing.pool.RemoteTraceback: 

Traceback (most recent call last):
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/connectionpool.py", line 381, in _make_request
    self._validate_conn(conn)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/connectionpool.py", line 978, in _validate_conn
    conn.connect()
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/connection.py", line 371, in connect
    ssl_context=context,
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/util/ssl_.py", line 384, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/ssl.py", line 423, in wrap_socket
    session=session
  File "/home/neto/miniconda/envs/chan/lib/python3.7/ssl.py", line 870, in _create
    self.do_handshake()
  File "/home/neto/miniconda/envs/chan/lib/python3.7/ssl.py", line 1139, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:1074: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/util/retry.py", line 403, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/connectionpool.py", line 384, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/urllib3/connectionpool.py", line 336, in _raise_timeout
    self, url, "Read timed out. (read timeout=%s)" % timeout_value
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='a.4cdn.org', port=443): Read timed out. (read timeout=10)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/neto/miniconda/envs/chan/lib/python3.7/multiprocessing/pool.py", line 121, in worker
    result = (True, func(*args, **kwds))
  File "/home/neto/miniconda/envs/chan/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "archiver.py", line 88, in archive
    extractor.extract(thread, params)
  File "/home/neto/apps/archive-chan/extractors/fourchan_api.py", line 21, in extract
    self.get_data(thread, params)
  File "/home/neto/apps/archive-chan/extractors/fourchan_api.py", line 33, in get_data
    timeout=10,
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/site-packages/requests/adapters.py", line 529, in send
    raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='a.4cdn.org', port=443): Read timed out. (read timeout=10)



The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "archiver.py", line 137, in <module>
    main()
  File "archiver.py", line 132, in main
    feeder(url)
  File "archiver.py", line 121, in feeder
    res.get(86400)
  File "/home/neto/miniconda/envs/chan/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
requests.exceptions.ReadTimeout: None: None

I'm fairly certain I fixed it.
I'm not 100% sure though, since it's not an easily reproducible bug.
It's on the hang-up-fix-issue-3 branch of my fork, cardoso-neto/archive-chan, if you feel like testing it for yourself.

My commit messages contain some relevant information on what I did:

[screenshot of commit messages]

edit: cardoso-neto@a88cb78 .. cardoso-neto@b614458 commit range in case I delete that branch.

This can be fixed by adding timeouts and retries to all requests.get calls.
Feel free to use my special Session, which encapsulates the timeout and retry logic.
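As a rough sketch of what I mean by a reusable session (it repeats the TimeoutHTTPAdapter from earlier in the thread so the snippet runs standalone; as far as I can tell, urllib3's Retry also counts connect/read timeouts against total for idempotent methods like GET, which covers the "retry n times on timeout" part):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


class TimeoutHTTPAdapter(HTTPAdapter):
    """HTTPAdapter that applies a default timeout to every request."""

    def __init__(self, *args, timeout: int = 10, **kwargs):
        self.timeout = timeout
        super().__init__(*args, **kwargs)

    def send(self, request, timeout=None, **kwargs):
        kwargs["timeout"] = timeout if timeout is not None else self.timeout
        return super().send(request, **kwargs)


def make_retrying_session(max_retries: int = 3, timeout: int = 10) -> requests.Session:
    """Build one Session with timeout + retry baked in, to share across calls."""
    retry_strategy = Retry(
        total=max_retries,
        backoff_factor=1,
        status_forcelist=[413, 429, 500, 502, 503, 504],
    )
    adapter = TimeoutHTTPAdapter(timeout=timeout, max_retries=retry_strategy)
    session = requests.Session()
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session
```

Building the session once (e.g. per worker process) instead of per request also keeps connection pooling working, which safe_get as written above gives up by creating a fresh Session on every call.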