treasure-data/pytd

Query succeeded but caused RecursionError

Gedevan-Aleksizde opened this issue · 5 comments

Sometimes .query causes an error like the following, even though the corresponding job status is "success." I cannot find the reason, but it tends to happen with long-running jobs, longer than a few hours.

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "test.py", line 10, in main
    _ = tdcl.query(
  File "/usr/local/lib/python3.8/dist-packages/pytd/client.py", line 245, in query
    res = engine.execute(header + query, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pytd/query_engine.py", line 96, in execute
    self.executed = cur.execute(query)
  File "/usr/local/lib/python3.8/dist-packages/tdclient/cursor.py", line 49, in execute
    self._do_execute()
  File "/usr/local/lib/python3.8/dist-packages/tdclient/cursor.py", line 82, in _do_execute
    return self._do_execute()
  File "/usr/local/lib/python3.8/dist-packages/tdclient/cursor.py", line 82, in _do_execute
    return self._do_execute()
  File "/usr/local/lib/python3.8/dist-packages/tdclient/cursor.py", line 82, in _do_execute
    return self._do_execute()
  [Previous line repeated 2954 more times]
  File "/usr/local/lib/python3.8/dist-packages/tdclient/cursor.py", line 64, in _do_execute
    status = self._api.job_status(self._executed)
  File "/usr/local/lib/python3.8/dist-packages/tdclient/job_api.py", line 170, in job_status
    with self.get(create_url("/v3/job/status/{job_id}", job_id=job_id)) as res:
  File "/usr/local/lib/python3.8/dist-packages/tdclient/api.py", line 185, in get
    response = self.send_request(
  File "/usr/local/lib/python3.8/dist-packages/tdclient/api.py", line 499, in send_request
    return self.http.request(
  File "/usr/local/lib/python3.8/dist-packages/urllib3/request.py", line 66, in request
    return self.request_encode_url(method, url, fields=fields,
  File "/usr/local/lib/python3.8/dist-packages/urllib3/request.py", line 89, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/poolmanager.py", line 324, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 597, in urlopen
    httplib_response = self._make_request(conn, method, url,
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 384, in _make_request
    six.raise_from(e, None)
  File "<string>", line 2, in raise_from
  File "/usr/local/lib/python3.8/dist-packages/urllib3/connectionpool.py", line 380, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/lib/python3.8/http/client.py", line 1347, in getresponse
    response.begin()
  File "/usr/lib/python3.8/http/client.py", line 331, in begin
    self.headers = self.msg = parse_headers(self.fp)
  File "/usr/lib/python3.8/http/client.py", line 225, in parse_headers
    return email.parser.Parser(_class=_class).parsestr(hstring)
  File "/usr/lib/python3.8/email/parser.py", line 67, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/usr/lib/python3.8/email/parser.py", line 56, in parse
    feedparser.feed(data)
  File "/usr/lib/python3.8/email/feedparser.py", line 176, in feed
    self._call_parse()
  File "/usr/lib/python3.8/email/feedparser.py", line 180, in _call_parse
    self._parse()
  File "/usr/lib/python3.8/email/feedparser.py", line 295, in _parsegen
    if self._cur.get_content_maintype() == 'message':
  File "/usr/lib/python3.8/email/message.py", line 594, in get_content_maintype
    ctype = self.get_content_type()
  File "/usr/lib/python3.8/email/message.py", line 578, in get_content_type
    value = self.get('content-type', missing)
  File "/usr/lib/python3.8/email/message.py", line 471, in get
    return self.policy.header_fetch_parse(k, v)
  File "/usr/lib/python3.8/email/_policybase.py", line 316, in header_fetch_parse
    return self._sanitize_header(name, value)
  File "/usr/lib/python3.8/email/_policybase.py", line 287, in _sanitize_header
    if _has_surrogates(value):
  File "/usr/lib/python3.8/email/utils.py", line 57, in _has_surrogates
    s.encode()
RecursionError: maximum recursion depth exceeded while calling a Python object
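
For context, the failing script looks roughly like this (a minimal sketch; the database name and query are illustrative, but the structure follows the traceback above):

import pytd
from multiprocessing import Pool

def main(_):
    # pytd reads TD_API_KEY from the environment if apikey is not given;
    # 'sample_datasets' is just an example database.
    tdcl = pytd.Client(database='sample_datasets')
    _ = tdcl.query(
        'select ...'  # a query that runs for a few hours
    )

if __name__ == '__main__':
    with Pool(1) as pool:
        pool.map(main, range(1))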

@Gedevan-Aleksizde Thank you for reporting the issue.

Could you try querying with a larger value of wait_interval?

# 1800sec = 30min, for example
client.query('select symbol, count(1) as cnt from nasdaq group by 1 order by 1', wait_interval=1800)  

I'm not sure how long your job actually takes, but any number should work as long as it's reasonably smaller than the job's running time.

Reason

In pytd.Client#query, what happens behind the scenes is that the client recursively fetches the job status until the job finishes, and the polling interval is defined by the wait_interval parameter: time.sleep(wait_interval)

Since the default value of wait_interval is 5 seconds, long-running jobs cause too many recursive calls, as you pointed out, and the code eventually fails unless we explicitly set a larger interval and decrease the number of recursive calls.
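
For illustration, the polling logic is roughly equivalent to the following sketch (simplified; not the actual tdclient source):

import time

# Simplified sketch of tdclient's Cursor._do_execute (not the actual source).
# The cursor checks the job status and, while the job is still running,
# sleeps for wait_interval seconds and calls itself again. A job running for
# N seconds therefore consumes roughly N / wait_interval stack frames, so the
# default limit of 1000 frames is exhausted after ~5000 seconds when
# wait_interval=5.
def poll_job_status(api, job_id, wait_interval=5):
    status = api.job_status(job_id)
    if status == "success":
        return status
    time.sleep(wait_interval)
    return poll_job_status(api, job_id, wait_interval)  # recursion, not a loop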

@Gedevan-Aleksizde I briefly researched related issues, but couldn't find the cause from the limited stack trace. It might be a network connection issue.
https://stackoverflow.com/questions/60432826/sudden-error-after-working-well-for-hours-recursion-error-maximum-recursion-dep

As takuti recommended, can you try bumping wait_interval?

Thank you. I confirmed the aforementioned error doesn't occur with a large wait_interval value (I tested =3600). But do I need to estimate the elapsed time of the job and specify the proper value manually (i.e., a small value for a small job)?

@Gedevan-Aleksizde Sorry for the late reply. Good to hear that a larger wait_interval helps.

But do I need to estimate the elapsed time of the job and specify the proper value manually (i.e., a small value for a small job)?

Yes. Unfortunately, this is one of the best approaches users can take to ensure successful execution.
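
If you'd rather not pick the value by hand every time, a small wrapper can derive wait_interval from a rough runtime estimate (a hypothetical helper, not part of pytd; the name and clamping bounds are assumptions):

def query_with_estimate(client, sql, expected_runtime_sec):
    # Hypothetical helper: poll roughly 100 times over the expected runtime,
    # clamped between the 5s default and 30min.
    interval = min(max(expected_runtime_sec // 100, 5), 1800)
    return client.query(sql, wait_interval=interval)

# e.g., for a job expected to run about two hours:
# res = query_with_estimate(client, 'select ...', expected_runtime_sec=7200)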

Meanwhile, considering the error happens due to excessive recursive calls, manually checking and raising the system's recursion limit may also help if you can come up with a reasonable value.

import sys

# 1000 by default.
# With `wait_interval=5`, 1000 * 5 = 5000sec (~83.3min) is the maximum job
# duration the script can wait for.
sys.getrecursionlimit()

# You can raise the limit.
sys.setrecursionlimit(2000)
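
Combining the two knobs: with wait_interval=1800 and sys.setrecursionlimit(2000), for example, the script can wait for up to 2000 * 1800 = 3,600,000 seconds (1,000 hours).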

Thank you. This kind of failure is rare, and it often seems to be caused by somewhat inefficient queries. I will try to tackle it with your solution and by writing queries more carefully.