airbnb/streamalert

[bug][app] box app may timeout if retry kicked in

chunyong-lin opened this issue · 2 comments

Background

We observe that box app may timed out when the retry kicked in during when the lambda timeout is almost ended. The close code is

except requests.exceptions.Timeout:
# Retry requests that timed out a few more times, with increased timeout
timeout *= 2
LOGGER.debug('Attempting new request with timeout: %0.2f seconds', timeout)
if retry_count == self._MAX_RETRY_COUNT:
raise # eventually give up and raise this
return _perform_request(timeout, allow_retry, retry_count + 1)

Following lambda logs from the box app say that the lambda still has 17.67 seconds left and it continued pulling more logs which the retries happened and lambda function timed out.

[INFO]	2020-02-20 22:53:07,020.20Z	33333333-4444-5555-6666-777777777777	[_gather] Function executed in 7.5856 seconds.

[INFO]	2020-02-20 22:53:07,020.20Z	33333333-4444-5555-6666-777777777777	Lambda remaining seconds: 17.67

[WARNING]	2020-02-20 22:53:10,132.132Z	33333333-4444-5555-6666-777777777777	�[31mRequest "GET https://api.box.com/2.0/events" failed with ReadTimeout exception: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='api.box.com', port=443): Read timed out. (read timeout=3.05)"))�[0m

[WARNING]	2020-02-20 22:53:16,351.351Z	33333333-4444-5555-6666-777777777777	�[31mRequest "GET https://api.box.com/2.0/events" failed with ReadTimeout exception: ReadTimeout(ReadTimeoutError("HTTPSConnectionPool(host='api.box.com', port=443): Read timed out. (read timeout=6.1)"))�[0m

END RequestId: 33333333-4444-5555-6666-777777777777
REPORT RequestId: 33333333-4444-5555-6666-777777777777	Duration: 300100.39 ms	Billed Duration: 300000 ms	Memory Size: 128 MB	Max Memory Used: 128 MB	
2020-02-20T22:53:24.794Z 33333333-4444-5555-6666-777777777777 Task timed out after 300.10 seconds

Steps to Reproduce

It is not very easy to reproduce this issue only if you have many logs to pull and the retries would be happened during very last pull in a single lambda invocation.

Desired Change

I am thinking we may need to check remaining time before retries happened and give up the future pull if the remaining time is not enough in a single lambda invocation.

I don't love getting rid of the alert especially since it doesn't page. Would be nice to have this top of mind. I'm on-call next so I can take a look at fixing this.

Sounds good @blakemotl I will close this PR.