[BUG] - live_response_api - Free JobWorker on timeout error
litemanawake opened this issue · 3 comments
I am seeing this behaviour on:
- OS: Windows
- Carbon Black Cloud Products: Endpoint Standard
- Python Version: 3.9.7
Describe the bug
When a live response session cannot be created due to TimeoutError / 404:
Error encountered by JobWorker[7369023]: Timed out when requesting /appservices/v6/orgs/_org_id_/liveresponse/sessions/12345:7369023 from API with HTTP status code 404: Could not establish session with device 7369023
The JobWorker (Future) is not stopped and never returns. This keeps JobWorkers tied up and prevents additional Live Response jobs from starting work.
Steps to Reproduce
Steps to reproduce the behavior (Provide a log message if relevant):
Using the structure from examples/jobrunner.py
from concurrent.futures import as_completed

# cb, jobobject and online_devices are set up as in examples/jobrunner.py
futures = {}            # maps each Future to its device ID
completed_devices = []  # IDs of devices whose job returned a result

# collect 'future' objects for all jobs
for device in online_devices:
    f = cb.live_response.submit_job(jobobject.run, device)
    futures[f] = device.id

# iterate over all the futures
for f in as_completed(futures.keys(), timeout=100):
    if f.exception() is None:
        print("Device {0} had result:".format(futures[f]))
        print(f.result())
        completed_devices.append(futures[f])
    else:
        print("Device {0} had error:".format(futures[f]))
        print(f.exception())

still_to_do = set([s.id for s in online_devices]) - set(completed_devices)
print("The following devices were attempted but not completed or errored out:")
for device in still_to_do:
    print("  {0}".format(device))
If session creation times out, the future is never marked as done.
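Until the SDK resolves such futures itself, the calling code can at least avoid losing track of them. The following is a minimal caller-side sketch, not SDK code: `drain` is a hypothetical helper, and the bare `Future` objects stand in for what `submit_job` returns. It relies on `as_completed` raising `concurrent.futures.TimeoutError` when some futures are still pending once the timeout expires.

```python
from concurrent.futures import Future, as_completed
from concurrent.futures import TimeoutError as FuturesTimeoutError


def drain(futures, timeout):
    """Split device IDs into finished and unfinished (hypothetical helper).

    `futures` maps each Future to a device ID, as in the loop above.
    """
    finished = []
    try:
        for f in as_completed(futures.keys(), timeout=timeout):
            finished.append(futures[f])
    except FuturesTimeoutError:
        # At least one future never completed; fall through and report it.
        pass
    unfinished = [dev for f, dev in futures.items() if not f.done()]
    return finished, unfinished


# Stand-ins for submitted jobs: one resolved, one that is never resolved
# (like the future of a job whose session could not be created).
ok = Future()
ok.set_result("done")
stuck = Future()

finished, unfinished = drain({ok: 1, stuck: 2}, timeout=0.2)
print("finished:", finished)
print("unfinished:", unfinished)
```

This only makes the stuck devices visible to the caller; it does not free anything inside the SDK.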
Expected behavior
If a session cannot be established with a device, the exception should be returned and the JobWorker freed up.
MaxWorkers is capped at 10, so a worker that times out should be freed up for other jobs.
Updated logs -- the JobWorker does exit and get deleted, but the Future object is not marked as completed, so it is never returned by the as_completed() loop.
2022-03-31 12:28:32,783 - cbc_sdk.live_response_api - DEBUG - url: /appservices/v6/orgs/orgid/liveresponse/sessions/123456:8696711 -> status: PENDING
2022-03-31 12:28:34,007 - cbc_sdk.live_response_api - DEBUG - url: /appservices/v6/orgs/orgid/liveresponse/sessions/123456:8696711 -> status: PENDING
2022-03-31 12:28:35,217 - cbc_sdk.live_response_api - DEBUG - url: /appservices/v6/orgs/orgid/liveresponse/sessions/123456:8696711 -> status: PENDING
2022-03-31 12:28:36,450 - cbc_sdk.live_response_api - DEBUG - url: /appservices/v6/orgs/orgid/liveresponse/sessions/123456:8696711 -> status: PENDING
2022-03-31 12:28:37,663 - cbc_sdk.live_response_api - DEBUG - Got item: <cbc_sdk.live_response_api.WorkerStatus object at 0x000001D0068761C0>
2022-03-31 12:28:37,664 - cbc_sdk.live_response_api - ERROR - Error encountered by JobWorker[8696711]: Timed out when requesting /appservices/v6/orgs/orgid/liveresponse/sessions/123456:8696711 from API with HTTP status code 404: Could not establish session with device 8696711
2022-03-31 12:28:37,665 - cbc_sdk.live_response_api - DEBUG - Entering scheduler
2022-03-31 12:28:37,666 - cbc_sdk.live_response_api - DEBUG - There are idle workers for device ids set()
2022-03-31 12:28:37,666 - cbc_sdk.live_response_api - DEBUG - 0 jobs ready to execute in existing execution slots
2022-03-31 12:28:37,666 - cbc_sdk.live_response_api - DEBUG - Waiting for item on Scheduler Queue
2022-03-31 12:28:37,667 - cbc_sdk.live_response_api - DEBUG - Got item: <cbc_sdk.live_response_api.WorkerStatus object at 0x000001D0068761F0>
2022-03-31 12:28:37,667 - cbc_sdk.live_response_api - DEBUG - JobWorker[8696711] has exited, waiting...
2022-03-31 12:28:37,667 - cbc_sdk.live_response_api - DEBUG - JobWorker[8696711] deleted
2022-03-31 12:28:37,667 - cbc_sdk.live_response_api - DEBUG - Entering scheduler
2022-03-31 12:28:37,668 - cbc_sdk.live_response_api - DEBUG - There are idle workers for device ids set()
2022-03-31 12:28:37,668 - cbc_sdk.live_response_api - DEBUG - 0 jobs ready to execute in existing execution slots
2022-03-31 12:28:37,668 - cbc_sdk.live_response_api - DEBUG - Waiting for item on Scheduler Queue
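The logs above show the worker exiting while its future stays pending. One way to get the expected behavior is for a worker that fails to open a session to set the exception on every job still queued for it. The sketch below illustrates that idea under stated assumptions; it is not the SDK's code and not necessarily what the referenced fix does. `SessionError`, `failing_open_session`, `run_worker`, and the `(future, func)` queue layout are all hypothetical names.

```python
from concurrent.futures import Future, as_completed
from queue import Empty, Queue


class SessionError(Exception):
    """Hypothetical stand-in for the SDK's session timeout error."""


def failing_open_session(device_id):
    # Hypothetical session factory that always fails, to mimic the 404 timeout.
    raise SessionError(
        "Could not establish session with device {0}".format(device_id)
    )


def run_worker(device_id, jobs, open_session):
    """Run queued (future, func) jobs for one device (hypothetical worker)."""
    try:
        session = open_session(device_id)
    except Exception as exc:
        # No session: resolve every pending future with the error so that
        # callers waiting in as_completed() are released.
        while True:
            try:
                future, _ = jobs.get_nowait()
            except Empty:
                return
            future.set_exception(exc)

    while True:
        try:
            future, func = jobs.get_nowait()
        except Empty:
            return
        try:
            result = func(session)
        except Exception as exc:
            future.set_exception(exc)
        else:
            future.set_result(result)


jobs = Queue()
pending = Future()
jobs.put((pending, lambda session: "unused"))

run_worker(8696711, jobs, failing_open_session)

# The future is now done, so as_completed() yields it instead of hanging.
completed = list(as_completed([pending], timeout=1))
print(pending.done(), repr(pending.exception()))
```

With the future resolved this way, the `f.exception()` branch of the reproduction loop would report the failed device instead of waiting until the timeout.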
Fix in PR #324
Fix provided; issue closed.