rancher/opni

Workload AIOps "deadlock" issue / Cannot uninstall AiOps

alexandreLamarre opened this issue · 5 comments

I had previously configured the deployment watchlist (~26 pods across 3 clusters), but it's not showing up here.

Deployment watchlist issue:

aiops.mp4

As a result, AiOps cannot be uninstalled at all.

The status reported by the model training plugin has not changed after ~30 minutes:

{"status":"training","statistics":{"timeElapsed":"0","percentageCompleted":"0","remainingTime":"0","currentEpoch":"0","modelAccuracy":0,"stage":"fetching data"}}
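For anyone triaging a similar report, here is a minimal sketch of how the status payload above could be checked for a stalled job. The helper name `looks_stalled` and the choice of which counters to inspect are assumptions made for illustration; this is not part of Opni or its training plugin.

```python
import json

# Status payload copied verbatim from the model training plugin output above.
STATUS = (
    '{"status":"training","statistics":{"timeElapsed":"0",'
    '"percentageCompleted":"0","remainingTime":"0","currentEpoch":"0",'
    '"modelAccuracy":0,"stage":"fetching data"}}'
)


def looks_stalled(raw: str) -> bool:
    """Return True when the job claims to be training but shows no progress.

    Hypothetical helper: treats a job as stalled if every progress counter
    is still zero while the reported status is "training".
    """
    payload = json.loads(raw)
    stats = payload.get("statistics", {})
    counters = ("timeElapsed", "percentageCompleted", "currentEpoch")
    # The plugin reports these counters as strings, so convert before comparing.
    no_progress = all(float(stats.get(key, 0)) == 0 for key in counters)
    return payload.get("status") == "training" and no_progress


if __name__ == "__main__":
    print(looks_stalled(STATUS))
```

A single zeroed snapshot is not conclusive on its own; the check is only meaningful when the same payload is still returned long after the job was started, as in this report.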
Opni svc inference pod logs

/opt/venv/lib64/python3.9/site-packages/elasticsearch/connection/http_urllib3.py:209: UserWarning: Connecting to https://opni-opensearch-svc.opni.svc:9200 using SSL with verify_certs=False is insecure.
  warnings.warn(
/app/opni_inference_service/./start_opnilog_inference.py:218: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
  logs_queue = asyncio.Queue(loop=loop)
2023-11-02 14:20:32,286 - INFO - Attempting to connect to NATS
2023-11-02 14:20:32,313 - INFO - Connected to NATS at opni-nats-client.opni.svc:4222... with cid: 70
2023-11-02 14:20:32,313 - INFO - Current server info: {'server_id': 'NBAYYVFXWJQIEI465ZUICTBZVTOX4PMFSI47FFXCPRKUZ7DBCJGRTZCJ', 'server_name': 'opni-nats-2', 'version': '2.8.4', 'proto': 1, 'git_commit': '66524ed', 'go': 'go1.17.10', 'host': '0.0.0.0', 'port': 4222, 'headers': True, 'auth_required': True, 'max_payload': 8388608, 'jetstream': True, 'client_id': 70, 'client_ip': '10.0.5.30', 'nonce': '1lRhq9dnT152SF4', 'cluster': 'opni', 'connect_urls': ['10.0.0.253:4222', '10.0.7.148:4222', '10.0.25.113:4222']}
2023-11-02 14:20:32,313 - INFO - NATS stats: {'in_msgs': 0, 'out_msgs': 0, 'in_bytes': 0, 'out_bytes': 0, 'reconnects': 0, 'errors_received': 0}
2023-11-02 14:20:32,313 - INFO - NATS options: {'verbose': True, 'pedantic': False, 'name': None, 'allow_reconnect': True, 'dont_randomize': False, 'reconnect_time_wait': 5, 'max_reconnect_attempts': -1, 'ping_interval': 120, 'max_outstanding_pings': 2, 'no_echo': False, 'user': None, 'password': None, 'token': None, 'connect_timeout': 2, 'drain_timeout': 30, 'tls_handshake_first': False}
2023-11-02 14:20:37,893 - ERROR - Cannot currently obtain necessary model files. Exiting function
2023-11-02 14:20:37,894 - INFO - initializing...
2023-11-02 14:20:37,902 - ERROR - No OpniLog model currently [Errno 2] No such file or directory: 'output/vocab.txt'
Opni training controller logs

2023-11-02 14:20:52,598 - INFO - Attempting to connect to NATS
2023-11-02 14:20:52,626 - INFO - Connected to NATS at opni-nats-client.opni.svc:4222... with cid: 2482409
2023-11-02 14:20:52,627 - INFO - Current server info: {'server_id': 'NB2EXPZYKREBYQM2VC63LKUZBO4MTNXA5GC2MOYR3SLCMPDCD2RIEXJE', 'server_name': 'opni-nats-0', 'version': '2.8.4', 'proto': 1, 'git_commit': '66524ed', 'go': 'go1.17.10', 'host': '0.0.0.0', 'port': 4222, 'headers': True, 'auth_required': True, 'max_payload': 8388608, 'jetstream': True, 'client_id': 2482409, 'client_ip': '10.0.15.17', 'nonce': 'DFOcInR9xZ4aOZA', 'cluster': 'opni', 'connect_urls': ['10.0.7.148:4222', '10.0.25.113:4222', '10.0.0.253:4222']}
2023-11-02 14:20:52,627 - INFO - NATS stats: {'in_msgs': 0, 'out_msgs': 0, 'in_bytes': 0, 'out_bytes': 0, 'reconnects': 0, 'errors_received': 0}
2023-11-02 14:20:52,627 - INFO - NATS options: {'verbose': True, 'pedantic': False, 'name': None, 'allow_reconnect': True, 'dont_randomize': False, 'reconnect_time_wait': 5, 'max_reconnect_attempts': -1, 'ping_interval': 120, 'max_outstanding_pings': 2, 'no_echo': False, 'user': None, 'password': None, 'token': None, 'connect_timeout': 2, 'drain_timeout': 30, 'tls_handshake_first': False}
2023-11-02 14:20:52,646 - INFO - Fetching size of the disk
2023-11-02 14:20:52,646 - INFO - Disk Total: 19 GiB
2023-11-02 14:20:52,646 - INFO - Disk Used: 10 GiB
2023-11-02 14:20:52,647 - INFO - Disk Free: 9 GiB
2023-11-02 14:20:52,647 - INFO - Retrieve sample logs from ES
Thu, 02 Nov 2023 14:20:53 GMT | starting dump
(node:11) Warning: Setting the NODE_TLS_REJECT_UNAUTHORIZED environment variable to '0' makes TLS connections and HTTPS requests insecure by disabling certificate verification.
(Use `node-default --trace-warnings ...` to show where the warning was created)
Thu, 02 Nov 2023 14:20:53 GMT | Error Emitted => {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [logs]","index":"logs","resource.id":"logs","resource.type":"index_or_alias","index_uuid":"_na_"}],"type":"index_not_found_exception","reason":"no such index [logs]","index":"logs","resource.id":"logs","resource.type":"index_or_alias","index_uuid":"_na_"},"status":404}
Thu, 02 Nov 2023 14:20:53 GMT | Error Emitted => {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [logs]","index":"logs","resource.id":"logs","resource.type":"index_or_alias","index_uuid":"_na_"}],"type":"index_not_found_exception","reason":"no such index [logs]","index":"logs","resource.id":"logs","resource.type":"index_or_alias","index_uuid":"_na_"},"status":404}
Thu, 02 Nov 2023 14:20:53 GMT | Total Writes: 0
Thu, 02 Nov 2023 14:20:53 GMT | dump ended with error (get phase) => NOT_FOUND: {"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index [logs]","index":"logs","resource.id":"logs","resource.type":"index_or_alias","index_uuid":"_na_"}],"type":"index_not_found_exception","reason":"no such index [logs]","index":"logs","resource.id":"logs","resource.type":"index_or_alias","index_uuid":"_na_"},"status":404}
2023-11-02 14:20:53,520 - ERROR - Sample failed to download
2023-11-02 14:20:53,520 - ERROR - [Errno 2] No such file or directory: '/var/opni-data/sample_logs.json'
workload drain logs

2023-11-02 14:21:07,291 - INFO - fail_keywords_str = (fail)|(error)|(missing)|(unable)
2023-11-02 14:21:07,375 - INFO - Connected to S3 client
2023-11-02 14:21:07,390 - INFO - opni-drain-model bucket does not exist so creating it now
2023-11-02 14:21:07,397 - INFO - Starting Drain3 template miner
2023-11-02 14:21:07,397 - INFO - Loading configuration from drain3.ini
2023-11-02 14:21:07,398 - INFO - Checking for saved state
2023-11-02 14:21:07,411 - ERROR - Cannot currently obtain DRAIN model file
2023-11-02 14:21:07,412 - INFO - Saved state not found
2023-11-02 14:21:07,412 - INFO - connecting to nats
2023-11-02 14:21:07,439 - INFO - Connected to NATS at opni-nats-client.opni.svc:4222... with cid: 73
2023-11-02 14:21:07,439 - INFO - Current server info: {'server_id': 'NBAYYVFXWJQIEI465ZUICTBZVTOX4PMFSI47FFXCPRKUZ7DBCJGRTZCJ', 'server_name': 'opni-nats-2', 'version': '2.8.4', 'proto': 1, 'git_commit': '66524ed', 'go': 'go1.17.10', 'host': '0.0.0.0', 'port': 4222, 'headers': True, 'auth_required': True, 'max_payload': 8388608, 'jetstream': True, 'client_id': 73, 'client_ip': '10.0.18.51', 'nonce': 'zM6HuXlyB1AqDcc', 'cluster': 'opni', 'connect_urls': ['10.0.0.253:4222', '10.0.7.148:4222', '10.0.25.113:4222']}
2023-11-02 14:21:07,439 - INFO - NATS stats: {'in_msgs': 0, 'out_msgs': 0, 'in_bytes': 0, 'out_bytes': 0, 'reconnects': 0, 'errors_received': 0}
2023-11-02 14:21:07,440 - INFO - NATS options: {'verbose': True, 'pedantic': False, 'name': None, 'allow_reconnect': True, 'dont_randomize': False, 'reconnect_time_wait': 5, 'max_reconnect_attempts': -1, 'ping_interval': 120, 'max_outstanding_pings': 2, 'no_echo': False, 'user': None, 'password': None, 'token': None, 'connect_timeout': 2, 'drain_timeout': 30}
Opni svc gpu controller logs

/opt/venv/lib64/python3.9/site-packages/elasticsearch/connection/http_urllib3.py:209: UserWarning: Connecting to https://opni-opensearch-svc.opni.svc:9200 using SSL with verify_certs=False is insecure.
warnings.warn(
/app/opni_inference_service/./start_opnilog_inference.py:218: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
logs_queue = asyncio.Queue(loop=loop)
2023-11-02 14:23:48,178 - INFO - Attempting to connect to NATS
2023-11-02 14:23:48,203 - INFO - Connected to NATS at opni-nats-client.opni.svc:4222... with cid: 228871
2023-11-02 14:23:48,203 - INFO - Current server info: {'server_id': 'NA5M3FXZ5KC42F2GWQZACNSDMZ7O7BOL4DXNUSBULKAYRNC7FLS57YSO', 'server_name': 'opni-nats-1', 'version': '2.8.4', 'proto': 1, 'git_commit': '66524ed', 'go': 'go1.17.10', 'host': '0.0.0.0', 'port': 4222, 'headers': True, 'auth_required': True, 'max_payload': 8388608, 'jetstream': True, 'client_id': 228871, 'client_ip': '10.0.24.1', 'nonce': '9iT9exDgDylZbrU', 'cluster': 'opni', 'connect_urls': ['10.0.25.113:4222', '10.0.7.148:4222', '10.0.0.253:4222']}
2023-11-02 14:23:48,203 - INFO - NATS stats: {'in_msgs': 0, 'out_msgs': 0, 'in_bytes': 0, 'out_bytes': 0, 'reconnects': 0, 'errors_received': 0}
2023-11-02 14:23:48,203 - INFO - NATS options: {'verbose': True, 'pedantic': False, 'name': None, 'allow_reconnect': True, 'dont_randomize': False, 'reconnect_time_wait': 5, 'max_reconnect_attempts': -1, 'ping_interval': 120, 'max_outstanding_pings': 2, 'no_echo': False, 'user': None, 'password': None, 'token': None, 'connect_timeout': 2, 'drain_timeout': 30, 'tls_handshake_first': False}
/app/opni_inference_service/./start_opnilog_inference.py:235: DeprecationWarning: The loop argument is deprecated since Python 3.8, and scheduled for removal in Python 3.10.
job_queue = asyncio.Queue(loop=loop)
2023-11-02 14:23:48,298 - ERROR - Cannot currently obtain necessary model files. Exiting function
2023-11-02 14:23:48,298 - INFO - initializing...
2023-11-02 14:23:48,306 - ERROR - No OpniLog model currently [Errno 2] No such file or directory: 'output/vocab.txt'

Restarting these pods manually or restarting them by switching AiOps storage settings appears to cause the training controller to hang:

Also note that the training controller appears to receive multiple job start signals on its NATS channel:

2023-11-02 16:02:18,714 - INFO - POST https://opni-opensearch-svc.opni.svc:9200/logs/_count [status:200 request:0.249s]
2023-11-02 16:02:18,715 - INFO - payload : {'max_size': 6111259, 'query': {'query': {'bool': {'filter': [{'range': {'time': {'gte': 1698937338453, 'lte': 1698940938453}}}], 'minimum_should_match': 1, 'should': [{'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': '16adacf0-e2d2-4ed5-a2ba-a8bb953f521e AND opni-agent AND opni-collector-aggregator'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': '16adacf0-e2d2-4ed5-a2ba-a8bb953f521e AND opni-agent AND opni-agent-kube-state-metrics'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': '16adacf0-e2d2-4ed5-a2ba-a8bb953f521e AND opni-agent AND opni-agent'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': '61f6be68-7542-47da-b24f-0e7664d832e9 AND opni-agent AND opni-agent-kube-state-metrics'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': '61f6be68-7542-47da-b24f-0e7664d832e9 AND opni-agent AND opni-collector-aggregator'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': '61f6be68-7542-47da-b24f-0e7664d832e9 AND opni-agent AND opni-agent'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': '61f6be68-7542-47da-b24f-0e7664d832e9 AND tigera-operator AND tigera-operator'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND cortex-ruler'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-gateway'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND 
opni-manager'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-svc-preprocessing'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-agent'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-svc-drain'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-svc-opensearch-update'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-otel-preprocessor'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND cortex-distributor'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-dashboards'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-kube-state-metrics'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND grafana'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND cortex-query-frontend'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND cortex-purger'}}, {'query_string': {'fields': ['cluster_id', 
'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-collector-aggregator'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-inference-opni-model-controlplane'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-inference-opni-model-longhorn'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-inference-opni-model-rancher'}}, {'query_string': {'fields': ['cluster_id', 'namespace_name.keyword', 'deployment.keyword'], 'query': 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6 AND opni AND opni-kube-prometheus-stack-operator'}}], 'must_not': [{'match': {'anomaly_level.keyword': 'Anomaly'}}, {'query_string': {'query': '(error) or (fail) or (fatal) or (exception) or (timeout) or (unavailable) or (crash) or (connection refused) or (network error) or (deadlock) or (out of disk) or (high load)', 'default_field': 'log'}}]}}}, 'count': 126553, 'parameters': {'uuid': '98141599-f6b0-4d4d-bbfd-7fa2e751a815', 'workloads': {'16adacf0-e2d2-4ed5-a2ba-a8bb953f521e': {'opni-agent': ['opni-collector-aggregator', 'opni-agent-kube-state-metrics', 'opni-agent']}, '61f6be68-7542-47da-b24f-0e7664d832e9': {'opni-agent': ['opni-agent-kube-state-metrics', 'opni-collector-aggregator', 'opni-agent'], 'tigera-operator': ['tigera-operator']}, 'b0e6aa24-0c22-4ee1-9cb7-0e7c18d010c6': {'opni': ['cortex-ruler', 'opni-gateway', 'opni-manager', 'opni-svc-preprocessing', 'opni-agent', 'opni-svc-drain', 'opni-svc-opensearch-update', 'opni-otel-preprocessor', 'cortex-distributor', 'opni-dashboards', 'opni-kube-state-metrics', 'grafana', 'cortex-query-frontend', 'cortex-purger', 'opni-collector-aggregator', 
'opni-inference-opni-model-controlplane', 'opni-inference-opni-model-longhorn', 'opni-inference-opni-model-rancher', 'opni-kube-prometheus-stack-operator']}}}}
2023-11-02 16:02:18,716 - INFO - Just received signal to begin running the jobs
2023-11-02 16:02:18,717 - INFO - message from training : JobStart
2023-11-02 16:02:18,719 - INFO - message from training : JobStart
2023-11-02 16:02:18,719 - INFO - message from training : JobStart
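To make the repeated signal easier to spot in longer logs, the following sketch counts `message from training` lines per second and reports any that occur more than once. The regular expression is inferred from the log lines above, and `count_signals` is a hypothetical helper, not something shipped with Opni.

```python
import re
from collections import Counter

# Excerpt copied from the training controller log above.
LOG = """\
2023-11-02 16:02:18,716 - INFO - Just received signal to begin running the jobs
2023-11-02 16:02:18,717 - INFO - message from training : JobStart
2023-11-02 16:02:18,719 - INFO - message from training : JobStart
2023-11-02 16:02:18,719 - INFO - message from training : JobStart
"""

# Assumed line layout: "<date> <time>,<millis> - <LEVEL> - message from training : <Signal>"
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} - \w+ - "
    r"message from training : (\w+)$"
)


def count_signals(log_text: str) -> Counter:
    """Count training signals, keyed by (timestamp to the second, signal name)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    counts = count_signals(LOG)
    # Keep only signals seen more than once within the same second.
    duplicates = {key: n for key, n in counts.items() if n > 1}
    print(duplicates)
```

On the excerpt above this reports three `JobStart` messages within the same second, which matches the observation that the controller receives the signal more than once.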

The model training status reported by the model training plugin is still at the "fetching data" stage.

I synced up with Amartya: I am basically soft-locked out of using workload models, because there is no way to forcefully clear the watchlist KV.