robusta-dev/robusta

robusta-runner crashes if clusterName is read from env var in values.yaml

nice-pink opened this issue · 5 comments

Describe the bug
I have to set the clusterName value explicitly in the values.yaml file. If I read the value from an environment variable instead, the runner crashes.

Using environment variables works in other parts of values.yaml, so it should work here too.

To Reproduce
If installed using values_error.yaml, the runner fails. If installed using values_success.yaml, it succeeds.
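
(For reference, a standard Helm install along these lines reproduces it; the chart repo alias and release name may differ in your setup.)

helm install robusta robusta/robusta -f values_error.yaml    # runner crash-loops
helm install robusta robusta/robusta -f values_success.yaml  # runner starts fine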

values_error.yaml:

clusterName: "{{ env.CLUSTER_NAME }}"
customPlaybooks:
- triggers:
    - on_pod_crash_loop: {}
    - on_pod_oom_killed: {}
    - on_container_oom_killed: {}
    # - on_deployment_update: {}
  actions:
    - resource_babysitter: {}
  sinks:
    - slack
globalConfig:
  signing_key: "{{ env.SIGNING_KEY }}"
  account_id: "{{ env.ACCOUNT_ID }}"
  prometheus_url: "http://prometheus.ops.svc.cluster.local:80"
sinksConfig:
- slack_sink:
    name: main_slack_sink
    slack_channel: alerts
    api_key: "{{ env.SLACK_SINK_API_KEY }}"
- robusta_sink:
    name: robusta_ui_sink
    token: "{{ env.ROBUSTA_SINK_TOKEN }}"
    ttl_hours: 4380
enablePlatformPlaybooks: true
runner:
  resources:
    requests:
      cpu: 100m
      memory: 800Mi
    limits:
      memory: 800Mi
  sendAdditionalTelemetry: false
  additional_env_vars:
  - name: CLUSTER_NAME
    valueFrom:
      configMapKeyRef:
        name: cluster-config
        key: cluster-name
  - name: ACCOUNT_ID
    valueFrom:
      secretKeyRef:
        name: robusta-secret
        key: account_id
  - name: SIGNING_KEY
    valueFrom:
      secretKeyRef:
        name: robusta-secret
        key: signing_key
  - name: SLACK_SINK_API_KEY
    valueFrom:
      secretKeyRef:
        name: robusta-secret
        key: slack_sink_api_key
  - name: ROBUSTA_SINK_TOKEN
    valueFrom:
      secretKeyRef:
        name: robusta-secret
        key: robusta_sink_token

values_success.yaml:

clusterName: dev-cluster
customPlaybooks:
- triggers:
    - on_pod_crash_loop: {}
    - on_pod_oom_killed: {}
    - on_container_oom_killed: {}
    # - on_deployment_update: {}
  actions:
    - resource_babysitter: {}
  sinks:
    - slack
globalConfig:
  signing_key: "{{ env.SIGNING_KEY }}"
  account_id: "{{ env.ACCOUNT_ID }}"
  prometheus_url: "http://prometheus.ops.svc.cluster.local:80"
sinksConfig:
- slack_sink:
    name: main_slack_sink
    slack_channel: cluster-alerts
    api_key: "{{ env.SLACK_SINK_API_KEY }}"
- robusta_sink:
    name: robusta_ui_sink
    token: "{{ env.ROBUSTA_SINK_TOKEN }}"
enablePlatformPlaybooks: true
runner:
  resources:
    requests:
      cpu: 250m
      memory: 1024Mi
    limits:
      memory: 1024Mi
  sendAdditionalTelemetry: false
  additional_env_vars:
  - name: ACCOUNT_ID
    valueFrom:
      secretKeyRef:
        name: robusta-secret
        key: account_id
  - name: SIGNING_KEY
    valueFrom:
      secretKeyRef:
        name: robusta-secret
        key: signing_key
  - name: SLACK_SINK_API_KEY
    valueFrom:
      secretKeyRef:
        name: robusta-secret
        key: slack_sink_api_key
  - name: ROBUSTA_SINK_TOKEN
    valueFrom:
      secretKeyRef:
        name: robusta-secret
        key: robusta_sink_token

Logs
From runner:

setting up colored logging
2023-12-13 09:50:56.847 INFO     logger initialized using INFO log level
2023-12-13 09:50:56.847 INFO     Creating hikaru monkey patches
2023-12-13 09:50:56.847 INFO     Creating yaml monkey patch
2023-12-13 09:50:56.848 INFO     Creating kubernetes ContainerImage monkey patch
2023-12-13 09:50:56.849 INFO     watching dir /etc/robusta/playbooks/ for custom playbooks changes
2023-12-13 09:50:56.865 INFO     watching dir /etc/robusta/config/active_playbooks.yaml for custom playbooks changes
2023-12-13 09:50:56.865 INFO     Reloading playbook packages due to change on initialization
2023-12-13 09:50:56.865 INFO     loading config /etc/robusta/config/active_playbooks.yaml
2023-12-13 09:50:56.962 ERROR    unknown error reloading playbooks. will try again when they next change
Traceback (most recent call last):
  File "/app/src/robusta/runner/config_loader.py", line 159, in __reload_playbook_packages
    runner_config = self.__load_runner_config(self.config_file_path)
  File "/app/src/robusta/runner/config_loader.py", line 276, in __load_runner_config
    yaml_content = yaml.safe_load(file)
  File "/usr/local/lib/python3.9/site-packages/yaml/__init__.py", line 125, in safe_load
    return load(stream, SafeLoader)
  File "/usr/local/lib/python3.9/site-packages/yaml/__init__.py", line 81, in load
    return loader.get_single_data()
  File "/usr/local/lib/python3.9/site-packages/yaml/constructor.py", line 51, in get_single_data
    return self.construct_document(node)
  File "/usr/local/lib/python3.9/site-packages/yaml/constructor.py", line 60, in construct_document
    for dummy in generator:
  File "/usr/local/lib/python3.9/site-packages/yaml/constructor.py", line 413, in construct_yaml_map
    value = self.construct_mapping(node)
  File "/usr/local/lib/python3.9/site-packages/yaml/constructor.py", line 218, in construct_mapping
    return super().construct_mapping(node, deep=deep)
  File "/usr/local/lib/python3.9/site-packages/yaml/constructor.py", line 141, in construct_mapping
    raise ConstructorError("while constructing a mapping", node.start_mark,
yaml.constructor.ConstructorError: while constructing a mapping
  in "/etc/robusta/config/active_playbooks.yaml", line 14, column 17
found unhashable key
  in "/etc/robusta/config/active_playbooks.yaml", line 14, column 18�[0m
2023-12-13 09:50:56.966 INFO     Initialized task queue: 20 workers. Max size 500
2023-12-13 09:50:56.982 INFO     Initialized task queue: 20 workers. Max size 500
2023-12-13 09:50:57.239 INFO     Setting cluster active to True
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/app/src/robusta/runner/main.py", line 51, in <module>
    main()
  File "/app/src/robusta/runner/main.py", line 45, in main
    event_handler.set_cluster_active(True)
  File "/app/src/robusta/core/playbooks/playbooks_event_handler_impl.py", line 338, in set_cluster_active
    for sink in self.registry.get_sinks().get_all().values():
AttributeError: 'NoneType' object has no attribute 'get_all'
Exception in thread fs-watcher:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    Exception in thread fs-watcher:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/app/src/robusta/utils/file_system_watcher.py", line 27, in fs_watch
self.run()
  File "/usr/local/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/app/src/robusta/utils/file_system_watcher.py", line 27, in fs_watch
    for _ in watch(self.path_to_watch, stop_event=self.stop_event):
  File "/usr/local/lib/python3.9/site-packages/watchgod/main.py", line 38, in watch
    yield loop.run_until_complete(_awatch.__anext__())
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/watchgod/main.py", line 121, in __anext__
    for _ in watch(self.path_to_watch, stop_event=self.stop_event):
  File "/usr/local/lib/python3.9/site-packages/watchgod/main.py", line 38, in watch
    yield loop.run_until_complete(_awatch.__anext__())
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/watchgod/main.py", line 121, in __anext__
    new_changes = await self.run_in_executor(watcher.check)
  File "/usr/local/lib/python3.9/site-packages/watchgod/main.py", line 142, in run_in_executor
    return await self._loop.run_in_executor(self._executor, func, *args)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 819, in run_in_executor
    executor.submit(func, *args), loop=self)
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 167, in submit
    new_changes = await self.run_in_executor(watcher.check)
  File "/usr/local/lib/python3.9/site-packages/watchgod/main.py", line 142, in run_in_executor
    return await self._loop.run_in_executor(self._executor, func, *args)
  File "/usr/local/lib/python3.9/asyncio/base_events.py", line 819, in run_in_executor
        raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
executor.submit(func, *args), loop=self)
  File "/usr/local/lib/python3.9/concurrent/futures/thread.py", line 167, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
2023-12-13 09:52:26.670 INFO     SIGINT handler called
2023-12-13 09:52:26.670 INFO     Setting cluster active to False
Exception ignored in: <module 'threading' from '/usr/local/lib/python3.9/threading.py'>
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/threading.py", line 1477, in _shutdown
    lock.acquire()
  File "/app/src/robusta/core/playbooks/playbooks_event_handler_impl.py", line 351, in handle_sigint
    self.set_cluster_active(False)
  File "/app/src/robusta/core/playbooks/playbooks_event_handler_impl.py", line 338, in set_cluster_active
    for sink in self.registry.get_sinks().get_all().values():
AttributeError: 'NoneType' object has no attribute 'get_all'
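
My best guess at the root cause (I haven't dug into the chart templates): the chart seems to write clusterName into active_playbooks.yaml unquoted, so the rendered file contains a bare {{ env.CLUSTER_NAME }}. PyYAML then reads the outer braces as a flow mapping whose key is itself the mapping { env.CLUSTER_NAME: null }, and a mapping used as a key is exactly the "found unhashable key" error above. The YAML-parsing part alone can be reproduced with:

import yaml

# The unquoted placeholder parses as a flow mapping whose single key is the
# nested mapping {env.CLUSTER_NAME: null}; a dict key is unhashable, so this
# raises yaml.constructor.ConstructorError: "found unhashable key".
yaml.safe_load("cluster_name: {{ env.CLUSTER_NAME }}")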

Expected behavior
I should be able to set any value via an environment variable.

Additional context
I have Robusta running in various clusters and want to share as much configuration as possible. Fully supporting environment variables in values.yaml would make it possible to keep everything cluster-specific in a tiny ConfigMap.

Hey, it's not supported today. What's your motivation for doing it via a tiny ConfigMap as opposed to a per-cluster Helm override value? (And are you installing with Flux or ArgoCD?)

I'd love to understand the use case a little more.

Hey @aantn, thanks for your reply. We have several edge clusters all running the same apps, and we use Argo CD to manage them. Each edge cluster has a single Kustomize file referencing the template cluster plus a cluster-specific ConfigMap. The ConfigMap defines cluster-specific variables such as the cluster name. Since every cluster deploys a ConfigMap with the same name, we can easily reference its values, and in this case it would be easy to load the value as an environment variable. We do the same with Prometheus, e.g. for external labels.
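
To make it concrete, here is roughly what the cluster-config ConfigMap referenced from values_error.yaml looks like in each cluster overlay (a sketch; names are taken from the values file above):

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
data:
  cluster-name: dev-cluster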

Any plans on supporting this soonish?

Sorry, no update on this yet. Is this a blocker for your adoption?

It would make the setup much easier. As described above, it would reduce the per-cluster setup of the cloned clusters quite a lot.