shotover/shotover-proxy

Improve transform/chain naming in logs

Opened this issue · 1 comments

rukai commented

For the purposes of logging and metrics we use a name to refer to a specific transform and to specific transform chains.

We want to rewrite this naming system to ensure that the displayed names have the following properties:

  • Names should be unique. If the topology contains multiple transforms of the same type then they should be named such that the user can tell which transform in the topology.yaml is responsible for that log or metric. Same goes for transform chains.
  • Implementation-wise they should get the name in the same way e.g. passed by argument or read from a global.

Maybe some of these goals are impossible or we need to compromise somewhere, but this should be our goal at least.

I have arrived at 2 possible designs to solve this problem.

Derived names

---
sources:
  - redis:
      name: "redis"
      listen_addr: "127.0.0.1:6379"
      connection_limit: 3000000
      chain:
        - Tee:
            behavior: Ignore
            buffer_size: 10000
            chain:
            - QueryTypeFilter:
                DenyList: [Read]
            - Coalesce:
                flush_when_buffered_message_count: 2000
                # Use an unreasonably large timeout here so that integration tests dont break on slow hardware or a performance regression
                flush_when_millis_since_last_flush: 1000000000
            - QueryCounter:
                name: "backup chain"
            - RedisSinkCluster:
                first_contact_points: [ "127.0.0.1:2120", "127.0.0.1:2121", "127.0.0.1:2122", "127.0.0.1:2123", "127.0.0.1:2124", "127.0.0.1:2125" ]
                connect_timeout_ms: 3000
        - QueryCounter:
            name: "primary chain"
        - RedisSinkCluster:
            first_contact_points: [ "127.0.0.1:2220", "127.0.0.1:2221", "127.0.0.1:2222", "127.0.0.1:2223", "127.0.0.1:2224", "127.0.0.1:2225" ]
            connect_timeout_ms: 3000
  - redis:
      name: "redis passthrough"
      listen_addr: "127.0.0.1:6380"
      connection_limit: 3000000
      chain:
        - RedisSinkCluster:
            first_contact_points: [ "127.0.0.1:2220", "127.0.0.1:2221", "127.0.0.1:2222", "127.0.0.1:2223", "127.0.0.1:2224", "127.0.0.1:2225" ]
            connect_timeout_ms: 3000
  • The sources are named "redis" and "redis passthrough"
  • The source chains are named "redis source" and "redis passthrough source"
  • The tee chain is named "redis source -> Tee chain transform"
  • The three RedisSinkCluster transforms are named:
    • "redis source -> Tee chain transform -> RedisSinkCluster"
    • "redis source" -> RedisSinkCluster"
    • "redis passthrough source" -> RedisSinkCluster"

notes:

  • This ensures that every name is unique without the user having to worry about it.
  • Predicting the name that will end up in a metric or log can be tricky, some nice documentation could help a bit though.

User configured names

---
sources:
  - redis:
      name: "redis"
      listen_addr: "127.0.0.1:6379"
      connection_limit: 3000000
      chain:
        - Tee:
            name: "tee"
            behavior: Ignore
            buffer_size: 10000
            chain:
            - QueryTypeFilter:
                name: "query type filter"
                DenyList: [Read]
            - Coalesce:
                name: "coalesce"
                flush_when_buffered_message_count: 2000
                # Use an unreasonably large timeout here so that integration tests dont break on slow hardware or a performance regression
                flush_when_millis_since_last_flush: 1000000000
            - QueryCounter:
                name: "backup chain"
            - RedisSinkCluster:
                name: "backup redis sink"
                first_contact_points: [ "127.0.0.1:2120", "127.0.0.1:2121", "127.0.0.1:2122", "127.0.0.1:2123", "127.0.0.1:2124", "127.0.0.1:2125" ]
                connect_timeout_ms: 3000
        - QueryCounter:
            name: "primary chain"
        - RedisSinkCluster:
            name: "primary redis sink"
            first_contact_points: [ "127.0.0.1:2220", "127.0.0.1:2221", "127.0.0.1:2222", "127.0.0.1:2223", "127.0.0.1:2224", "127.0.0.1:2225" ]
            connect_timeout_ms: 3000
  - redis:
      name: "redis passthrough"
      listen_addr: "127.0.0.1:6380"
      connection_limit: 3000000
      chain:
        - RedisSinkCluster:
            name: "redis passthrough sink"
            first_contact_points: [ "127.0.0.1:2220", "127.0.0.1:2221", "127.0.0.1:2222", "127.0.0.1:2223", "127.0.0.1:2224", "127.0.0.1:2225" ]
            connect_timeout_ms: 3000
  • The sources are named "redis" and "redis passthrough"
  • The source chains are named "redis" and "redis passthrough"
  • The tee chain is named "tee"
  • The three RedisSinkCluster transforms are named:
    • "redis primary sink"
    • "redis backup sink"
    • "redis passthrough sink"

Notes:

  • All names should be taken directly from a user provided string, this makes it predictable to a user what strings they should expect to see in their metrics/tracing.
    • source chains should be taken directly from the source name itself. Even though semantically it would make more sense to name it {source_name} source, doing so makes it harder for the user to predict what exactly the name will be just from reading the config.
    • Transforms that hold multiple chains must allow the user to define each name in the config, generating chain names per chain based on the transform name is not allowed
  • topology validation errors should still format names as e.g. {source_name} source where it makes sense but the "source" will not be included as part of the name.
  • Shotover should fail topology validation if any names are not unique, this gives us the same uniqueness guarantee of the other design.
  • This allows users to give meaningful names to all their transforms/chains, they can give names that hold meaning even without referring to the topology.yaml

Conclusion

Initially I only considered the first design and it seemed obviously better than any alternative.
However after discussion with jack I've come to see that the second design better aligns with the actual needs of users.
I think we'd better go for the second design.