Event-based logging

Event-based logging helps to clarify operability of software systems by identifying possible failure modes and key execution events. This repo has examples and principles.

Copyright © 2018-2021 Conflux - Licensed under CC BY-SA 4.0

Definition

Event-based logging uses enum-based event definitions, checked at compile time, in every log call to make runtime diagnosis and team-member onboarding clearer and easier.

Benefits

When implemented with teams empowered to make useful improvements, event-based logging can:

Increase software reliability through improved definition of software behavior
Provide a common interface between developers and Ops / SRE / live service support, or between one team and another
Increase operational awareness in developers and Product Managers / Product Owners
Explore software run-time behaviour without actually running the code
Decrease "time-to-diagnose" for live incidents
Clean up logging: reduce "logorrhoea" (verbose logs)
Help to prepare for approaches like Domain-driven Design (DDD), Event-sourcing, and Chaos Engineering / Resilience Engineering
Reduce onboarding time for new team members

Treat logging as a two-way communication channel between people building systems and people running systems; this could be two separate teams or it could be the same team at different times of the week.

Principles

Event-based logging is designed to be a simple, expressive approach to exploring failure modes and real-world operational behaviour for all kinds of software.

Have a single, definitive list of application events in code
Use exactly one of these event codes in the log message when logging
Use an enum type or equivalent for compile-time checking of uniqueness and searchability
Get code-completion from the IDE or REPL when choosing an event to log
Copy an event name into a log search tool with a simple SHIFT-select or double-click, with no manual selection of multiple words
Avoid the need for a single cross-team events library by scoping events to specific services.
Avoid the need for complex regex searches in log tools: just search for a single, guaranteed-unique string.
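As a rough illustration of the principles above, the following sketch keeps one frozen list of events, checks that the names are unique when the module loads, and refuses to log anything that is not in the list. The names `formatEvent` and `logEvent`, the JSON line format, and the example details (`port`, `host`, `attempt`) are assumptions made for this sketch, not part of this repo.

```javascript
// Sketch only: `formatEvent`, `logEvent`, and the JSON layout are illustrative.

// The single, definitive list of events for this (hypothetical) service.
const Events = Object.freeze({
    DatabaseConnectionSuccess: 'DatabaseConnectionSuccess',
    DatabaseConnectionFailure: 'DatabaseConnectionFailure',
    AppStarted: 'AppStarted'
});

// Fail at startup if two keys share a value, since a duplicated
// value would make log searches ambiguous.
const eventNames = Object.values(Events);
if (new Set(eventNames).size !== eventNames.length) {
    throw new Error('Event names must be unique');
}

// Build a log line that carries exactly one event code.
function formatEvent(event, details = {}) {
    if (!eventNames.includes(event)) {
        throw new Error(`Unknown event: ${String(event)}`);
    }
    return JSON.stringify({ event, ...details });
}

// Write the line to stdout; swap in a real logger as needed.
function logEvent(event, details) {
    console.log(formatEvent(event, details));
}

logEvent(Events.AppStarted, { port: 8080 });
logEvent(Events.DatabaseConnectionFailure, { host: 'db.internal', attempt: 3 });
```

Because the event code is a single unbroken string, a search for `DatabaseConnectionFailure` in a log tool finds these lines without any regex.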

Examples

A key aim is to "lean on the compiler" for compile-time verification of the Event types when logging. This in turn means we get code-completion when choosing an Event type during logging:

// Nodejs example for event-based logging

const Events = Object.freeze({
    UndefinedError : 'UndefinedError',

    // Database events
    DatabaseConnectionSuccess : 'DatabaseConnectionSuccess',
    DatabaseConnectionFailure : 'DatabaseConnectionFailure',
    DatabaseConnectionTimeout : 'DatabaseConnectionTimeout',

    // Parsing events
    ParseStreamUnexpectedToken : 'ParseStreamUnexpectedToken',
    ParseStreamMissingData : 'ParseStreamMissingData',
    ParseStreamSuccess : 'ParseStreamSuccess',

    // Token validation events
    TokenValidationSucceeded : 'TokenValidationSucceeded',
    TokenValidationFailedInvalidParams : 'TokenValidationFailedInvalidParams',
    TokenValidationFailedInvalidDigest : 'TokenValidationFailedInvalidDigest',
    TokenValidationFailedIncorrectSHA : 'TokenValidationFailedIncorrectSHA',

    // Application lifecycle events
    AppStarted : 'AppStarted',
    AppShutdownRequested : 'AppShutdownRequested',

    // Test event
    NoOp : 'NoOp'
});

// console.log(Events.TokenVal --> auto-complete
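Plain JavaScript has no compile step, so a misspelled name such as `Events.TokenValidationSucceded` evaluates to `undefined` instead of failing. One possible way to get closer to compile-time checking is to wrap the event list in a `Proxy` that throws on unknown names. The helper `strictEvents` below is a hypothetical name used only for this sketch, and the approach is one option among several (TypeScript or a linter would be others).

```javascript
// Sketch only: `strictEvents` is a hypothetical helper, not part of any library.
function strictEvents(definitions) {
    return new Proxy(Object.freeze({ ...definitions }), {
        get(target, name) {
            // Symbol lookups come from the runtime, so pass them through.
            if (typeof name === 'string' && !(name in target)) {
                throw new ReferenceError(`Unknown event: ${name}`);
            }
            return target[name];
        }
    });
}

const Events = strictEvents({
    TokenValidationSucceeded: 'TokenValidationSucceeded',
    TokenValidationFailedInvalidDigest: 'TokenValidationFailedInvalidDigest'
});

console.log(Events.TokenValidationSucceeded);

try {
    // Misspelled on purpose: throws instead of logging `undefined`.
    console.log(Events.TokenValidationSucceded);
} catch (err) {
    console.log(err.message);
}
```

The check still happens at run time, when the name is first read, so it complements code-completion rather than replacing a real compiler.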

[Screenshot: code-completion with Events]

See examples of Event definitions:

C#: ApplicationEvents.cs
Node.js: appEvents.js

Service-scoped events

It is very useful to be able to search for similar events across multiple services, especially in large, distributed systems with multiple teams and services. Searching for *FailedToConnect* in a log search tool to find all service connection failures is a powerful observability technique.

However, avoid the temptation to create a single, cross-team library containing all possible events; this couples services together and creates blocking dependencies between teams. Instead, use service-scoped (or team-scoped) event names. For example, the Payments team may have this set of events defined:

// PaymentsService events

const Events = Object.freeze({
    PaymentsUndefinedError : 'PaymentsUndefinedError',
    PaymentsFailedToConnectToDatabase : 'PaymentsFailedToConnectToDatabase',
    PaymentsUnexpectedTokenInParseStream : 'PaymentsUnexpectedTokenInParseStream',
});

// console.log(Events.PaymentsFai --> auto-complete

The License team may have this set of events defined:

// LicenseService events

const Events = Object.freeze({
    LicenseUndefinedError : 'LicenseUndefinedError',
    LicenseFailedToConnectToDatabase : 'LicenseFailedToConnectToDatabase',
    LicenseUnexpectedTokenInParseStream : 'LicenseUnexpectedTokenInParseStream',
});

// console.log(Events.LicenseFai --> auto-complete

We can still search for *UnexpectedToken* events across services when necessary, but without the need for a shared library dependency.
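To make the cross-service search concrete, here is a minimal sketch that filters a handful of log lines by a shared fragment of the event name. The log lines, their timestamp and key=value layout, and the `findEvents` helper are all made up for illustration; a real log search tool would do the equivalent substring match for you.

```javascript
// Sketch only: the log lines and `findEvents` are made up for illustration.
const logLines = [
    '2021-03-01T10:00:00Z PaymentsFailedToConnectToDatabase host=db1',
    '2021-03-01T10:00:02Z LicenseUnexpectedTokenInParseStream offset=42',
    '2021-03-01T10:00:05Z PaymentsUnexpectedTokenInParseStream offset=7',
    '2021-03-01T10:00:09Z LicenseFailedToConnectToDatabase host=db2'
];

// Return every line whose text contains the given fragment.
function findEvents(lines, fragment) {
    return lines.filter((line) => line.includes(fragment));
}

// A shared fragment finds the same kind of event in both services.
const matches = findEvents(logLines, 'UnexpectedToken');
matches.forEach((line) => console.log(line));
```

Searching for the full name, for example `PaymentsUnexpectedTokenInParseStream`, narrows the result to a single service.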

More detailed examples:

C#: CSharp-example.cs
Node.js: NodeJS-example.js

Team Guide to Software Operability by Matthew Skelton, Alex Moore, and Rob Thatcher (2019) 📙
