/fullstack-checklist

Engineering checklist for fullstack

Overview

A checklist of tech considerations when designing fullstack architecture with SRE in mind.

About me

Since the list is heavily biased towards my tech background, here's a summary of it.

Philosophy

Governance structure

Governance is concerned with

  • Ownership of components (e.g., products, services, pieces of infrastructure, codebases)
  • Communication patterns across teams/components
  • Technical guidance and conflict resolution

The goal is to build a lean and efficient engineering team to support business aspirations (unless engineering is the business, the priority should be the business).

Law of engineering: the engineering team should not scale linearly with business growth.

Frontend

Web

Apps

Backend for frontend (BFF)

One BFF per frontend app.

Usages:

  • Resource intensive tasks
  • Complex query/aggregation
  • Protecting FE from unstable API changes
  • OAuth 2 authorization code flow
  • Special protocol (e.g., websocket)
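
As a rough illustration of the complex query/aggregation use case, here is a minimal Go sketch of a BFF handler that fans out to two hypothetical backend services (the URLs, paths and response shapes are assumptions) and returns a single payload shaped for one frontend screen:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// profilePage is a view model tailored to one frontend screen.
type profilePage struct {
	User   json.RawMessage `json:"user"`
	Orders json.RawMessage `json:"orders"`
}

// profileHandler aggregates two backend calls so the frontend makes one request.
func profileHandler(w http.ResponseWriter, r *http.Request) {
	id := r.URL.Query().Get("id")
	user, err := fetch("http://user-service/users/" + id)
	if err != nil {
		http.Error(w, "user service unavailable", http.StatusBadGateway)
		return
	}
	orders, err := fetch("http://order-service/orders?userId=" + id)
	if err != nil {
		http.Error(w, "order service unavailable", http.StatusBadGateway)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(profilePage{User: user, Orders: orders})
}

func fetch(url string) (json.RawMessage, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	var raw json.RawMessage
	return raw, json.NewDecoder(resp.Body).Decode(&raw)
}

func main() {
	http.HandleFunc("/bff/profile", profileHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```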

Backend

Shared common library

The philosophy is to abstract common functionality into a shared layer for better governance and easier upgrades.

  • Middleware
  • Context: common fields across all request flows (e.g., trace ID, caller ID)
  • Common data types (e.g., timestamp, currency, coordinates, country code, error codes)
  • Authentication
  • Configuration
  • Cross service communication
    • Protocol abstraction: the underlying protocol should be hidden from the user. This allows easier protocol upgrades (e.g., HTTP to gRPC)
    • Service discovery: services should be addressable by a name.
  • Handlers base class (e.g., message handler, request handler)
  • DAO base class
  • Cache library
  • Logging
  • Metrics
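
As a minimal sketch of the context and middleware items above, the shared library could expose middleware that stamps a trace ID and caller ID onto every request (the header and field names here are assumptions):

```go
package middleware

import (
	"context"
	"crypto/rand"
	"encoding/hex"
	"net/http"
)

type ctxKey string

const (
	traceIDKey  ctxKey = "traceID"
	callerIDKey ctxKey = "callerID"
)

// WithRequestContext is shared middleware: every service imports it, so trace
// and caller IDs are populated the same way across all request flows.
func WithRequestContext(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		traceID := r.Header.Get("X-Trace-Id")
		if traceID == "" {
			traceID = newTraceID() // start a new trace at the edge
		}
		ctx := context.WithValue(r.Context(), traceIDKey, traceID)
		ctx = context.WithValue(ctx, callerIDKey, r.Header.Get("X-Caller-Id"))
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// TraceID lets handlers, the logging library and the metrics library read the
// trace ID without knowing how it was propagated.
func TraceID(ctx context.Context) string {
	id, _ := ctx.Value(traceIDKey).(string)
	return id
}

func newTraceID() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}
```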

Configuration and secrets

  • Configuration files should live in the same repository as code.
  • Secrets should go into a secret manager. Deployment infrastructure should inject the secrets at run time.
  • Allow secret overriding using environment variables (for running locally).
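
A minimal sketch of this setup, assuming a JSON config file in the repo and a DB_PASSWORD environment variable injected by the deployment infrastructure (or set by hand for local runs):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

type Config struct {
	ListenAddr string `json:"listen_addr"` // lives in the repo, next to the code
	DBPassword string `json:"-"`           // never committed; injected at run time
}

// Load reads the checked-in config file, then lets environment variables
// override secrets (set by the secret manager in deployment, or by hand locally).
func Load(path string) (Config, error) {
	var cfg Config
	data, err := os.ReadFile(path)
	if err != nil {
		return cfg, err
	}
	if err := json.Unmarshal(data, &cfg); err != nil {
		return cfg, err
	}
	if v, ok := os.LookupEnv("DB_PASSWORD"); ok {
		cfg.DBPassword = v
	}
	return cfg, nil
}

func main() {
	cfg, err := Load("config.json")
	if err != nil {
		fmt.Fprintln(os.Stderr, "config:", err)
		os.Exit(1)
	}
	fmt.Println("listening on", cfg.ListenAddr)
}
```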

Feature flag

Definition: a feature flag is a toggle that a program uses to decide its behaviour. This is useful when rolling out new features gradually.

A feature flag system has these concepts:

  • Flag value (aka. toggle value): the value that a program gets for a feature flag
  • Rule: a rule is associated with a flag, and it maps parameters to a flag value. E.g., a rule can map all users with age < 18 to flag value under-18.
  • Feature management service: service that stores flags and rules. This service has admin APIs or UI to configure the flags and rules. E.g., LaunchDarkly.
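
A minimal sketch of the rule concept, assuming hard-coded in-memory rules rather than a real feature management service such as LaunchDarkly:

```go
package main

import "fmt"

// Params are the attributes a rule can inspect when resolving a flag.
type Params struct {
	UserID string
	Age    int
}

// Rule maps parameters to a flag value; matched reports whether the rule applies.
type Rule func(p Params) (value string, matched bool)

// Evaluate returns the value of the first matching rule, or the default value.
func Evaluate(rules []Rule, p Params, defaultValue string) string {
	for _, r := range rules {
		if v, ok := r(p); ok {
			return v
		}
	}
	return defaultValue
}

func main() {
	rules := []Rule{
		// e.g., map all users with age < 18 to the flag value "under-18"
		func(p Params) (string, bool) { return "under-18", p.Age < 18 },
	}
	fmt.Println(Evaluate(rules, Params{UserID: "u1", Age: 16}, "default")) // prints "under-18"
}
```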

Make an effort to keep the number of feature flags low.

Make a plan to remove unnecessary feature flags. A feature flag is no longer needed if 100% of the traffic is using the new feature.

Logging

  • Structured logging: variables get their own fields, log messages are static strings
  • What to log
    • Timestamp
    • Trace ID
    • Caller ID
    • Service
    • Environment
  • When to log
    • Service starts. Log configurations.
    • Service crashes
    • Assertions fail (a code path that should never happen)
    • Errors are handled
    • Log at least one message per happy path
  • When not to log:
    • Error propagation without handling
    • Normal code flow
    • Duplicates
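
A minimal sketch of structured logging using Go's standard log/slog package: the messages are static strings and every variable becomes a field (the field values are illustrative):

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// JSON output so the log aggregator can index each field.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Service starts: static message, configuration as fields.
	logger.Info("service started",
		"service", "order-api",
		"environment", "staging",
		"trace_id", "abc123",
		"caller_id", "checkout-web",
	)

	// An error is handled here (not merely propagated), so it is logged here.
	logger.Error("failed to reserve inventory",
		"trace_id", "abc123",
		"order_id", 42,
		"error", "timeout talking to inventory-service",
	)
}
```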

Serverless

Pros

  • On demand (cost saving)
  • No infrastructure
  • No single point of failure
  • High scalability

Cons

  • Not suitable for long running tasks (due to timeout)
  • Not suitable for resource intensive tasks
  • Not suitable for programs with local persistence (e.g., memory cache)
  • Reactive (i.e., requires external triggers to run); not suitable for proactive tasks (e.g., periodic notifications, heartbeats)
  • Complex workflow with multiple functions needs careful orchestration (e.g., step function)
  • Logs need different shipping mechanism (because you don't control the VM and cannot install log aggregation daemons)

Data

Ownership

  • The owner should be the writer
  • There should be only one writer to a data set (the moderator)
  • The owner should provide libraries for reading the data. The libraries should hide the low level details of how data is retrieved (e.g., directly from DB, or via the owner).
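
A minimal sketch of such a read library: consumers depend on an interface shipped by the data's owner, so whether reads hit the DB directly or go via the owner's API can change without affecting them (all names here are illustrative):

```go
package orders

import "context"

// Order is the data set owned by the (hypothetical) orders team.
type Order struct {
	ID     string
	Status string
}

// Reader is what other teams code against; they never touch the table directly.
type Reader interface {
	Get(ctx context.Context, id string) (Order, error)
}

// apiReader is one implementation the owner might ship: reads go through the
// owner's service API. A reader backed by a DB replica would satisfy the same
// interface, and consumers would not need to change.
type apiReader struct {
	baseURL string
}

func (r apiReader) Get(ctx context.Context, id string) (Order, error) {
	// ... call the owner's API and decode the response (omitted) ...
	return Order{}, nil
}

// NewReader hides which implementation is in use.
func NewReader() Reader {
	return apiReader{baseURL: "http://order-service"}
}
```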

Versioning

  • Data should have created and updated timestamps
  • If multiple data versions can co-exist, several strategies:
    • New table: good for isolation, bad for management (especially if tables are created by CD pipeline)
    • A version column: good for management, bad for indexing and possible hot partition.
  • If old version needs to be migrated to new version, consider a tool like AWS Data Pipeline.
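
A minimal sketch of the version-column strategy, carrying the created/updated timestamps mentioned above (the field names are assumptions):

```go
package catalog

import "time"

// The version-column strategy: rows of different data versions co-exist in one
// table, distinguished by a Version field and audited via timestamps.
type PriceRow struct {
	SKU       string
	Version   int       // e.g., versions 1 and 2 co-exist during a migration
	Price     int64     // minor currency units
	CreatedAt time.Time // created timestamp
	UpdatedAt time.Time // updated timestamp
}
```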

Database

Types of database

  • SQL
    • Pros
      • Join queries are easy
      • Transaction is easy
    • Cons
      • Schema change is hard
      • Usually poor scalability
  • NoSQL
    • Pros
      • Schema change is easy
      • Easy to scale
    • Cons
      • Can only query on indexes
      • No joining, making application code complex
      • Limited transaction support
      • Bad index design can result in hot partitions

SQL DBs usually scale compute and storage together, which can be wasteful. An exception is AWS Aurora, which scales them independently.

NoSQL uses sharding to achieve high scalability.

See Concurrency for a discussion of transactions and data consistency.

Best practices

  • For microservices, DB should be treated more like working memory than long term source of truth (which should be your data warehouse instead).
  • Prefer NoSQL over SQL.
  • Avoid ORMs (e.g., Hibernate, SQLAlchemy). They make your code bloated, less clear, and more fragile.
  • Always define a DAO layer in the application to expose an interface customised to the business logic. This reduces coupling of business logic to the DB, and improves testability.
  • Implement data (un)marshalling in the DAO.
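
A minimal sketch of a DAO without an ORM: the interface is phrased in business terms, (un)marshalling lives inside it, and tests can swap in a fake (the table and field names are assumptions):

```go
package payments

import (
	"context"
	"database/sql"
	"time"
)

type Payment struct {
	ID          string
	AmountMinor int64 // money as minor units, not floats
	CreatedAt   time.Time
}

// PaymentDAO is the interface the business logic depends on; tests can swap in
// a fake, and the underlying DB can change without touching callers.
type PaymentDAO interface {
	Record(ctx context.Context, p Payment) error
	ByID(ctx context.Context, id string) (Payment, error)
}

type sqlPaymentDAO struct{ db *sql.DB }

func NewPaymentDAO(db *sql.DB) PaymentDAO { return sqlPaymentDAO{db: db} }

func (d sqlPaymentDAO) Record(ctx context.Context, p Payment) error {
	// Marshalling to the storage representation happens here, not in business code.
	_, err := d.db.ExecContext(ctx,
		"INSERT INTO payments (id, amount_minor, created_at) VALUES ($1, $2, $3)",
		p.ID, p.AmountMinor, p.CreatedAt)
	return err
}

func (d sqlPaymentDAO) ByID(ctx context.Context, id string) (Payment, error) {
	var p Payment
	err := d.db.QueryRowContext(ctx,
		"SELECT id, amount_minor, created_at FROM payments WHERE id = $1", id).
		Scan(&p.ID, &p.AmountMinor, &p.CreatedAt)
	return p, err
}
```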

Query

When complex queries (multi-field, conditional filtering, pagination, sorting, etc.) are needed, it's best to keep the indexes in a search engine.

This also makes it possible to use a simple DB (e.g., NoSQL).

Examples:

Pagination
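
One possible approach, sketched here as keyset (cursor-based) pagination against a hypothetical products table; offset/limit pagination also works but degrades on large offsets:

```go
package catalog

import (
	"context"
	"database/sql"
)

type Product struct {
	ID   int64
	Name string
}

// ListAfter returns one page using keyset (cursor) pagination: the cursor is the
// last ID of the previous page, so the query stays fast on large tables because
// it only walks the primary-key index.
func ListAfter(ctx context.Context, db *sql.DB, cursor int64, limit int) ([]Product, int64, error) {
	rows, err := db.QueryContext(ctx,
		"SELECT id, name FROM products WHERE id > $1 ORDER BY id LIMIT $2",
		cursor, limit)
	if err != nil {
		return nil, cursor, err
	}
	defer rows.Close()

	var page []Product
	next := cursor
	for rows.Next() {
		var p Product
		if err := rows.Scan(&p.ID, &p.Name); err != nil {
			return nil, cursor, err
		}
		page = append(page, p)
		next = p.ID // the caller passes this back as the cursor for the next page
	}
	return page, next, rows.Err()
}
```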

Data warehouse

The data is usually structured, with repeated and nested fields (e.g., JSON, YAML).

The DW therefore needs to handle them correctly. Columnar storage following the Google Dremel whitepaper is ideal (e.g., the Parquet format).

What goes into DW

  • Transactional data and event sourcing: model data changes as events, and store the events in the DW. Use cases:
    • user activity analysis
    • trend detection
    • usage tracking
  • Snapshot data: point-in-time data. Use cases:
    • account balance
    • inventory stock level

Snapshot data may be collected in several ways:

  • exported from service DB
  • constructed by playing back transactional data over the last snapshot

Best practices

  • Have a data pipeline architecture as part of infrastructure.
  • Define schema with version for all data types, with validation rules
  • Validate incoming data before storing
  • Common schema fields:
    • Timestamp
    • Trace ID
    • Caller ID
    • Service
    • Environment
    • Dedupe ID
    • Is-test flag (without this, test and real data are mingled and it's painful to separate them later)
  • Don't serve data from DW directly. Instead, use a pipeline to ETL the data into a service, then serve it using APIs.
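
A minimal sketch of a versioned event envelope carrying the common fields above (the field names and JSON layout are assumptions):

```go
package events

import "time"

// Envelope is a common, versioned wrapper shared by every event; the payload is
// validated against its schema version before the event is stored.
type Envelope struct {
	SchemaVersion string    `json:"schema_version"` // e.g., "order.created/v2"
	Timestamp     time.Time `json:"timestamp"`
	TraceID       string    `json:"trace_id"`
	CallerID      string    `json:"caller_id"`
	Service       string    `json:"service"`
	Environment   string    `json:"environment"`
	DedupeID      string    `json:"dedupe_id"` // lets the pipeline drop retries
	IsTest        bool      `json:"is_test"`   // keeps test data separable from real data
	Payload       any       `json:"payload"`
}
```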

Caching

  • Cache eviction: prevents out-of-memory issues. There are a number of strategies, with LRU being the most popular.
  • Cache invalidation
    • time-to-live (TTL)
    • Event driven invalidation
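
A minimal sketch combining the two ideas: LRU eviction bounds memory use, and a TTL invalidates stale entries (the capacity, TTL and single-threaded design are assumptions; a real cache would add locking):

```go
package cache

import (
	"container/list"
	"time"
)

type entry struct {
	key     string
	value   any
	expires time.Time
}

// Cache is an LRU cache with a per-entry TTL: eviction bounds memory use, and
// the TTL invalidates stale data. Not safe for concurrent use as written.
type Cache struct {
	capacity int
	ttl      time.Duration
	order    *list.List               // front = most recently used
	items    map[string]*list.Element // key -> element in order
}

func New(capacity int, ttl time.Duration) *Cache {
	return &Cache{capacity: capacity, ttl: ttl, order: list.New(), items: map[string]*list.Element{}}
}

func (c *Cache) Get(key string) (any, bool) {
	el, ok := c.items[key]
	if !ok {
		return nil, false
	}
	e := el.Value.(*entry)
	if time.Now().After(e.expires) { // TTL invalidation
		c.order.Remove(el)
		delete(c.items, key)
		return nil, false
	}
	c.order.MoveToFront(el) // mark as recently used
	return e.value, true
}

func (c *Cache) Set(key string, value any) {
	if el, ok := c.items[key]; ok {
		e := el.Value.(*entry)
		e.value = value
		e.expires = time.Now().Add(c.ttl)
		c.order.MoveToFront(el)
		return
	}
	if c.order.Len() >= c.capacity { // LRU eviction
		oldest := c.order.Back()
		c.order.Remove(oldest)
		delete(c.items, oldest.Value.(*entry).key)
	}
	c.items[key] = c.order.PushFront(&entry{key: key, value: value, expires: time.Now().Add(c.ttl)})
}
```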

Tracing

Protocols and communication patterns

Types of communication

API definitions

SDKs for different languages can be generated from API definitions.

API design

  • Implement API gateway to handle:
    • API routing
    • Protocol translation (e.g., REST to Protobuf)
    • Authentication
    • Logging/metrics
    • Usage auditing
  • Limit the usage of polymorphic payloads (if the payload differs in structure, it's better to make it a different API)
  • Error response is part of the design, not an afterthought.
  • Standardise and regulate the use of error codes. Adhere to HTTP status code definitions.
  • Treat HTTP 5xx status as system failures that require intervention (i.e., don't use them lightly).
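
A minimal sketch of a standardised error response: one shape for every API, an application error code from a regulated list, and an HTTP status chosen to match (the code values are illustrative):

```go
package api

import (
	"encoding/json"
	"net/http"
)

// ErrorResponse is the single error shape every API returns.
type ErrorResponse struct {
	Code    string `json:"code"`     // from a regulated list, e.g., "ORDER_NOT_FOUND"
	Message string `json:"message"`  // human-readable, safe to show to callers
	TraceID string `json:"trace_id"` // lets support correlate the error with logs
}

// WriteError writes a standard error body with an HTTP status that matches the
// application error code; 5xx is reserved for genuine system failures.
func WriteError(w http.ResponseWriter, status int, code, message, traceID string) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	json.NewEncoder(w).Encode(ErrorResponse{Code: code, Message: message, TraceID: traceID})
}

// Usage: WriteError(w, http.StatusNotFound, "ORDER_NOT_FOUND", "no such order", traceID)
```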

Service discovery

This ensures that services and APIs are addressable in the infrastructure (by a unique and stable name).

  • Overlay networks
  • Address: an abstract concept of where data should be sent, e.g., IP address
  • Routers: interpret the address and send traffic to the correct endpoint
  • DNS server: specialised service that resolves service name to address

Failure and recovery

Access control

Multi tenancy

This is relevant for SaaS systems, where multiple users/customers/partners share the same application and infrastructure but not data.

Data segregation is the most common technique used.

Infrastructure segregation does not scale well.
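
A minimal sketch of data segregation at the row level, assuming a tenant_id column on every table and a query layer that always filters on it:

```go
package tenancy

import (
	"context"
	"database/sql"
)

// Data segregation via a tenant_id column: every table keys rows by tenant and
// every query filters on it, so one tenant can never read another tenant's data.
func ListInvoices(ctx context.Context, db *sql.DB, tenantID string) (*sql.Rows, error) {
	return db.QueryContext(ctx,
		"SELECT id, amount_minor FROM invoices WHERE tenant_id = $1", tenantID)
}
```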

Testing

https://martinfowler.com/articles/practical-test-pyramid.html

Principles

  • Test the right thing, at the right level (API level > unit level > whole system level). E.g., the DAO (presumably having more complexity) deserves more testing than HTTP handlers.
  • Aim for quality, not coverage. E.g., 90% DAO test coverage with mocked DB isn't better than 70% with real DB.
  • Higher level tests should be more general, lower level tests should be more specific (e.g., cover edge cases)
  • You can't cover everything in test, but you can make sure you know how to fix it when it breaks (e.g., with good monitoring/logging)
  • Use BDD style (i.e., structure test as a scenario) but not BDD itself (i.e., don't do scenario-to-code translation)
  • Make test data identifiable. Test data should never interfere with real data.
  • Don't be afraid to run tests in production. This requires building the application with testability in mind.
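
A minimal sketch of "BDD style but not BDD itself": the test reads as given/when/then and the test data is clearly identifiable, but there is no scenario-to-code translation layer (the Payment type and Refund function are stand-ins defined inline for the example):

```go
package payments

import (
	"errors"
	"testing"
)

type Payment struct {
	ID     string
	Status string
}

// Refund is a stand-in for the code under test.
func Refund(p *Payment) error {
	if p.Status == "refunded" {
		return errors.New("payment already refunded")
	}
	p.Status = "refunded"
	return nil
}

func TestRefundRejectedWhenAlreadyRefunded(t *testing.T) {
	// Given a payment that has already been refunded
	// (the "test-" prefix keeps the test data identifiable)
	p := Payment{ID: "test-payment-1", Status: "refunded"}

	// When a second refund is requested
	err := Refund(&p)

	// Then the refund is rejected and the status is unchanged
	if err == nil {
		t.Fatal("expected an error for a double refund")
	}
	if p.Status != "refunded" {
		t.Fatalf("status changed to %q", p.Status)
	}
}
```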

Load test

  • Scenario description
    • Number of users
    • Scenario of each user
    • Ramp up/cool down period
  • Metrics
    • Response time
    • Throughput
    • Error rate
  • Monitoring: make sure the load generator itself isn't the bottleneck, by monitoring its CPU/memory/network
  • Scaling: run multiple instances, and aggregate the logs.
  • Tools
    • Gatling: open source, written in Scala, good report UI, own DSL
    • JMeter: open source, written in Java, hard to configure
    • Locust: open source, written in Python
    • NeoLoad: commercial
    • BlazeMeter: commercial
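
For illustration only, a toy Go load generator showing the scenario and metric concepts above (number of users, per-user scenario, response time, throughput, error rate); a real load test should use one of the tools listed, and the target URL here is an assumption:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	const users = 50           // number of users
	const requestsPerUser = 20 // each user's scenario: 20 sequential GETs
	target := "http://localhost:8080/healthz"

	var errs, total atomic.Int64
	var totalLatency atomic.Int64 // nanoseconds

	start := time.Now()
	var wg sync.WaitGroup
	for u := 0; u < users; u++ {
		wg.Add(1)
		go func() { // each goroutine plays one user's scenario
			defer wg.Done()
			for i := 0; i < requestsPerUser; i++ {
				t0 := time.Now()
				resp, err := http.Get(target)
				totalLatency.Add(int64(time.Since(t0)))
				total.Add(1)
				if err != nil || resp.StatusCode >= 500 {
					errs.Add(1)
				}
				if err == nil {
					resp.Body.Close()
				}
			}
		}()
	}
	wg.Wait()

	elapsed := time.Since(start).Seconds()
	fmt.Printf("throughput: %.1f req/s\n", float64(total.Load())/elapsed)
	fmt.Printf("mean response time: %s\n", time.Duration(totalLatency.Load()/total.Load()))
	fmt.Printf("error rate: %.1f%%\n", 100*float64(errs.Load())/float64(total.Load()))
}
```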

Code hygiene

  • Limit scope of variables
  • Consistent naming
  • Declare constants at top
  • Always use a linter and integrate it into CI
  • Encourage the use of IDEs
  • Reproducible builds: Use a package manager that can lock dependency versions

DevOps

Philosophy

Centralisation

Components with shared ownership should be considered a piece of infrastructure, and managed in a single place (instead of distributed across repositories/codebases).

Examples:

  • Service API definitions (service provider != API owner)
  • Message schema definitions
  • Documentation
  • Data in data warehouse

Tooling

CLI and scripts should be the preferred way of automation.

They should be well documented, versioned and published for easy installation. Example: goreleaser

CI/CD

Deployment

Monitoring

  • Healthcheck endpoints for long running services
  • Tracing
  • Service dependency graph based on traffic and healthcheck. This makes service upgrade/decommission safer
  • Service metrics and dashboard
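
A minimal sketch of a healthcheck endpoint for a long-running service (the path, service name and checks are assumptions):

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
)

// healthz reports liveness plus the status of critical dependencies, giving the
// orchestrator and the dependency graph something to poll.
func healthz(w http.ResponseWriter, r *http.Request) {
	status := map[string]string{
		"service":  "order-api",
		"status":   "ok",
		"database": "ok", // in a real service, ping the DB here
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(status)
}

func main() {
	http.HandleFunc("/healthz", healthz)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```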

QA

QA is more than writing tests

QA is part of DevOps, not a separate team

Responsibility

  • Provide tooling/library/framework/process to make low level testing self-serviced by developers (unit tests, component tests, load tests).
  • Develop and own end user and high level tests, from an organisation or company perspective.
  • Test automation, reducing manual intervention.
  • Standardise test methodology across teams.
  • Reduce noise from fragile tests, false positives, and long-known bugs, to prevent distraction and increase sensitivity to true positives across the company.

Theories of computing

Complexity of algorithms

Concurrency

Data science and machine learning

TBC