Overview
About me
Philosophy
Governance structure
Frontend
Backend
Data
Caching
Tracing
Protocols and communication patterns
Access control
- Multi tenancy
Testing
DevOps
- Philosophy
- Centralisation
- Tooling
- CI/CD
- Deployment
- Monitoring
QA
- Responsibility
Theories of computing
- Complexity of algorithms
- Concurrency
Data science and machine learning

Overview

A checklist of tech considerations when designing fullstack architecture with SRE in mind.

About me

Since the list is heavily biased towards my tech background, here's a summary of it.

Frontend: TypeScript, React, Vue, Angular, Jekyll/Hugo/WordPress
Backend: Go, Java (SE 5 - 14), Node.js, Scala/Akka, Python 3/Flask/gunicorn, Haskell, C, Ruby
Scripting: Bash, Python 3, Perl
Cloud: AWS 90%, GCP 10%
Virtualisation: VirtualBox/vmware, docker, kubernetes
CI/CD: Circle CI, GitHub Actions, Concourse, Airflow, Jenkins
DB/Big data: PostgreSQL/MySQL/Aurora/Oracle, Cassandra/DynamoDB, MongoDB/Firebase, BigQuery/Snowflake/Parquet (storage format)/Spark+Streaming, Prometheus, Neo4j
Middleware: Kafka/Kinesis/SQS/RabbitMQ/PubSub/ZeroMQ, Redis, Fluentd, Apigee/Kong/Tyk/KrakenD, Auth0, GraphQL, LaunchDarkly
Serverless: AWS Lambda, Cloud Function, Step Functions

Philosophy

Convention over configuration
Consistency over creativity (i.e., least surprise)
Single source of truth (SSOT)
Batteries included
Full configurability, with a default setting that works 90% of the time.
Configuration as code
Infrastructure as code
Docs as code
Automate yourself out of a job
Driven by empathy, not ego (fancy features/algorithms never beat a good user experience)
Centralisation isn't evil, chaos is
Simplicity is the ultimate sophistication
Exceptions are not exceptional, they're part of the system and part of the story
Make choices based on problem, not on hype or bias

Governance structure

Governance is concerned with

Ownership of components (e.g., products, services, pieces of infrastructure, code)
Communication patterns across teams/components
Technical guidance and conflict resolution

The goal is to build a lean and efficient engineering team to support business aspirations (unless engineering is the business, the priority should be business).

Law of engineering: the engineering team should not scale linearly with business growth.

Frontend

Web

Language: TypeScript + ES6. Full stop.
Frameworks:
- React >= 16.8 (the first release with hooks), using functional components
- Get started with create-react-app
- hooks and context, and when it gets too difficult, redux
- Jest > Mocha
- TestCafe > Cypress > Selenium
Shared common library. bit.dev has good UI and doubles as a package repo. lerna may be hard to use.
yarn > npm
webpack for bundling
GraphQL for data querying
axios for HTTP requests
Single sign-on (SSO)
Progressive web apps (PWA)
Deep linking: a page can be addressed and shared by a link

Apps

ReactNative
Flutter
Native language (Java or Swift)

Backend for frontend (BFF)

One BFF per frontend app.

Usages:

Resource intensive tasks
Complex query/aggregation
Protecting FE from unstable API changes
OAuth 2 authorization code flow
Special protocol (e.g., websocket)

Backend

Follow the 12 factor app
SOLID
Coordination with leader election

Shared common library

The philosophy is to abstract common functionality into a shared layer for better governance and easier upgrades.

Middleware
Context: common fields across all request flows (e.g., trace ID, caller ID)
Common data types (e.g., timestamp, currency, coordinates, country code, error codes)
Authentication
Configuration
Cross service communication
- Protocol abstraction: the underlying protocol used should be hidden from user. This allows easier protocol updates (e.g., HTTP to gRPC)
- Service discovery: services should be addressable by a name.
Handlers base class (e.g., message handler, request handler)
DAO base class
Cache library
Logging
Metrics
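
The context item in the list above can be sketched as follows. This is a minimal illustration, assuming the trace ID and caller ID travel in HTTP headers; the header names, `SERVICE_NAME` and all function names are hypothetical and not taken from any particular framework.

```python
import contextvars
import uuid
from typing import Dict, Mapping

# Header names and the service name are assumptions made for this sketch.
TRACE_HEADER = "X-Trace-Id"
CALLER_HEADER = "X-Caller-Id"
SERVICE_NAME = "orders"

_trace_id = contextvars.ContextVar("trace_id", default="")
_caller_id = contextvars.ContextVar("caller_id", default="unknown")


def begin_request(headers: Mapping[str, str]) -> str:
    """Store the common fields of an incoming request in the current context."""
    trace_id = headers.get(TRACE_HEADER) or uuid.uuid4().hex
    _trace_id.set(trace_id)
    _caller_id.set(headers.get(CALLER_HEADER, "unknown"))
    return trace_id


def current_context() -> Dict[str, str]:
    """Fields that logging and metrics code can attach to every record."""
    return {"trace_id": _trace_id.get(), "caller_id": _caller_id.get()}


def outgoing_headers() -> Dict[str, str]:
    """Headers to add to calls to other services: same trace, new caller."""
    return {TRACE_HEADER: _trace_id.get(), CALLER_HEADER: SERVICE_NAME}
```

Keeping this in the shared library means every service propagates the same fields in the same way, which is what makes traces and logs joinable later.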

Configuration and secrets

Configuration files should live in the same repository as code.
Secrets should go into a secret manager. Deployment infrastructure should inject the secrets at run time.
Allow secret overriding using environment variables (for running locally).
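
A minimal sketch of the override rule, assuming the deployment infrastructure hands the service a mapping of injected secrets. The `APP_SECRET_` prefix and the function name are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_secret(
    name: str,
    injected: Mapping[str, str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Return a secret, preferring an environment variable override.

    `injected` stands for whatever the secret manager provided at run time.
    The APP_SECRET_ prefix is an assumption of this sketch.
    """
    env = os.environ if env is None else env
    env_key = "APP_SECRET_" + name.upper()
    if env_key in env:
        return env[env_key]  # local override, e.g. when running on a laptop
    return injected[name]
```

Running locally with `APP_SECRET_DB_PASSWORD=dev` set would then bypass the secret manager for that one value only.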

Feature flag

Definition: a feature flag is a toggle that a program uses to decide its behaviour. This is useful when rolling out new features gradually.

A feature flag system has these concepts:

Flag value (aka. toggle value): the value that a program gets for a feature flag
Rule: a rule is associated with a flag, and it maps parameters to a flag value. E.g., a rule can map all users with age < 18 to flag value under-18.
Feature management service: service that stores flags and rules. This service has admin APIs or UI to configure the flags and rules. E.g., LaunchDarkly.
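
The flag and rule concepts above can be sketched in a few lines. This is an illustration only: the class names are hypothetical, first-match-wins is an assumed evaluation order, and a real feature management service such as LaunchDarkly has its own API and semantics.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Rule:
    """Maps request parameters to a flag value when the predicate matches."""
    predicate: Callable[[Dict[str, Any]], bool]
    value: str


@dataclass
class Flag:
    name: str
    default: str
    rules: List[Rule] = field(default_factory=list)

    def evaluate(self, params: Dict[str, Any]) -> str:
        # The first matching rule wins; otherwise use the default value.
        for rule in self.rules:
            if rule.predicate(params):
                return rule.value
        return self.default


# Hypothetical flag mirroring the "age < 18" example above.
audience = Flag(
    name="audience",
    default="adult",
    rules=[Rule(lambda p: "age" in p and p["age"] < 18, "under-18")],
)
```

Calling `audience.evaluate({"age": 16})` yields the flag value `under-18`, while a request without an age falls back to the default.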

Make an effort to keep the number of feature flags low.

Make a plan to remove unnecessary feature flags. A feature flag is no longer needed if 100% of the traffic is using the new feature.

Logging

Structured logging: variables get their own fields, log messages are static strings
What to log
- Timestamp
- Trace ID
- Caller ID
- Service
- Environment
When to log
- Service starts. Log configurations.
- Service crashes
- Assertions fail (a code path that should never happen)
- Errors are handled
- Log at least one message per happy path
When not to log:
- Error propagation without handling
- Normal code flow
- Duplicates
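
A minimal sketch of structured logging with the standard `logging` module, assuming JSON lines as the output format. The `JsonFormatter` class and the convention of passing variables under a `fields` key are choices made for this example, not a standard.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object; variables get their own fields."""

    def __init__(self, service: str, environment: str) -> None:
        super().__init__()
        self.service = service
        self.environment = environment

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),  # kept as a static string
            "service": self.service,
            "environment": self.environment,
        }
        # Variables are passed as extra={"fields": {...}} by the caller.
        entry.update(getattr(record, "fields", {}))
        return json.dumps(entry)
```

A call site would look like `logger.info("order created", extra={"fields": {"trace_id": trace_id, "order_id": order_id}})`, so the message stays constant and searchable while the variables land in their own fields.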

Serverless

Pros

On demand (cost saving)
No infrastructure
No single point of failure
High scalability

Cons

Not suitable for long running tasks (due to timeout)
Not suitable for resource intensive tasks
Not suitable for programs with local persistence (e.g., memory cache)
Reactive (i.e., requires external triggers to run), not suitable for pro-active tasks (e.g., periodic notification, heartbeating)
Complex workflows with multiple functions need careful orchestration (e.g., AWS Step Functions)
Logs need different shipping mechanism (because you don't control the VM and cannot install log aggregation daemons)

Data

Ownership

The owner should be the writer
There should be only one writer to a data set (the moderator)
The owner should provide libraries for reading the data. The libraries should hide the low level details of how data is retrieved (e.g., directly from DB, or via the owner).

Versioning

Data should have created and updated timestamps
If multiple data versions can co-exist, several strategies:
- New table: good for isolation, bad for management (especially if tables are created by CD pipeline)
- A version column: good for management, bad for indexing, and it can cause hot partitions.
If old version needs to be migrated to new version, consider a tool like AWS Data Pipeline.

Database

Types of database

SQL
- Pros
  - Join queries are easy
  - Transactions are easy
- Cons
  - Schema change is hard
  - Usually poor scalability
NoSQL
- Pros
  - Schema change is easy
  - Easy to scale
- Cons
  - Can only query on indexes
  - No joining, making application code complex
  - Limited transaction support
  - Bad index design can result in hot partitions

SQL DBs usually scale computing and storage together, which can be wasteful. An exception is AWS Aurora, which scales them independently.

NoSQL uses sharding to achieve high scalability.

See concurrency for discussion on transaction and data consistency.

Best practices

For microservices, DB should be treated more like working memory than long term source of truth (which should be your data warehouse instead).
Prefer NoSQL to SQL.
Avoid ORMs (e.g., Hibernate, SQLAlchemy). They make your code bloated, less clear and fragile.
Always define a DAO layer in application to expose an interface customised to the business logic. This reduces coupling of business logic to DB, and improves testability.
Implement data (un)marshalling in the DAO.
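
A minimal sketch of such a DAO layer. The `Account` type and both class names are hypothetical, and the in-memory dictionary stands in for a real database client.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, Optional, Protocol


@dataclass(frozen=True)
class Account:
    account_id: str
    owner: str
    balance_cents: int


class AccountDao(Protocol):
    """The interface the business logic depends on, shaped around its needs."""

    def save(self, account: Account) -> None: ...

    def find(self, account_id: str) -> Optional[Account]: ...


class InMemoryAccountDao:
    """Stand-in for a real store; rows are kept as serialised JSON strings."""

    def __init__(self) -> None:
        self._rows: Dict[str, str] = {}

    def save(self, account: Account) -> None:
        # Marshalling happens here, not in the business logic.
        self._rows[account.account_id] = json.dumps(asdict(account))

    def find(self, account_id: str) -> Optional[Account]:
        raw = self._rows.get(account_id)
        # Unmarshalling turns the stored row back into a domain object.
        return Account(**json.loads(raw)) if raw is not None else None
```

Business code that accepts an `AccountDao` can be tested against the in-memory version and run against a database-backed one without changes.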

Query

When complex queries (multi-field, conditional filters, pagination, sorting, etc.) are needed, it's best to keep the indexes in a search engine.

This also makes it possible to use a simple DB (e.g., NoSQL).

Examples:

Elasticsearch: index search
Solr: text search
Algolia: text search

Pagination

Offset vs cursor
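
A minimal sketch contrasting the two styles on an in-memory list, assuming rows are sorted by a unique, increasing `id`. The function names are hypothetical; a real implementation would push the same logic into the query.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, str]  # (id, name), assumed to be sorted by id


def page_by_offset(rows: List[Row], offset: int, limit: int) -> List[Row]:
    """Offset paging: simple, but positions shift when rows are added or removed."""
    return rows[offset:offset + limit]


def page_by_cursor(
    rows: List[Row], after_id: Optional[int], limit: int
) -> Tuple[List[Row], Optional[int]]:
    """Cursor paging: resume after the last id seen, whatever changed before it."""
    remaining = [r for r in rows if after_id is None or r[0] > after_id]
    page = remaining[:limit]
    # Only hand out a cursor when more rows exist beyond this page.
    next_cursor = page[-1][0] if len(remaining) > limit else None
    return page, next_cursor
```

If the first row is deleted between two requests, offset paging with `offset=2` silently skips a row, while the cursor version continues from the last id the client actually saw.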

Data warehouse

The data is usually structured, with repeated and nested fields (e.g., JSON, YAML).

The DW therefore needs to handle them correctly. Columnar storage following Google's Dremel paper is ideal (e.g., Parquet format).

What goes into DW

Transactional data and event sourcing: model data change as events, and store the events in DW. Use cases:
- user activity analysis
- trend detection
- usage tracking
Snapshot data: point-in-time data. Use cases:
- account balance
- inventory stock level

Snapshot data may be collected in several ways:

exported from service DB
constructed by playing back transactional data over last snapshot
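
The second option can be sketched as a fold over events. This is a minimal illustration, assuming account balances as the snapshot and signed amounts as the events; the `Event` shape and the `replay` name are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # (account_id, signed amount in cents)


def replay(snapshot: Dict[str, int], events: Iterable[Event]) -> Dict[str, int]:
    """Build a new snapshot by applying events that happened after the last one."""
    balances = dict(snapshot)  # copy, so the previous snapshot stays intact
    for account_id, delta in events:
        balances[account_id] = balances.get(account_id, 0) + delta
    return balances
```

The events must be the complete, ordered set since the previous snapshot, which is one reason a dedupe ID on each event is useful.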

Best practices

Have a data pipeline architecture as part of infrastructure.
Define schema with version for all data types, with validation rules
Validate incoming data before storing
Common schema fields:
- Timestamp
- Trace ID
- Caller ID
- Service
- Environment
- Dedupe ID
- Is it test? (without this, test and real data are mingled and it's painful to separate them later)
Don't serve data from DW directly. Instead, use a pipeline to ETL the data into a service, then serve it using APIs.
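
A minimal sketch of validating incoming records against the common schema fields before storing them. The snake_case field names and the `schema_version` field are assumptions of this example; a real pipeline would more likely use a schema language with its own validator.

```python
from typing import Any, Dict, List

# Hypothetical field names for the common schema fields listed above.
COMMON_FIELDS = {
    "schema_version": int,
    "timestamp": str,
    "trace_id": str,
    "caller_id": str,
    "service": str,
    "environment": str,
    "dedupe_id": str,
    "is_test": bool,
}


def validate(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record is accepted."""
    errors = []
    for name, expected in COMMON_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    return errors
```

Rejected records are better routed to a separate location with their error list than dropped, so that the producer can be fixed.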

Caching

Cache eviction: prevents out-of-memory issues. There are a number of strategies, with LRU being the most popular.
Cache invalidation
- time-to-live (TTL)
- Event driven invalidation
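
A minimal sketch that combines LRU eviction with TTL and explicit invalidation in one in-process cache. The class name and the injectable `clock` parameter are choices made for this example.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Hashable, Optional


class LruTtlCache:
    def __init__(
        self,
        max_size: int,
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_size = max_size
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    def get(self, key: Hashable) -> Optional[Any]:
        item = self._items.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._items[key]  # TTL invalidation
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = (self._clock() + self._ttl, value)
        self._items.move_to_end(key)
        while len(self._items) > self._max_size:
            self._items.popitem(last=False)  # evict the least recently used

    def invalidate(self, key: Hashable) -> None:
        """Hook for event driven invalidation, e.g. on a 'record updated' event."""
        self._items.pop(key, None)
```

Passing the clock in makes expiry testable without sleeping. The sketch is not thread safe; a shared cache would need a lock around each method.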

Tracing

Choices
- OpenTracing
- Zipkin (originally written in Scala, now Java)
- Jaeger (written in Go)
- AWS X-Ray
What tags to include with trace
- environment
- service
- API/endpoint invoked

Protocols and communication patterns

Types of communication

API definitions

Protocol buffers (abbr. protobuf). Usages:
- Remote procedure call (RPC)
- Inter-process communication (IPC)
- Message schema definition
OpenAPI (aka. Swagger)
Thrift, less popular
Avro, less popular

SDKs for different languages can be generated from API definitions.

API design

Implement API gateway to handle:
- API routing
- Protocol translation (e.g., REST to Protobuf)
- Authentication
- Logging/metrics
- Usage auditing
Limit the usage of polymorphic payload (if payload is different in structure, better to make it a different API)
Error response is part of the design, not an afterthought.
Standardise and regulate the use of error codes. Adhere to HTTP status code definitions.
Treat HTTP 5xx status as system failures that require intervention (i.e., don't use them lightly).
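
A minimal sketch of a standardised error response, assuming a central registry that maps application error codes to HTTP statuses. The error codes, the `ApiError` shape and the function name are hypothetical.

```python
from dataclasses import asdict, dataclass
from http import HTTPStatus
from typing import Dict


@dataclass(frozen=True)
class ApiError:
    status: int
    code: str
    message: str
    trace_id: str


# Hypothetical registry: every error code is declared once, with its status.
ERROR_STATUS = {
    "ORDER_NOT_FOUND": HTTPStatus.NOT_FOUND,
    "INVALID_QUANTITY": HTTPStatus.BAD_REQUEST,
}


def error_response(code: str, message: str, trace_id: str) -> Dict[str, object]:
    """Build the response body; unregistered codes are treated as system failures."""
    status = ERROR_STATUS.get(code, HTTPStatus.INTERNAL_SERVER_ERROR)
    return asdict(ApiError(int(status), code, message, trace_id))
```

Because only unregistered codes fall through to 500, a 5xx in the metrics keeps meaning that something needs intervention.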

Service discovery

This ensures that services and APIs are addressable in the infrastructure (by a unique and stable name).

Overlay networks
Address: an abstract concept of where data should be sent, e.g., IP address
Routers: interprets the address and sends traffic to the correct endpoint
DNS server: specialised service that resolves service name to address

Failure and recovery

Access control

Multi tenancy

This is relevant for SaaS systems, where multiple users/customers/partners share the same application and infrastructure but not data.

Data segregation is the most common technique used.

Infrastructure segregation does not scale well.
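
A minimal sketch of data segregation at the storage interface, assuming every read and write carries a tenant ID. The `TenantStore` class is hypothetical and the dictionary stands in for a real database.

```python
from typing import Any, Dict, List, Optional, Tuple


class TenantStore:
    """All tenants share one store, but every key is scoped by tenant ID."""

    def __init__(self) -> None:
        self._rows: Dict[Tuple[str, str], Any] = {}

    def put(self, tenant_id: str, key: str, value: Any) -> None:
        self._rows[(tenant_id, key)] = value

    def get(self, tenant_id: str, key: str) -> Optional[Any]:
        # A tenant can never address another tenant's rows through this API.
        return self._rows.get((tenant_id, key))

    def keys(self, tenant_id: str) -> List[str]:
        return sorted(k for (t, k) in self._rows if t == tenant_id)
```

Making the tenant ID a mandatory argument, rather than an optional filter, is the design choice that prevents accidental cross-tenant reads.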

Testing

https://martinfowler.com/articles/practical-test-pyramid.html

Principles

Test the right thing, at the right level (API level > unit level > whole system level). E.g., DAO (having more complexity presumably) deserves more testing than HTTP handlers.
Aim for quality, not coverage. E.g., 90% DAO test coverage with mocked DB isn't better than 70% with real DB.
Higher level tests should be more general, lower level tests should be more specific (e.g., cover edge cases)
You can't cover everything in test, but you can make sure you know how to fix it when it breaks (e.g., with good monitoring/logging)
Use BDD style (i.e., structure test as a scenario) but not BDD itself (i.e., don't do scenario-to-code translation)
Make test data identifiable. Test data should never interfere with real data.
Don't be afraid to run tests in production. This requires building the application with testability in mind.

Load test

Scenario description
- Number of users
- Scenario of each user
- Ramp up/cool down period
Metrics
- Response time
- Throughput
- Error rate
Monitoring: make sure the load generator isn't stressed out, by monitoring its CPU/memory/network
Scaling: run multiple instances of the load generator, and aggregate the logs.
Tools
- Gatling: open source, written in Scala, good report UI, own DSL
- JMeter: open source, written in Java, hard to configure
- Locust: open source, written in Python
- NeoLoad: commercial
- BlazeMeter: commercial
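
The three metrics above can be derived from raw request samples as sketched below. This is an illustration that assumes a non-empty list of samples and uses the nearest-rank method for the 95th percentile; the `Sample` type and `summarise` name are hypothetical, and the tools listed above produce such reports themselves.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Sample:
    latency_ms: float
    ok: bool


def summarise(samples: List[Sample], duration_seconds: float) -> Dict[str, float]:
    """Response time (p95), throughput and error rate for one test run."""
    latencies = sorted(s.latency_ms for s in samples)
    # Nearest-rank percentile: the smallest value covering 95% of the samples.
    rank = max(1, math.ceil(0.95 * len(latencies)))
    return {
        "p95_ms": latencies[rank - 1],
        "throughput_rps": len(samples) / duration_seconds,
        "error_rate": sum(1 for s in samples if not s.ok) / len(samples),
    }
```

When several load generator instances are used, the raw samples should be merged before summarising, since percentiles cannot be averaged across instances.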

Code hygiene

Limit scope of variables
Consistent naming
Declare constants at top
Always use a linter and integrate it into CI
Encourage the use of IDEs
Reproducible builds: Use a package manager that can lock dependency versions

DevOps

Philosophy

Aim for level 5 of CI maturity
Configuration as code
Infrastructure as code
No ad-hoc fixes
Immutable infrastructure (no hot fixes)
Developers should self-service
Pager is the last line of escalation. Use it sparingly.

Centralisation

Components with shared ownership should be considered a piece of infrastructure, and managed in a single place (instead of distributed across repositories/codebases).

Examples:

Service API definitions (service provider != API owner)
Message schema definitions
Documentation
Data in data warehouse

Tooling

CLIs and scripts should be the preferred way of automation.

They should be well documented, versioned and published for easy installation. Example: goreleaser

CI/CD

Circle CI
Travis CI
GitHub Actions
Concourse CI: usually used as a deployment pipeline rather than a code builder
Jenkins
Bamboo: commercial

Deployment

Deploy from master (never branches)
Canary release
Blue/green deployment
Acceptance test on the full platform

Monitoring

Healthcheck endpoints for long running services
Tracing
Service dependency graph based on traffic and healthcheck. This makes service upgrades and decommissioning safer
Service metrics and dashboard

QA

QA > writing tests

QA is part of DevOps, not a separate team

Responsibility

Provide tooling/library/framework/process to make low level testing self-serviced by developers (unit tests, component tests, load tests).
Develop and own end user and high level tests, from an organisation or company perspective.
Test automation, reducing manual intervention.
Standardise test methodology across teams.
Reduce noise from fragile tests, false positives, long-time-known bugs, to prevent distraction and increase sensitivity to true positives across the company.

Theories of computing

Complexity of algorithms

Big O notation for both time and space complexity
Cyclomatic complexity: how many logical paths a function has
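
Cyclomatic complexity can be approximated by counting decision points. The sketch below does this for Python source with the standard `ast` module; it is a simplification (one plus the number of branches, loops, exception handlers and extra boolean operands), not a replacement for a proper linter rule, and the function name is hypothetical.

```python
import ast

# Node types that add one extra path each.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity: 1 + number of decision points in the source."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two short-circuit decisions.
            complexity += len(node.values) - 1
    return complexity
```

A function with one `if` that has a two-part condition and one `for` loop scores 4 under this count.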

Concurrency

Models of concurrency
- Thread
  - Locks, mutex, semaphore
- Event loop + Asynchronous IO
- Message passing and CSP (including actor systems)
Common errors
- Deadlock
- Race condition
Distributed systems
- CAP theorem
- Consensus algorithms: Paxos, Raft
- Eventual consistency
  - Gossip protocol
  - BASE
  - Warning: not suitable for systems requiring ACID, e.g., bank account transfer.
- Leader election
- Sharding
- LSM tree (log-structured merge-tree): used by many NoSQL DBs for fast writes; it is usually paired with a write-ahead log so that no data is lost
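
As an illustration of the thread model and of the race condition listed under common errors, here is a minimal sketch of a shared counter protected by a lock. The class and function names are hypothetical.

```python
import threading


class Counter:
    """A shared counter; the lock makes read-modify-write a single step."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, two threads could read the same old value
        # and one of the two increments would be lost.
        with self._lock:
            self._value += 1

    @property
    def value(self) -> int:
        with self._lock:
            return self._value


def run(workers: int, increments: int) -> int:
    """Increment one counter from several threads and return the final value."""
    counter = Counter()

    def work() -> None:
        for _ in range(increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value
```

The same problem disappears by construction in the message passing model, where only one owner ever mutates the value.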

Data science and machine learning

TBC
