Microservice Oriented Architecture checklist

SOA checklist

Disclaimer: Trust no one, use your brain! (Permanent work in progress)

Administrative (people, flows, responsibilities)

  • Blueprint/template for a new service
  • Documentation, standards, guides (how-to, know-how documents)
  • Team support
    • Understanding of the whole process by each member
    • Pro-active development and support
    • Accepted responsibilities and duties for each stage of a service
  • Plan for the service life cycle
    • Pre-production development
    • Launching
    • Rolling out a backward-compatible version of a service
    • Hotfixing
    • Rolling out a backward-incompatible version of a service
      • Data migration
      • Switchover
      • Service rollback

Automated processes

  • Continuous development
    • Tests
      • Automated
        • Unit tests
        • Functional tests
        • Code style (lints and sniffers)
        • Code quality monitoring (Sonar, Scrutinizer)
        • Code coverage checks
      • Manual
        • Feature acceptance/Business acceptance
        • A/B tests
    • Conditions of integration
      • Code style checks
      • Test results
      • Code coverage percentage
    • Conditions for reverting (rolling back) a feature
      • Error rate after deploying to live
      • Healthchecks
    • Storing each newly tested snapshot/artefact of a service
      • Artefact storage (Docker registry)
        • Cleanup policy (delete old tags after a timeout)
  • Continuous delivery of stable artefacts
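
The conditions for integrating a feature, and for backing it out again after a live deploy, can be read as two small predicates. What follows is a minimal sketch, not a reference implementation: the `BuildReport` shape, the 80% coverage floor and the 1% error-rate ceiling are illustrative assumptions, not values this checklist prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildReport:
    """Results gathered by the automated checks (hypothetical shape)."""
    style_violations: int    # reported by lints and sniffers
    tests_failed: int        # unit and functional tests combined
    coverage_percent: float  # reported by the coverage check


def may_integrate(report: BuildReport, min_coverage: float = 80.0) -> bool:
    """Integration gate: clean style, green tests, enough coverage."""
    return (
        report.style_violations == 0
        and report.tests_failed == 0
        and report.coverage_percent >= min_coverage
    )


def should_roll_back(error_rate: float, healthchecks_ok: bool,
                     max_error_rate: float = 0.01) -> bool:
    """Rollback gate: failing health checks or too high an error rate."""
    return (not healthchecks_ok) or error_rate > max_error_rate
```

A pipeline could call `may_integrate` before merging and `should_roll_back` on a timer after each deploy; the thresholds are best agreed per service rather than hard-coded.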

Implementation

  • System layers
    • Hardware: Servers and networks
      • Scaling (adding new nodes) should not affect consistency of other layers
      • Degradation (removing nodes) should not affect consistency of other layers
      • Monitoring
        • Hardware
        • Network
        • Resources and load
      • Alerting policy
    • Cluster: service management system (Kubernetes; alternatives: OpenShift, Apache Mesos/Apache Karaf)
      • Monitoring
        • Availability of each node in the cluster
        • All services up and running
        • Connectivity between different pods and services
        • Public endpoints accessibility
      • Alerting policy
      • A restart (full or partial) should bring the cluster and its systems back up without damage
      • Log aggregation system - collect all logs from all containers
      • Execution environment
        • Meta-project with topology of the system
          • Showroom + Staging
            • Separate namespace for each showroom
            • Fixed showroom for the staging (last stable pre-release)
          • Production
            • Configuration
              • Secrets
              • Configs should be a part of the meta-project
    • Service: Application and any service
      • Service itself (Docker image)
        • Backward compatibility for a few generations
          • Cleanup policy for deprecated/unused:
            • Logic branches
            • Data structures (RDBMS/NoSQL)
        • One container - one process
          • Segregated commands even in one image (management layer can pick any to run)
          • Built in commands
            • Test the service/source code (Docker Compose to set up the required test environment)
          • DEV/DEBUG mode
        • Logging
          • Writing to stdout (instead of the container's file system) lets the cluster layer collect and keep all logs
        • Monitoring
          • Application and business checks (New Relic: throughput, metrics)
          • Self health checks (metrics + Prometheus + Grafana)
            • Queue contents (number of messages)
            • DB contents (custom checks)
            • Cache utilization check
        • Alerting policies (Prometheus, New Relic)
        • Tracing system agent (Zipkin)
        • Self-sufficiency
          • Interfaces documentation
            • RESTful API
            • Port and service description (README.md files)
          • Service should be able to set itself up
            • Wait for required related services and ports (dockerize)
            • Configuring from environment variables (confd)
            • Warming up
              • Run data migrations (requires a maintenance service)
              • Cache population
      • Replication, balancing and scaling on service level
      • Failover and self-reorganisation in case of:
        • Service crashed
        • Physical node out of cluster
        • Resources problems on specific node
      • Logs system
        • Service to collect and access logs grabbed from Cluster layer
          • ELK stack, Graylog, etc.
      • Persistent volumes to keep data
        • AWS EBS
        • Ceph
        • NFS
  • Common services
    • Tracing system (Zipkin)
    • Single sign-on service
      • Authentication service (JWT)
      • Authorization requests from all services
    • Detached processing (CQRS)
      • Request-Queue-Processor schema
      • Stream data addressing and processing (Reactor)
    • Real Time data requests processing
      • Reliable data provider/API gateway (synchronous data retrieval)
        • Request-Manager-Service solution
    • Reliable data-bus for events
      • Event-Broker-Subscriber solution (Apache Camel)
        • HTTP/TCP API endpoint to accept events
        • Event enrichment (gather the information subscribers require)
        • Event delivery
        • Event delivery policies
          • Retry
          • Requeue
          • Give up
    • RDBMS: Postgres cluster
    • DB backups: PG backupper
    • Key-value + Queue: Redis cluster
    • Messaging system: RabbitMQ cluster
    • Healthcheck system
    • Alerting system
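
The event delivery policies listed for the data bus (retry, requeue, give up) can be chained into one decision. The sketch below is only one possible reading: the `deliver` function, the `Outcome` enum, the `requeues` counter stored on the event and the default limits are all hypothetical names and values, and the `send` callback stands in for whatever transport the broker really uses.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DELIVERED = "delivered"
    REQUEUE = "requeue"
    GIVE_UP = "give_up"


def deliver(event: dict,
            send: Callable[[dict], bool],
            max_retries: int = 3,
            max_requeues: int = 2) -> Outcome:
    """Try to deliver an event: retry first, then requeue, then give up.

    `send` returns True on success. `event["requeues"]` counts how many
    times this event has already been put back on the queue.
    """
    # Retry: one initial attempt plus up to `max_retries` immediate retries.
    for _ in range(1 + max_retries):
        if send(event):
            return Outcome.DELIVERED

    # Requeue: hand the event back to the queue for a later attempt.
    requeues = event.get("requeues", 0)
    if requeues < max_requeues:
        event["requeues"] = requeues + 1
        return Outcome.REQUEUE

    # Give up: the caller should log the event or move it to a dead-letter store.
    return Outcome.GIVE_UP
```

A real broker would likely add a backoff delay between retries and persist the requeue counter in message headers; both are left out here to keep the policy order visible.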