/castellum

Vertical autoscaling service for OpenStack

Primary LanguageGoApache License 2.0Apache-2.0

Castellum

CI Coverage Status Go Report Card

Castellum is a vertical autoscaling service for OpenStack. It can perform autonomous resize operations, upscaling assets with a high usage and downscaling assets with a low usage.

In this document:

In other documents:

Terminology

  • An asset is a thing that Castellum can resize. Assets have a size and usage, such that 0 <= usage <= size.
    • example: "NFS share 2180c598-58f3-4d1d-b03e-303db22de1be"
  • A resource is the sum of all assets in a certain authentication scope. Castellum's behavior is configured at this level, e.g. thresholds and resizing steps. See API specification for details.
    • example: "NFS shares in project 5ceb23209bef4292b9ec97eb3e664f74"
  • An operation is a single resize performed by Castellum.

Operations move through the following states:

State machine

  • Created: The asset's usage has crossed one of the thresholds configured on the resource.
  • Confirmed: The asset's usage has stayed at problematic levels for the configured delay. (For the critical threshold, there is no delay, so operations move from "created" to "confirmed" automatically.)
  • Greenlit: The operation has been approved by a user. (If no approval requirement is configured, operations move from "confirmed" into "greenlit" automatically.)
  • Cancelled: While an operation was not yet greenlit, the asset's usage moved back to normal levels.
  • Succeeded: The resize operation was completed successfully.
  • Failed/Errored: The resize operation was attempted, but failed or errored.

Problems with resizing fall into two categories: Failures need to be addressed by the project/domain administrators (e.g. upsize failed because of insufficient quota), while errors are unexpected backend errors that the OpenStack administrator needs to take care of (e.g. outage of an API used by Castellum).

Building and running

Build with make && make install, or with docker build if that's to your taste. Castellum has three different components that you all need to run for a complete installation:

  • castellum api <config-file> provides an OpenStack-style HTTP-based REST API. To add TLS, put this behind a reverse proxy.
  • castellum observer <config-file> discovers assets and (based on their status) creates, confirms and cancels resize operations.
  • castellum worker <config-file> performs the actual resizing.

The API and worker components can be scaled horizontally at will. The observer cannot be scaled. Do not run more than one instance of it at a time.

The API component has audit trail support and can be configured to send audit events to a RabbitMQ server.

All components receive configuration via environment variables. The following variables are recognized:

Variable Default Explanation
CASTELLUM_ASSET_MANAGERS (required) A comma-separated list of all asset managers that can be enabled. This configures what kinds of assets Castellum can handle. See docs/asset-managers/ for which asset managers exist.
CASTELLUM_DB_USERNAME postgres Username of the user that Castellum should use to connect to the database.
CASTELLUM_DB_PASSWORD (optional) Password for the specified user.
CASTELLUM_DB_HOSTNAME localhost Hostname of the database server.
CASTELLUM_DB_PORT 5432 Port on which the PostgreSQL service is running on.
CASTELLUM_DB_NAME castellum The name of the database.
CASTELLUM_DB_CONNECTION_OPTIONS (optional) Database connection options.
CASTELLUM_HTTP_LISTEN_ADDRESS :8080 Listen address for the internal HTTP server. For castellum observer/worker, this just exposes Prometheus metrics on /metrics. For castellum api, this also exposes the REST API.
CASTELLUM_LOG_SCRAPES false Whether to write a log line for each asset scrape operation. This can be useful to debug situations where Castellum does not create operations when it should, but it generates a lot of log traffic (one line per asset per 5 minutes, which e.g. for 2000 assets is about 1 GiB per week).
CASTELLUM_OSLO_POLICY_PATH
(API only)
(required) Path to the policy.json file for this service. See Oslo policy for details.
CASTELLUM_RABBITMQ_QUEUE_NAME
(API only)
(required for enabling audit trail) Name for the queue that will hold the audit events. The events are published to the default exchange.
CASTELLUM_RABBITMQ_USERNAME
(API only)
guest RabbitMQ Username.
CASTELLUM_RABBITMQ_PASSWORD
(API only)
guest Password for the specified user.
CASTELLUM_RABBITMQ_HOSTNAME
(API only)
localhost Hostname of the RabbitMQ server.
CASTELLUM_RABBITMQ_PORT
(API only)
5672 Port number to which the underlying connection is made.
CASTELLUM_AUDIT_SILENT
(API only)
false Disable audit event logging to standard output.
OS_... (required) A full set of OpenStack auth environment variables for Castellum's service user. See documentation for openstackclient for details.

All components also expect a positional argument containing the path of a YAML configuration file. Below is a working example for a configuration file:

max_asset_sizes:
  - asset_type: 'nfs-shares(-group:.+)?'
    value: 16384

project_seeds:
  - project_name: myproject
    domain_name: mydomain
    resources:
      nfs-shares:
        critical_threshold: { usage_percent: 95 }
        size_steps: { percent: 10 }
        size_constraints: { max_size: 8192 }
    disabled_resources:
      - 'project-quota:.*'

The following fields are allowed:

Field Type Explanation
max_asset_sizes array of objects If present, resource configurations for matching asset types will only be allowed if they include a compatible max_size constraint. If multiple constraints apply to the same resource, later constraints override earlier ones.
max_asset_sizes[].asset_type regex Regex that specifies which asset types this constraint applies to.
max_asset_sizes[].scope_uuid string If present, the constraint only applies to resources with exactly this scope_uuid value. This can be used to override a general constraint for a specific project or domain.
max_asset_sizes[].value integer Highest permissible value for the max_size constraint on matching resources.
project_seeds array of objects Specification of projects that will have resources configured. The observer will apply these seeds, and the API will reject attempts to manually override the seeded configuration.
project_seeds[].project_name string Name (not ID!) of the project.
project_seeds[].domain_name string Name (not ID!) of the domain containing the project.
project_seeds[].resources.$type object Specification of a resource that will be statically configured in this project. The contents of this object must be identical to the payload that will be accepted for PUT /v1/projects/$project_id/resources/$type. See API spec for details.
project_seeds[].disabled_resources list of strings A list of regexes. Any asset type that matches one of these regexes will have autoscaling disabled and forbidden in this project. This can be used to delete resources that were configured by an earlier version of the seed.

All regexes are matched against the entire asset type string, i.e. a leading ^ and trailing $ are always added implicitly.

When applying project seeds, projects that do not exist in Keystone will be skipped without logging an error.

Oslo policy

Castellum understands access rules in the oslo.policy JSON format. An example can be seen at docs/example-policy.json. The following rules are expected:

  • project:access gates access to all endpoints relating to a project, even if more specific rules are checked later on.
  • project:show:<asset_type_shortened> gates access to all endpoints relating to a project resource.
  • project:edit:<asset_type_shortened> gates access to the PUT and DELETE endpoints relating to a project resource.

All project-level policy rules can use the following object attributes:

%(project_id)s           <- deprecated, use the next one instead
%(target.project.id)s
%(target.project.name)s
%(target.project.domain.id)s
%(target.project.domain.name)s

When policy rule names reference the asset type, only the part of the asset type up until the first colon is used. For example, access to project resources with asset type project-quota:compute:instances would be gated by the rules project:show:project-quota and project:edit:project-quota.

See also: List of available API attributes

Prometheus metrics

Each component (API, observer and worker) exposes Prometheus metrics via HTTP, on the /metrics endpoint. The following metrics are exposed:

Metric/Component Description
castellum_operation_state_transitions
(API, observer, worker)
Counter for state transitions of operations.
Labels: project_id, asset (asset type), from_state and to_state.
castellum_has_project_resource
(observer)
Constant value of 1 for each existing project resource. This can be used in alert expressions to distinguish resources with autoscaling from resources without autoscaling.
Labels: project_id, asset (asset type).
castellum_resource_scrapes
(observer)
Counter for executed resource scrape operations.
Labels: asset (asset type), task_outcome (either failure or success).
castellum_asset_scrapes
(observer)
Counter for executed asset scrape operations.
Labels: asset (asset type), task_outcome (either failure or success).
castellum_asset_resizes
(worker)
Counter for asset resize operations (see below for semantics notes).
Labels: asset (asset type), task_outcome (either failure or success).

Note that castellum_asset_resizes{task_outcome="success"} is incremented whenever a PendingOperation is consumed and converted into a FinishedOperation, even if that operation moved into state "failed" or "errored". The counter castellum_asset_resizes{task_outcome="failure"} is only incremented when a greenlit operation cannot be moved out of the "greenlit" state at all. Resize operations that move into state "failed" or "errored" are counted by castellum_operation_state_transitions{to_state=~"failed|errored"}.