Millwright is a tool for configuring, executing, and monitoring multi-container Docker projects.
Millwright uses the Docker Client SDK as a backend to handle builds, execution, and networking. It exchanges heartbeats with the components to detect failures and handles them automatically.
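The README does not show how the heartbeat exchange is implemented (or whether it is push- or poll-based). As a minimal, purely illustrative sketch, a poll-based check against each component's HTTP introspection endpoint could look like the following; the `/debug/vars` path, the 5-second interval, and the function names are assumptions, not millwright's actual internals:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// checkHeartbeat reports whether a component answers on its introspection endpoint.
func checkHeartbeat(host string, port int) bool {
	client := http.Client{Timeout: 2 * time.Second}
	resp, err := client.Get(fmt.Sprintf("http://%s:%d/debug/vars", host, port))
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// monitor polls a component and invokes onFailure when a heartbeat is missed.
func monitor(host string, port int, onFailure func()) {
	for range time.Tick(5 * time.Second) {
		if !checkHeartbeat(host, port) {
			onFailure() // e.g. rebuild and restart the container
		}
	}
}
```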
To use it, first build millwright. If you're in the project root you can run

```sh
go build -o mw
```

and you'll have a binary named `mw` ready to use.
Note: the user that runs millwright commands must have permission to access the Docker Engine socket (e.g. by being in the `docker` user group; a quick fix is `sudo usermod -aG docker $LOGNAME`).
Millwright is configured programmatically using the `configureComponents` function in `internal/config.go`. Component configurations and dependencies are all specified here. Here's a shortened example of a config:
```go
func configureComponents() []*Component {
	dispatcher := &Component{
		serviceName: "dispatcher",
		runConfig: RunConfiguration{
			DockerfilePath:   "./dispatcher/Dockerfile",
			BuildContextPath: ".",
			Env: []string{
				fmt.Sprintf("DISPATCHER_MIN_BUFFER=%d", 20),
			},
		},
		dependencies: []*Component{},
	}

	cpuUsageHandler := &Component{
		serviceName: "handler_cpu_usage",
		runConfig: RunConfiguration{
			DockerfilePath:   "./handlercpu/Dockerfile",
			BuildContextPath: ".",
			Env: []string{
				fmt.Sprintf("HANDLER_POLL_DELAY=%d", 500),
				fmt.Sprintf("DISPATCHER_HOST=%s", dispatcher.serviceName),
				fmt.Sprintf("DISPATCHER_PORT=%d", 8080),
			},
		},
		dependencies: []*Component{dispatcher},
	}

	return []*Component{dispatcher, cpuUsageHandler}
}
```
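The `Component` and `RunConfiguration` types used above live in the `internal` package and their definitions aren't shown here. A rough sketch of what they might look like, with field names inferred from the example (everything beyond those names is an assumption):

```go
// Hypothetical shape of the configuration types, inferred from the example above.
type RunConfiguration struct {
	DockerfilePath   string   // Dockerfile used to build the component's image
	BuildContextPath string   // build context passed to the Docker build
	Env              []string // environment variables in KEY=value form
}

type Component struct {
	serviceName  string           // also used as the container/DNS name on the network
	runConfig    RunConfiguration // how to build and run the container
	dependencies []*Component     // components that must be started first
}
```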
Millwright can create the infrastructure and start monitoring it using this command:

```sh
mw start [--force]
```

Because the modules are not published, this command has to be run from the project root; otherwise it won't be able to find the modules. (This only affects `start`; the other commands can be run from anywhere.)
Note that when a millwright is terminated, the components keep running as usual. When started, a millwright checks the existing components and only launches those that are missing or have failed. It also takes care of components that are running but are attached to the wrong network.
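That startup behavior can be pictured as a simple diff between the configured components and the containers that already exist. The sketch below is illustrative only and builds on the hypothetical `Component` type above; the `running` map and `launch` callback are assumptions, not millwright's actual internals:

```go
// reconcile launches only the components that are not already running correctly.
// running maps a service name to whether a healthy, correctly networked container exists.
func reconcile(components []*Component, running map[string]bool, launch func(*Component) error) error {
	for _, c := range components {
		if running[c.serviceName] {
			continue // container exists and is healthy: leave it alone
		}
		if err := launch(c); err != nil {
			return fmt.Errorf("launching %s: %w", c.serviceName, err)
		}
	}
	return nil
}
```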
Multiple millwrights can be started in order to have fault tolerance. By default, when running `mw start`, the new millwright waits until any existing millwrights have exited. This behavior is implemented using Derailleur. To force a millwright to start immediately without waiting, you can supply the `--force` flag; if another millwright is already running, it will be forced to quit automatically.

Note that this coordination only applies when you start a millwright. The other commands (described below) are executed right away, without starting a new millwright.
Millwright can be used to manually inspect what a component is serving on its introspection endpoint using this command:

```sh
mw inspect <component_name>
```

This will return a JSON object that contains the default `expvar` published variables, and for the payload handlers, it also shows their internal state.
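For reference, `expvar` serves its variables as JSON on `/debug/vars`. A handler component could expose extra internal state alongside the defaults with something like the sketch below; the `"state"` variable name and its fields are illustrative, not necessarily what the payload handlers actually publish:

```go
package main

import (
	"expvar"
	"net/http"
	"sync/atomic"
)

var processed atomic.Int64 // example internal counter

func main() {
	// expvar.Func is evaluated on every request to /debug/vars,
	// so the published "state" object always reflects current values.
	expvar.Publish("state", expvar.Func(func() any {
		return map[string]any{
			"processed_payloads": processed.Load(),
		}
	}))

	// Importing expvar registers its handler on /debug/vars of the default mux.
	http.ListenAndServe(":8080", nil)
}
```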
Millwright can be used to forcibly stop the container for a specific component using this command:

```sh
mw kill <component_name>
```

Note that no output means the command finished successfully.
Millwright can bring down and forcibly remove all the resources it has created using the following command:

```sh
mw destroy
```

Under the hood this relies on the fact that the label `used-by=millwright` is added to all the resources.
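With the Docker client SDK, finding everything that carries that label is a filter query. The following is a rough sketch of what such a cleanup could look like, not millwright's actual code; the option struct names vary between SDK versions, and error handling and other resource types (images, volumes) are omitted:

```go
package main

import (
	"context"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

func destroyLabelled(ctx context.Context) error {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return err
	}

	f := filters.NewArgs()
	f.Add("label", "used-by=millwright")

	// Remove every container created by millwright, running or not.
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{All: true, Filters: f})
	if err != nil {
		return err
	}
	for _, c := range containers {
		if err := cli.ContainerRemove(ctx, c.ID, types.ContainerRemoveOptions{Force: true}); err != nil {
			return err
		}
	}

	// Remove the user-defined networks carrying the same label.
	networks, err := cli.NetworkList(ctx, types.NetworkListOptions{Filters: f})
	if err != nil {
		return err
	}
	for _, n := range networks {
		if err := cli.NetworkRemove(ctx, n.ID); err != nil {
			return err
		}
	}
	return nil
}
```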
The following command runs the unit tests:

```sh
go test ./...
```
There is also a script that tests millwright commands in various situations. When in the project root, you can run:

```sh
cd internal && ./millwright_test.sh
```
Millwright's design relies on the following assumptions:

- Partially synchronous network model: messages may be dropped or reordered but can eventually get through if retried; packets are not spoofed or modified in transit.
- Crash-stop failure model: components can handle some failures on their own without crashing, but if they do crash they stop executing forever.
- Non-Byzantine behavior: processes don't deviate from their algorithms and are assumed to be non-malicious.
Here I will discuss how my implementation of this project could be extended for a real-world system.
The real-world equivalent for millwright would be something that can run on multiple nodes, not just locally.
In order to find the (ever-changing) IPs of the containers I am currently making use of the fact that Docker creates DNS names for the services in a user-defined network.
In a real system I would launch a DNS-SD server that all containers are configured to use for DNS queries (falling back to an external DNS server for queries unrelated to service discovery). When a container starts, it would register itself in the DNS-SD server. This would also make it possible to do round-robin load balancing across the containers for a given service, or, if necessary, to let all containers of a given service find each other (this will be useful below).
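The current approach is visible from inside any container on the user-defined network: resolving a service name through Docker's embedded DNS returns the IPs of the matching containers. A tiny illustration (the `dispatcher` name comes from the config example above; run outside such a network, the lookup will simply fail):

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Inside a user-defined Docker network, the embedded DNS server resolves
	// the service name to the IPs of its containers; when several containers
	// share one name, the returned records could form the basis for round-robin.
	ips, err := net.LookupIP("dispatcher")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, ip := range ips {
		fmt.Println(ip)
	}
}
```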
Millwrights are already "distributed", but the coordination method they use requires access to a shared filesystem. Even if a distributed filesystem (such as Ceph) were used, `inotify` wouldn't work there, so the current implementation would still fail. One way to make this work would be to use a separate coordination service (e.g. ZooKeeper, which inspired the Derailleur implementation).
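As a sketch of what the same wait-for-exit behavior could look like on top of ZooKeeper, an ephemeral znode can act as the "one active millwright" marker. Everything here is an assumption rather than a planned design: the znode path, the use of the go-zookeeper client, and the omission of session-expiry and `--force` takeover handling.

```go
package main

import (
	"time"

	"github.com/go-zookeeper/zk"
)

const leaderPath = "/millwright-leader" // hypothetical znode marking the active millwright

// acquire blocks until no other millwright holds the leader znode, then claims it.
// The znode is ephemeral, so it disappears automatically if this process dies.
func acquire(conn *zk.Conn) error {
	for {
		_, err := conn.Create(leaderPath, []byte("millwright"), zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
		if err == nil {
			return nil // we are now the active millwright
		}
		if err != zk.ErrNodeExists {
			return err
		}
		// Another millwright is active: watch the znode and wait for it to go away.
		exists, _, events, err := conn.ExistsW(leaderPath)
		if err != nil {
			return err
		}
		if exists {
			<-events // fires when the znode changes or is deleted; then retry
		}
	}
}

func main() {
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	if err := acquire(conn); err != nil {
		panic(err)
	}
	// ... start reconciling and monitoring components ...
}
```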
Currently, millwright considers a component to have a one-to-one relationship with a container. This would change so that each instance of a component has its own status, while the component itself becomes a group of such instances with a desired and a current scale.
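A sketch of how the data model could evolve, extending the hypothetical types from the config sketch above (all names and fields here are illustrative):

```go
// Hypothetical evolution of the Component model: one component, many instances.
type InstanceStatus int

const (
	InstanceStarting InstanceStatus = iota
	InstanceHealthy
	InstanceFailed
)

type Instance struct {
	containerID string
	status      InstanceStatus
}

type ScaledComponent struct {
	serviceName  string
	runConfig    RunConfiguration
	dependencies []*ScaledComponent
	desiredScale int        // how many instances we want
	instances    []Instance // the instances that currently exist
}
```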
Currently, I'm just using `expvar` to be able to check health and to manually inspect internal states. In a production system, I could still have a separate health check endpoint for the millwright to use, but for proper observability I would use a solution that can collect logs, resource usage, etc. from all the components, persist them (perhaps in a time-series database), and visualize them all in one place.
By having access to resource usage, millwright could auto-scale components in order to handle load spikes, or to save processing power when it is not needed.

Also, if component instances were aware of their own resource usage and limits, failures such as an OOM kill could be avoided: near the resource limit, a component would stop processing new requests and only respond with a failure status code (such as 503), which would make the corresponding consumers do exponential backoff until auto-scaling kicks in.
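As a concrete illustration of that backpressure idea, an HTTP handler could check its own memory usage against a soft threshold before accepting work. The threshold value and the use of Go heap statistics are assumptions; a containerized component would more likely read its cgroup memory limit.

```go
package main

import (
	"net/http"
	"runtime"
)

// Hypothetical soft limit: stop accepting work above 400 MiB of live heap.
const softLimitBytes = 400 << 20

// withBackpressure rejects new requests with 503 when memory use is near the limit,
// signalling consumers to back off until auto-scaling (or GC) relieves the pressure.
func withBackpressure(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)
		if m.HeapAlloc > softLimitBytes {
			w.Header().Set("Retry-After", "5")
			http.Error(w, "overloaded, retry later", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	process := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("processed"))
	})
	http.ListenAndServe(":8080", withBackpressure(process))
}
```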
Minimal configuration changes can (sort of) be done with the current implementation. If the millwright config is changed, and a millwright with this new config is built and run with `--force`, it will replace the current millwright and try to reach a state it is happy with. However, this is quite limited, and a proper solution would have to be implemented in a production system.
In the current implementation, a component can be redeployed after a change by running `mw kill <component>`, which brings it down; the running millwright then detects that as a failure and builds it again (this time with the new version). This is not enough, however, because in a production system you would want to make such changes without any downtime. Currently, millwright has no notion of versioning for the components, so that would have to change.
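One common way to get zero-downtime redeploys is a rolling update over the instance model sketched earlier: start a new-version instance, wait until it is healthy, then retire an old one. The helpers `startInstance`, `waitHealthy`, and `stopInstance` below are assumptions, not existing millwright functions:

```go
// rollingUpdate replaces the instances of a component one at a time, so the
// number of serving instances never drops below the desired scale during the update.
func rollingUpdate(
	c *ScaledComponent,
	newVersion string,
	startInstance func(*ScaledComponent, string) (Instance, error),
	waitHealthy func(Instance) error,
	stopInstance func(Instance) error,
) error {
	old := append([]Instance(nil), c.instances...)
	for _, oldInst := range old {
		// Bring up a replacement running the new version first...
		newInst, err := startInstance(c, newVersion)
		if err != nil {
			return err
		}
		if err := waitHealthy(newInst); err != nil {
			return err
		}
		// ...and only then retire the old instance.
		if err := stopInstance(oldInst); err != nil {
			return err
		}
	}
	return nil
}
```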