balena-os/balena-supervisor

/v2/applications/:appId/restart-service recreates the service

hedss opened this issue · 9 comments

hedss commented

Whilst the start endpoint starts a service without recreating it, the restart endpoint actually recreates the container. This isn't consistent, and potentially constitutes a problem.

[pdcastro] This issue has attached support thread https://jel.ly.fish/ca7430c4-a2cd-44e1-a234-7bd64d643231

For what it is worth (maybe not much), based on experimenting with the web dashboard, SSHing into a NUC device, and tests on my laptop (but without having looked at the Supervisor source code), I gather the following (a code sketch follows the list):

  • docker restart = docker stop + docker start
  • Dashboard / supervisor's stop = docker stop
  • Dashboard / supervisor's start = docker start
  • Dashboard / supervisor's restart = docker rm -f + docker run
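For illustration, here is a minimal TypeScript sketch of the two behaviours using the dockerode client library (my choice for the sketch only; this is not the Supervisor's actual code path). The name and image values are placeholders:

import Docker from 'dockerode';

const docker = new Docker({ socketPath: '/var/run/docker.sock' });

// "docker restart" semantics: stop + start.
// The container, and its writable layer, survive.
async function restartLikeDocker(name: string): Promise<void> {
  const container = docker.getContainer(name);
  await container.stop();
  await container.start();
}

// Supervisor-style restart: rm -f + run.
// The container is recreated and its writable layer is discarded.
async function restartLikeSupervisor(name: string, image: string): Promise<void> {
  await docker.getContainer(name).remove({ force: true }); // docker rm -f
  const fresh = await docker.createContainer({ name, Image: image }); // docker run
  await fresh.start();
}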

I am not sure about the design rationale behind removing + recreating containers on restart, but the docker restart command does not do this, so it would be better if we emulated it. The change would theoretically just be a matter of passing a removeContainer option to the kill call on this line, as follows:

// Stop the service but keep its container, so that a subsequent start reuses it
await serviceManager.kill(service, { wait: true, removeContainer: false });

cywang117 commented

To clarify further, as I've just investigated all possible places where restart happens:
1. Clicking the Restart button on the summary page AND on the actions page recreates ALL containers.
2. Clicking the Restart button by a single service in the services table on the summary page does NOT recreate that container.

All the current restart Supervisor endpoints are as follows (an example call is sketched after the list):

- /v1/restart - restarts all services, removes & recreates, uses doRestart under the hood
- /v2/applications/:appId/restart - restarts all services, removes & recreates, uses doRestart under the hood
- /v2/applications/:appId/restart-service - restarts individual services, does NOT remove & recreate, uses handleServiceAction under the hood
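For concreteness, here is a sketch (not taken from the Supervisor source) of how a service container might call the per-service endpoint from TypeScript on Node 18+. It assumes the service has access to the Supervisor API (e.g. via the io.balena.features.supervisor-api label), in which case the BALENA_SUPERVISOR_ADDRESS, BALENA_SUPERVISOR_API_KEY, and BALENA_APP_ID environment variables are injected; 'my-service' is a placeholder:

async function restartService(serviceName: string): Promise<void> {
  const address = process.env.BALENA_SUPERVISOR_ADDRESS;
  const apiKey = process.env.BALENA_SUPERVISOR_API_KEY;
  const appId = process.env.BALENA_APP_ID;
  // POST /v2/applications/:appId/restart-service with the target service name
  const res = await fetch(
    `${address}/v2/applications/${appId}/restart-service?apikey=${apiKey}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ serviceName }),
    },
  );
  if (!res.ok) {
    throw new Error(`Supervisor returned HTTP ${res.status}`);
  }
}

restartService('my-service').catch(console.error);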

Per the May 4th balena-io product call, we should unify this behavior on the Supervisor, so that all restart actions remove + recreate containers.

EDIT: Not removing the above so that there's context, but my earlier statement is wrong. Currently, all restart routes recreate the service, and this is intended. Supervisor's default behavior when killing a container with the service-manager is to remove the container. Apologies for any misconceptions my earlier statement may have perpetuated. See the internal call documentation here for further clarification.

hedss commented

Been a while since I raised this, but I'm also considering using balena for a project, and I'd like to know the rationale behind recreating the containers. Certainly when @CameronDiver and I originally discussed this, we agreed this was not ideal and was probably originally in error, because:

  1. This is not how Docker behaves, and this actually makes a huge difference in development.
  2. You could potentially end up in a situation where you blow away overlays you might otherwise expect to be populated.

Thanks!

cywang117 commented

Hi @hedss, thanks for bringing this point up. In my previous comment from May 5, I wasn't very clear about balena's rationale behind implementing restart as a recreate action. Let me try to explain here, since the internal call documentation I linked may not be accessible to everyone:

You mentioned that service recreation on restart may have originally been implemented in error. I don't believe any of the current Supervisor maintainers were there to see this implementation get merged, so I can't comment on that. However, in our previous internal call over this, we decided to stick with this implementation of the restart action recreating the containers, even if it's divergent from Docker's behavior. We discussed some points:

  • Based on Docker's design philosophies, containers are meant to be ephemeral. This means that a new container should be a drop-in replacement for an old container with minimal to no impact. When the Supervisor removes containers on restart, we're simply adhering to this philosophy.

  • By removing containers with restart, we're encouraging users to follow best practices in Docker data persistence, as listed out by Docker here. With balena's OS & engine, users may use volumes, bind mounts, or tmpfs mounts for data persistence. These strategies also offer a performance boost over storing files in the container's writable layer. We link to this Docker page for data persistence strategies in our docs for the restart action here. (A short code sketch of this idea follows below.)
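To make the persistence point concrete, here is a minimal TypeScript sketch, assuming a named volume mounted at /data (a common balena convention; the paths and file names are placeholders, not part of any balena API):

import * as fs from 'fs/promises';

// State written to the container's writable layer is lost when the
// container is removed and recreated on restart.
const EPHEMERAL_PATH = '/var/lib/myapp/state.json';

// State written to a named volume (mounted here at /data) survives restart.
const PERSISTENT_PATH = '/data/state.json';

async function saveState(path: string, state: unknown): Promise<void> {
  await fs.writeFile(path, JSON.stringify(state));
}

async function loadState(path: string): Promise<unknown> {
  try {
    return JSON.parse(await fs.readFile(path, 'utf8'));
  } catch {
    // First run, or the state was wiped along with the writable layer.
    return undefined;
  }
}

After a Supervisor restart, loadState(PERSISTENT_PATH) would return the saved state, while loadState(EPHEMERAL_PATH) would start from scratch.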

@dfunckt weighed in with an important reason:

  • At this point, users are familiar with the restart behavior as it stands today. Fixing this behavior for the sake of aligning it with Docker's restart behavior might not be worth the tradeoff of user confusion at the sudden change.

@pdcastro weighed in with another important reason:

  • Removal + recreation is useful where recreating a container allows recovery from application bugs in which the application got stuck in a bad state that was persisted to the container's filesystem. In this scenario, data loss is a good thing, as the data consists of the bad state, and it's not immediately clear how a restart implemented as stop followed by an immediate start, without discarding the persisted state, would be useful here.

Note that users may temporarily stop a container without removing it by using the Supervisor API's /v2/applications/:appId/stop-service and /v2/applications/:appId/start-service endpoints (see here), or by clicking the "stop" and "start" icons in the service table on the dashboard's device summary page. The Supervisor only triggers these actions when a user hits the endpoints (directly or via the dashboard), so this offers a way of stopping without removing, if that suits a particular development workflow.
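As a sketch (assuming Supervisor API access and the BALENA_SUPERVISOR_* environment variables injected by balenaOS, as in the earlier example; 'my-service' is a placeholder), stopping and later restarting a single service without removing its container could look like:

async function serviceAction(
  action: 'stop-service' | 'start-service',
  serviceName: string,
): Promise<void> {
  const address = process.env.BALENA_SUPERVISOR_ADDRESS;
  const apiKey = process.env.BALENA_SUPERVISOR_API_KEY;
  const appId = process.env.BALENA_APP_ID;
  await fetch(`${address}/v2/applications/${appId}/${action}?apikey=${apiKey}`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ serviceName }),
  });
}

async function main(): Promise<void> {
  // The container is stopped but not removed, so its writable layer survives.
  await serviceAction('stop-service', 'my-service');
  // ...inspect or modify state while the service is down...
  // The same container (same filesystem state) starts again.
  await serviceAction('start-service', 'my-service');
}

main().catch(console.error);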

Let us know if this works for you!

hedss commented

Thanks for the followup @cywang117!

I can completely understand the comments (and hello also to Akis and Paulo if you're reading, hope all is well!), especially users' expectation of consistent behaviour.

Whilst containers are meant to be ephemeral, Docker Compose deals with large orchestrations (as do balena multicontainer projects), and part of the reason I originally raised this was that, should I need to carry out development and debugging of an entire system (e.g. BoB, though I have no idea if you're still using that in your internal development workflow), I don't always want to 'start from scratch'. In Docker, I can simply docker-compose up after changing/rebuilding an individual service, which then slots in as a 'drop-in' alongside the rest of the system, still in the state I originally left it.

In balena that's not possible: the system as a whole is recreated, and I have to start from scratch because the FS overlays have been wiped. This is particularly frustrating in stateful systems, where I then need to repeat all the steps to reach a particular state again just to confirm I've fixed the problem in an individual service. And for clarity, I'm more interested in development (such as local mode) than in production and uploading to the balena builder/registry.

The workaround is to use restart-service, but it's not ideal and it requires extra control that isn't part of the default workflow.

Anyway, thanks for the update. I'd suggest closing this issue now if this is a 'No Fix'. Thanks!

cywang117 commented

Thanks for the context @hedss. I can understand the frustration that comes with all your work being erased on restart in this scenario.

Since you're more interested in local mode/livepush functionality that's currently lacking, would something like balena push <LOCAL_IP> with a flag like --track, which only watches the filesystems of the service(s) specified by the flag and rebuilds accordingly, resolve this issue for you? If so, there is a CLI issue open for it right now: balena-io/balena-cli#2286

hedss commented

Hi @cywang117 !

Yep, that'd do it.

Thanks!