ubuntu/zsys

zsys-commit systemd service fails and system is in degraded state

Lockszmith-GH opened this issue · 16 comments

Describe the bug
Running > systemctl is-system-running returns degraded
zsys-commit systemd service fails and system is in degraded state
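
For anyone checking their own system, the failing unit can be located with standard systemd commands:

# list the units that put the system into the 'degraded' state
systemctl --failed
# then inspect the unit itself
systemctl status zsys-commit.service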

To Reproduce
Don't really know yet how I got here.

Expected behavior
> systemctl is-system-running should return running

Error dump:
ubuntu-bug zsys report available at: https://my.lksz.me/s/pNdTSk8sxz6CY6J (link will expire May 15)

journalctl -xe output
Running: sudo journalctl -xe --unit=zsys-commit.service

May 08 14:21:11 szliving systemd[1]: Starting Mark current ZSYS boot as successful...
-- Subject: A start job for unit zsys-commit.service has begun execution
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- A start job for unit zsys-commit.service has begun execution.
-- 
-- The job identifier is 5562.
May 08 14:21:17 szliving zsysctl[376899]: WARNING An error occurred when reverting a Zfs transaction: couldn't promote "rpool/ROOT/ubuntu_ydj9gl/var/lib/d72a7b51a2302ce64ebb221e6c4c5db7b40f183bcca1500a85c0cd7b1e959481" for cleanup: couldn't refresh our internal origin and layout cache: cannot open new snapshot "rpool/ROOT/ubuntu_ydj9gl/var/lib/d72a7b51a2302ce64ebb221e6c4c5db7b40f183bcca1500a85c0cd7b1e959481@150380519": dataset does not exist - rpool/ROOT/ubuntu_ydj9gl/var/lib/d72a7b51a2302ce64ebb221e6c4c5db7b40f183bcca1500a85c0cd7b1e959481@150380519
May 08 14:21:17 szliving zsysctl[376899]: level=error msg="couldn't commit: couldn't promote dataset \"rpool/ROOT/ubuntu_ydj9gl/var/lib/0020e473cc4360e265a517ba68a4064db701e238487619767fae4a8d02b82b64\": couldn't refresh our internal origin and layout cache: cannot open new snapshot \"rpool/ROOT/ubuntu_ydj9gl/var/lib/0020e473cc4360e265a517ba68a4064db701e238487619767fae4a8d02b82b64@885281819\": dataset does not exist - rpool/ROOT/ubuntu_ydj9gl/var/lib/0020e473cc4360e265a517ba68a4064db701e238487619767fae4a8d02b82b64@885281819"
May 08 14:21:17 szliving systemd[1]: zsys-commit.service: Main process exited, code=exited, status=1/FAILURE
-- Subject: Unit process exited
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- An ExecStart= process belonging to unit zsys-commit.service has exited.
-- 
-- The process' exit code is 'exited' and its exit status is 1.
May 08 14:21:17 szliving systemd[1]: zsys-commit.service: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- The unit zsys-commit.service has entered the 'failed' state with result 'exit-code'.
May 08 14:21:17 szliving systemd[1]: Failed to start Mark current ZSYS boot as successful.
-- Subject: A start job for unit zsys-commit.service has failed
-- Defined-By: systemd
-- Support: http://www.ubuntu.com/support
-- 
-- A start job for unit zsys-commit.service has finished with a failure.
-- 
-- The job identifier is 5562 and the job result is failed.

Installed versions:

  • OS: 20.04 LTS (Focal Fossa)
  • Zsysd running version: zsysctl 0.4.5

Additional context
The system is a fairly fresh install (about a week old). I have installed a bunch of packages on it; I'm running home-assistant on this machine, so docker is probably the most prominent component.
I also added an additional user and made sure that user has a new USERDATA filesystem (used zsysctl userdata create...)
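
For what it's worth, the per-user datasets can be listed to confirm the new user really got its own USERDATA filesystem (rpool being the pool name on this install, as seen in the logs above):

zfs list -r rpool/USERDATA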

I think this has to do with the docker zfs storage driver, similar to #102: when zfs is present, docker activates it by default.
I'll try to move /var/lib/docker to its own dedicated dataset/filesystem.
If that solves the problem, I'll close the issue.
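
A quick way to check which storage driver docker picked (plain docker CLI, nothing zsys-specific):

sudo docker info --format '{{.Driver}}'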

OK, docker's zfs driver was the culprit. I solved it by destroying the older docker-created zfs filesystems and mounting /var/lib/docker on a brand-new zfs filesystem.

Below are step-by-step instructions for anyone looking to solve this:

NOTE You will need to re-pull images, and if you have any volumes, you might want to back those up. These instructions assume you understand that the commands below are destructive and that you know what they mean. Don't follow them blindly.

If you've already installed docker, you should first stop docker:

sudo systemctl stop docker.service

Now make a backup copy of the content in /var/lib/docker and destroy all docker-related zfs filesystems:

# make a copy of /var/lib/docker
sudo cp -au /var/lib/docker /var/lib/docker.bk
# assuming all of the mountpoint=legacy zfs file systems are docker's, destroy them
zfs list -r -o name,mountpoint /var/lib | grep legacy | awk '{print $1}' | xargs -n 1 sudo zfs destroy -R
sudo rm -rf /var/lib/docker
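
If you want to double-check what that pipeline will destroy, run it without the destructive tail first:

# preview only - prints the dataset names that would be fed to 'zfs destroy -R'
zfs list -r -o name,mountpoint /var/lib | grep legacy | awk '{print $1}'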

Choose one of the following options:

  • create a new filesystem on the existing rpool pool:
    sudo zfs create -o mountpoint=/var/lib/docker rpool/docker
  • create a new pool (in this example, zpool-docker) on new hardware:
    sudo zpool create -f zpool-docker -m /var/lib/docker /dev/xvdf /dev/xvdg
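
Whichever option you pick, it's worth confirming the new filesystem is actually mounted at /var/lib/docker before bringing docker back (rpool/docker here matches the first option):

zfs list -o name,mountpoint,mounted rpool/docker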

Depending on whether you are on a new system or one where docker already exists:

  • Install docker (new system only):
    sudo apt install docker.io
    sudo systemctl enable docker.service
  • Start the docker service (in both cases):
    sudo systemctl start docker.service
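
Once docker is running again, it should report zfs as its storage driver:

sudo docker info | grep -i 'storage driver'
# expected: Storage Driver: zfs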

Check out the docker documentation for more information about the docker zfs storage driver.
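
If docker ever picks a different driver on the new mount, it can also be pinned explicitly in /etc/docker/daemon.json (standard docker configuration, nothing zsys-specific; restart the service afterwards):

sudo tee /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "zfs"
}
EOF
sudo systemctl restart docker.service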

Thanks a lot for the detailed analysis and fixes! I plan to patch our docker debian package in ubuntu to create a separate dataset automatically for you.

Can you confirm that you installed the package from the ubuntu archive? (That would help us know whether we need to fix it somewhere else as well.)
In the meantime, would you be interested in turning that into a wiki page on this repo?

Thanks again :)

Yes, I used the docker.io package.

I have not installed any custom repositories.

confirmed working, thanks!

If you are interested, we have pushed a docker.io package in groovy that should fix it. We will SRU it to 20.04 LTS soon.

Thanks for checking this out, and for the details, which helped get to the bottom of this :)

I finally got to a fresh system, and installed docker.io on it.
After running a docker run command, I tested zsysctl state save --system and it didn't break this time.
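
Roughly, the check looked like this (hello-world standing in for whatever image you actually use):

sudo docker run --rm hello-world
zsysctl state save --system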

However, I'm unclear whether this is the solution you mentioned, @didrocks; wouldn't separating the docker storage from the actual system be a better solution?

@Lockszmith This is exactly what the current docker.io package from groovy is doing. We plan to SRU it to 20.04 LTS, which will handle the migration. Once ready, do you mind giving it a try?

I'm probably missing the jargon - sorry (I'm rather new to Ubuntu bug reporting). I've been a passive user for ages, but now I'm really diving in (mainly because of its zfs integration) 😉

So, having done some research, this is what I understand now:

You have it in the in-development version (20.10, aka Groovy Gorilla) and are testing it there; you plan to add it to 20.04 LTS (aka Focal Fossa) via SRU.

The solution will move the docker storage into a dataset of its own when docker.io is installed.

Hope I got everything right. If the above is accurate, then I missed a few details and thought this had already been pushed - I see it hasn't yet. (Which is fine, just my misunderstanding.)

I would be more than happy to test it when it's available, though I'll probably do it on a test-VM instead of my active system - as I've got that stabilized right now.

> Hope I got everything right. If the above is accurate, then I missed a few details and thought this had already been pushed - I see it hasn't yet. (Which is fine, just my misunderstanding.)

That's exactly right!

> I would be more than happy to test it when it's available, though I'll probably do it on a test-VM instead of my active system - as I've got that stabilized right now.

Thanks a lot! I will ping you here once this is available :)

I have a situation very similar to this one. The difference is that my /var/lib/docker has always been in a persistent dataset, so I have no idea whether the culprit is docker or something else; I need help understanding this.

The only weirdness related to docker happened a few weeks ago with the /var/lib/docker dir. It was empty after a suspend/resume cycle, so I pulled just the few images I was using, and it worked, so I thought it was not a major problem. However, after a restart the whole docker data was back, complete and working, so I realized the docker dataset had probably just not been mounted before the restart.

Thinking that everything was fine, I didn't investigate further, but now, after upgrading to 21.04, I've found the same error as in #112. The problem is that I cannot revert to any previous snapshot... going back weeks.

@Lockszmith is that consistent with what you described in the ticket? Do you think it is the same problem caused by a temporarily missing persistent dataset? Could you add some more detail about the behavior? Thanks

My issue was later identified as being caused by there being far too many snapshots for zsys to enumerate.
If you search the issues, you'll see a few of those.
If you have docker using the zfs driver and the folder is in an unmanaged location, you should be fine. But that doesn't mean you don't have other snapshots slowing down zsys's operations.
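
A quick way to gauge the scale is to count every snapshot on the machine; there's no documented threshold, this is just a sanity check:

zfs list -t snapshot -H -o name | wc -l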

Are you using another type of zfs snapshotting system (like sanoid)?

@Lockszmith thanks for the answer. I don't use anything else for snapshots.

I think a possible explanation in my case is the missed mount of the unmanaged dataset, and the use of docker in that state.

Could that have created a bogus managed dataset in the same location and taken a snapshot, and then, after the restart, could the old unmanaged dataset have resurfaced and been mounted in its place, shadowing the bogus one?

That way the bogus one would have been missing from the snapshot (the complaint from zsys/zfs), hence causing a persistent problem. Do you think that makes sense?

After that I decided to reinstall the OS anyway, since I needed a differently sized layout (12GB boot partition), so I avoided actually solving the problem 😄

> I have a situation very similar to this one. The difference is that my /var/lib/docker has always been in a persistent dataset...

I think I misread this, thinking the dataset was unmanaged - meaning it was outside the rpool/ROOT/ubuntu.... one.

If it was within a managed one, then yes, this is exactly the same issue discussed here.

It was unmanaged (aka persistent), but it didn't mount, and I used docker in that state, so I'm thinking docker used a managed one temporarily, which got shadowed by the unmanaged one at the next restart.
Just guessing, but if the managed one ended up in a snapshot, it wouldn't be found from then on.

That's a pretty good guess.

I also encountered a similar issue on another system: a freshly installed 20.04 LTS, but with a ZFS 2.0.x pool that couldn't be imported because incompatible new features were switched on.
After switching to the 'normal' update channel, upgrading the OS to 21.04, and mounting the pool, everything worked. (I'm guessing zsys also got an upgrade along with the OS, which didn't hurt it.)
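
For anyone hitting the same wall: running zpool import with no arguments scans for importable pools and flags any unsupported features, which is how this incompatibility shows up:

sudo zpool import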

Anyway, as long as things work now, that's a good thing. Glad things worked out.