ubuntu/zsys

Inordinate CPU usage. What is zsys doing? What is it even trying to do?

dcarosone opened this issue · 16 comments

Describe the bug
zsysd spends a lot of time at 100% cpu (seemingly spinning a single core/thread)

To Reproduce

  1. Install a zsys-using Ubuntu; this has been an issue for as long as I can recall zsys existing.
  2. Run the system for a while (generally not very long).
  3. Notice the fan running unexpectedly, or notice instances of #193 when doing updates (even initial post-install updates), and investigate.
  4. Notice zsysd at 100% in top.
  5. Notice that user time doesn't always seem high; it looks like the time may be going to syscalls - but this may vary.
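
For step 5, a quick way to check where the time is actually going (a sketch; pidstat comes from the sysstat package, and both commands assume the daemon is currently running):

# split CPU time into %usr vs %system for the zsysd process, sampled every 5s
pidstat -u -p "$(pidof zsysd)" 5
# or attach briefly and summarise which syscalls dominate (Ctrl-C to stop)
sudo strace -c -p "$(pidof zsysd)"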

Expected behavior
A small daemon with not much work to do, sitting quietly and updating boot states when packages are upgraded.

Screenshots

(screenshot: top showing zsysd at 100% CPU)

Installed versions:
21.04 hirsute (and previous)

❯ zsysctl version
zsysctl	0.5.8
ERROR couldn't connect to zsys daemon: timed out waiting for server handshake 

🙄 #193

# apt-cache policy zsys
zsys:
  Installed: 0.5.8
  Candidate: 0.5.8
  Version table:
 *** 0.5.8 500
        500 http://au.archive.ubuntu.com/ubuntu hirsute/main amd64 Packages
        100 /var/lib/dpkg/status

Additional context

Part of the issue is that zsysd is rather .. opaque. I have no idea what it is actually trying to do, so it's difficult to understand what would make it need to try so hard. I have some suspicions, however, about possible contributing factors and unnecessary work:

  • I think it is looking at datasets well beyond its purview. On some of these systems, I have a lot of datasets and a lot of snapshots, but they should all be outside rpool/ROOT and rpool/USERDATA. I also have my homedir datasets in rpool/home (and USERDATA has only root). It doesn't seem to have helped.
  • I think it is looking at snapshots beyond its purview. I use znapzend to take snapshots and replicate them to other systems and pools (e.g. an external USB disk) for backup. This includes (recursive) snapshots of rpool/ROOT. Though at least some time will be needed to list and match these, it shouldn't be much extra work to just match the pattern names and exclude them from consideration.
  • I think it is looking at pools outside its purview. In particular, it may be trying to evaluate all the replicated snapshots of rpool/ROOT that have been sent to the backup pool on the USB disk: even though they are both on a different pool and under a different prefix in that pool, the zsys-created snapshots are also replicated there and the (bare) snapshot names match the zsys pattern. I don't know if it's actually part of zsys or another package, but on one system that receives backups from a number of others, update-grub has this ridiculous script that tries to evaluate every snapshot of every pool (even importing pools?!?!) and can take hours, and I wonder if something similar is going on here.
  • I think it's getting triggered by events from zed (in particular, znapzend making and cleaning up snapshots), and doing a whole lot of work for each event, instead of either batching/caching/deferring work until multiple pending events have been collected, or doing only narrower, event-specific work.

However, none of the above are necessary to produce this condition, even if they maybe make it worse or more frequent. I have seen it on a freshly-installed test machine, with no local data of consequence and no znapzend yet set up. Sometimes zsys just needs to be killed and will otherwise never recover. Sometimes it recovers. Sometimes I nice it down, and it seemingly eventually exits and restarts, because soon enough it's back at 100% cpu and not nice anymore (this is where I assume it's getting triggered by zed). I'm going to look at making systemd start it niced, but TBH I tend to just stop the service sometime after boot when this happens, or just disable it entirely.
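
For the niced start, a minimal sketch of a systemd drop-in (assuming the unit name zsysd.service seen in the logs; the Nice value here is arbitrary):

# create/edit an override for the service
sudo systemctl edit zsysd.service
# add the following two lines to the override file, then save:
#   [Service]
#   Nice=10
sudo systemctl restart zsysd.service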

Is there a good description somewhere of what zsys is attempting to do (e.g. on each zed event) and why? Perhaps a large amount of this work is unnecessary, or can be configured for narrower scope?

It doesn't leave any clues about problems in logs, either.
However, the logs it does leave confirm it's spending time looking at datasets that are none of its business:

May 16 07:26:06 zsysd[3510381]: level=warning msg="CanMount property for \"rpool/sync\" has an unexpected source: \"received\""
May 16 07:26:06 zsysd[3510381]: level=warning msg="CanMount property for \"rpool/sync/Movies\" has an unexpected source: \"received\""
May 16 07:26:06 zsysd[3510381]: level=warning msg="CanMount property for \"rpool/sync/kaos\" has an unexpected source: \"received\""
May 16 07:26:07 zsysd[3510381]: level=warning msg="CanMount property for \"rpool/home/dan/.config\" has an unexpected source: \"received\""
May 16 07:26:07 zsysd[3510381]: level=warning msg="CanMount property for \"rpool/data\" has an unexpected source: \"received\""
May 16 07:26:07 zsysd[3510381]: level=warning msg="CanMount property for \"rpool/data/vm\" has an unexpected source: \"received\""
May 16 07:26:07 zsysd[3510381]: level=warning msg="CanMount property for \"oflow/sync/kaos\" has an unexpected source: \"received\""
May 16 07:26:07 zsysd[3510381]: level=warning msg="CanMount property for \"oflow/sync/Movies\" has an unexpected source: \"received\""
May 16 07:26:07 zsysd[3510381]: level=warning msg="CanMount property for \"oflow/sync/cameras\" has an unexpected source: \"received\""
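
For what it's worth, the property source those warnings refer to can be confirmed directly, e.g. for one of the flagged datasets:

# show the canmount value and where it came from
zfs get -o name,property,value,source canmount rpool/sync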

One of the issues (and no, it doesn't get triggered by ZED events when it isn't started) is that the Go libzfs bindings work well, but don't scale up to a huge number of datasets.

For instance, when you open a pool, the binding itself scans all datasets of that pool in advance, and this is where the high CPU usage comes from. One idea would be to write our own binding, tailored to ZSys's needs, for use cases like yours where there is a huge number of datasets and the current binding doesn't scale.

the zfs command suffers the same scalability issues with large numbers of datasets, for the same reason. The "solution" is to use a Lua script (a ZFS channel program) to do the heavy lifting in-kernel.
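
For illustration, a rough sketch of what that might look like with zfs program (a read-only channel program; the Lua script here is hypothetical, but the ZCP calls it uses - zfs.list.snapshots and zfs.list.children - are documented in zfs-program(8)):

# write a small channel program that gathers snapshot names under a dataset tree
cat > /tmp/list_snaps.lua <<'EOF'
-- recursively collect snapshot names (as table keys) for the dataset in argv[1]
args = ...
argv = args["argv"]

snaps = {}
function gather(ds)
    for s in zfs.list.snapshots(ds) do
        snaps[s] = true
    end
    for c in zfs.list.children(ds) do
        gather(c)
    end
end

gather(argv[1])
return snaps
EOF

# -n runs it read-only; everything is evaluated in-kernel and returned in one call
# (very large trees may need the -t instruction limit or -m memory limit raised)
zfs program -n rpool /tmp/list_snaps.lua rpool/ROOT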

Yeah, I'm familiar with "zfs list" being somewhat expensive (znapzend gets hit by this too) but there is something more going on here, like the expensive thing being called many more times than necessary. Very possibly as a kind of side-effect of the bindings layer, where it's less "deliberate" by zsys and harder to avoid.

But how often does zsys "open a pool"? As a daemon, can it not keep the pool open across multiple.. whatever-it's-doing.. to avoid this repeated startup cost? Or maybe the maintainers of the go lib are amenable to alternate interfaces that don't have the same startup behaviour?

As for writing libzfs bindings, I looked into that at one stage (for rust) but got discouraged because libzfs is apparently not considered a stable interface and seems like a potential maintenance trap, and so I wound up just targeting the zfs commands and 'parser-friendly' text output like many others. Disappointing.

Writing a channel program to optimise the operations zsys needs sounds good - but, again, what are these operations?

As explained in the blog post series (IIRC I wrote about it), the daemon wakes up via socket activation, handles the various requests, then idles.
It times out after a default period (30s) without any new active request. You can change this value in the configuration YAML file.

Indeed, that was the idea: rewrite the ZFS operations as a channel program (Lua) running in kernel space, returning only the properties we care about in a single shot, to avoid the many Go<->C round trips the current Go binding does. (Note that we already patch it to cut some of those and only handle the operations we care about, but that's still not enough when you have many
thousands of datasets.)

All the ZFS operations we use are split out into their own package: https://github.com/ubuntu/zsys/tree/master/internal/zfs. You can see there that there is even a mock, so tests can run against either the real libzfs implementation or a mock one (and simulate failures that way).

As explained in the blog post series (IIRC I wrote about it), the daemon wakes up via socket activation, handles the various requests, then idles.

I've now read through the blog posts. Highly appreciate the effort put into explaining the thought process of the design, and also many of the actual thoughts. This is good, well-intentioned work, and I now have a better understanding of what zsys is trying to do -- that wasn't so far off my original assumptions.

I also see some options to enable more logging, so I can investigate its actual activity further. But already, I can see some issues along the lines of what I outlined above: zsys is doing more work on more data than seems to be intended.

As I now understand from the blog posts, zsys wants to manage, effectively, [br]pool/(BOOT|ROOT|USERDATA). I'm not using USERDATA (other than for root) so there's little under there.

But even with the logging I pasted above, I see some discrepancies. Why is it:

  • logging stuff about datasets under rpool/sync, rpool/data, rpool/home?
    • these are in rpool and the go bindings are doing some data-gathering on open, but these logs clearly look like zsys is doing some processing and evaluation on them, and likely fetching extra properties along with the source of each
  • logging stuff about datasets under oflow/ ?
    • it shouldn't even need to open this pool

I also don't understand what seems to be waking it up so much; as far as I can tell from the blog descriptions, it should at most be the hourly user snapshot timer. Maybe it's just spending most of an hour (or more) at 100% CPU for each wakeup, and that's why it seems to be running all the time? I haven't really looked into it yet, but more logging will be useful for doing so.
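
For the record, the knobs I plan to use for that (the loglevel subcommand is the same one shown failing further down; higher values may exist, 1 is the one I've tried):

# make the daemon more verbose
zsysctl service loglevel 1
# and follow what it is doing
journalctl -fu zsysd.service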

Even if you think zsysd should examine all pools by default, for generality or to avoid encoding / enforcing naming conventions, can it at least be configurable (now that I also know there is a config file) so that admins can manually limit the pools zsys will open and thus spend cpu and iops on?

I disabled the hourly user timer, in case it was taking longer than an hour to process that event. Killed and stopped zsysd and left everything overnight.

Current status:

# zsysctl service loglevel 1
ERROR couldn't connect to zsys daemon: timed out waiting for server handshake

The log level doesn't change, and the logs contain an endless loop of:

May 20 08:08:48 systemd[1]: Starting ZSYS daemon service...
May 20 08:10:19 systemd[1]: zsysd.service: start operation timed out. Terminating.
May 20 08:10:19 systemd[1]: zsysd.service: Failed with result 'timeout'.
May 20 08:10:19 systemd[1]: Failed to start ZSYS daemon service.

So at least one of the failure modes is that it takes too long to even start; systemd gives up and kills it - likely before it can process any events off the socket - and then this repeats over and over without making any forward progress. Perhaps events are even building up in the socket queue?
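
(One way to sanity-check that, assuming the activation socket unit is named zsysd.socket - I haven't verified the name:)

# is the socket unit healthy, and are connections queueing on it?
systemctl status zsysd.socket
ss -xl | grep -i zsys   # assumes the socket path contains "zsys"; Recv-Q > 0 would suggest a backlog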

Even ignoring why opening and enumerating pools is so ludicrously expensive, there are clearly requests that can be serviced without ever touching any pools. So it seems like some reordering of operations can help:

  • read the request from the socket earlier; perhaps it can be serviced without any zfs work. This might only apply to fairly boring metadata/admin requests, but it seems like it would help with responsiveness and #193 in general.
  • processing the request will likely require the expensive operations, but reading the message from the socket in order to find out will at least let it be discarded if that work takes too long and avoid the loop (the client has already given up by this point).

I exported the pools on external usb disks. Now:

  • zsysd is able to start within the time limit given by systemd (and will exit again successfully after idling)
  • However, zsysd is still not ready within the time limit allowed by zsysctl (so the first invocation gets the #193 error)
  • If I wait, watching the logs, until zsysd is ready, subsequent zsysctl commands (those I have tried, like a user state save) work, as long as I issue them within the window before idle shutdown

The system at this point has rpool and bpool imported, from its internal NVMe SSD. They're not large or full (and were created recently as part of a fresh install of 21.04):

# zpool list
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
bpool  1.88G   369M  1.51G        -         -    29%    19%  1.00x    ONLINE  -
rpool   472G   323G   149G        -         -    38%    68%  1.03x    ONLINE  -

There are quite a few snapshots, but most of them are outside the zsys namespace:

# zfs list -t snapshot | wc -l
21989
# zfs list -t snapshot -r rpool/ROOT | wc -l
5370
# zfs list -t snapshot -r rpool/USERDATA | wc -l
17

With respect to the optimisation opportunities in the previous comment: basically all of them are outside the scope of a user state save, which was the requested operation for which zsysd was started in this instance, and which I'd expect to be the most common one given the default timers.

This is all on an idle system, with plenty of free memory, and no znapzend running (so no zfs query and snapshot activity to contend with)

By way of comparison, and establishing the baseline expectations behind my use of "Inordinate" in the issue title, on this same system, when the other pools are mounted, znapzend is able to:

  • recursively snapshot essentially everything on rpool (each of the second-level parent datasets, like rpool/ROOT, rpool/home, rpool/data, and more), as well as corresponding structures on the external USB pools
  • replicate most of these hierarchies to 2 remote destinations via ssh, and to a local destination on one of the usb disks, which includes enumerating all of the snapshots on both ends each time
  • expire stale snapshots from sources and all destinations, which includes recursive deletes as well as enumerating and fetching details for each of the child datasets individually.

And do all of this:

  • within the space of about 3-5 mins (more only if larger amounts of data need to be sent offsite over the network), scheduled every 15 mins, and with generally minimal or only-sometimes-noticeable (but acceptable) impact on interactive usage. For example, it might glitch a 60FPS youtube video a couple of times. Most of the actually-noticeable impact is, I suspect, due to the USB disk IO.
  • running at nice 10, including all the child zfs and ssh commands - it's not cpu-bound at all.
  • in about the same time it takes zsysd, sitting at 100% CPU, to fail to start up - not niced, given every resource it can use, and with as little contention as possible.

Looking at zpool iostat, it doesn't seem to be stuck on much pool IO either (i.e. for uncached metadata). I guess it's hammering syscalls?

All of the above is intended to be constructive and instructive comparison. I'm going to keep digging for specifics, as time and interest allow, but it is abundantly clear to me that zsysd has plenty of opportunity to focus on eliminating a lot of fundamentally unnecessary work and system impact / inefficiency.

To answer some of your questions:

  • You are not quite right about the timeouts; there are 3 different ones: one for the daemon to start, one for the command line to get an answer from the daemon, and finally an idle time when there is no request.
  • What you see is probably the hourly snapshot, which doesn't seem to be needed in your case as you have other software taking snapshots.

As already said, the main source of issues is opening the pool, where the Go bindings iterate over all datasets, load their properties and so on. This involves a lot of Go<->C (libzfs) exchanges, each of which does one syscall per request.

This is where the slowness comes from when you have thousands of datasets. We can't limit which datasets are loaded through the API. Consequently, the only escape path here is to do the work in kernel space using a channel program (Lua), load everything kernel-side and return it in a single shot to the API; this eliminates the issue.

So, in the non-Go world, the bindings probably behave differently, and this is why you don't see the same issue (the Go binding is really useful though, and big thanks to the developers who created it; we didn't want to reimplement everything from scratch at first).

Then, ZSys itself iterating over all loaded datasets (to be able to tell which are persistent and which are loaded as part of the system) is not an issue by itself (we already timed it).

Those are the reasons why ZFS/ZSys is labelled as experimental in the installer.

On limiting the pools we iterate over: we had requests not to have rpool and bpool hardcoded, which makes sense. However, an option for the user to limit this via configuration also makes sense. I'm happy to review and merge any good patch for it with the same level of tests as the rest of the code.

To answer some of your questions:

* You are not quite right about the timeouts; there are 3 different ones: one for the daemon to start, one for the command line to get an answer from the daemon, and finally an idle time when there is no request.

Yeah. In my normal configuration, the first of these is failing. In the minimal configuration, the second fails (but the daemon starts before the first expires).

* What you see is probably the hourly snapshot, which doesn't seem to be needed in your case as you have other software taking snapshots.

Well.. yes, kind of. This is the scenario that makes it happen so much. But in practice, any invocation (including by apt, or the nightly gc) will get into this state. And at least as far as I can see, once it gets there the only way out is to stop the service, because it doesn't make progress past the first startup timeout.

As already said, the main source of issues is opening the pool, where the Go bindings iterate over all datasets, load their properties and so on. This involves a lot of Go<->C (libzfs) exchanges, each of which does one syscall per request.

This is where the slowness comes from when you have thousands of datasets. We can't limit which datasets are loaded through the API. Consequently, the only escape path here is to do the work in kernel space using a channel program (Lua), load everything kernel-side and return it in a single shot to the API; this eliminates the issue.

So, in the non-Go world, the bindings probably behave differently, and this is why you don't see the same issue

Yes. For example, znapzend implements bindings on top of the zfs commands, pulling data (or additional detail) more or less on demand (with some caching). It makes a different tradeoff: sometimes it has to fetch the same data multiple times (but the data is also more current), and it mostly fetches only the specific data needed for a given operation (snapshots or children of particular datasets, etc.).

(the Go binding is really useful though, and big thanks to the developers who created it; we didn't want to reimplement everything from scratch at first).

Then, ZSys itself iterating over all loaded datasets (to be able to tell which are persistent and which are loaded as part of the system) is not an issue by itself (we already timed it).

Yep. Once it's loaded, zsys reports a whole lot of detail (even about unrelated datasets it should have no interest in), and does so quickly. There's no harm in that, if the data is already collected, but the effort of collecting it is the problem. It now seems clear that this effort originates in the dependency chain.

Now I totally empathise with and endorse using a good library, there is absolutely no criticism there! But the limits of that library and the assumptions built into its interface are being hit: zsys isn't asking for this data, won't have any use for it (so the cost of fetching it is wasted), and has no interface to prevent it being fetched (so the cost is unavoidable).

Aside: one possible work-around for me is to make the idle timeout so large that, basically, once zsysd starts up, it essentially never exits, and thus later invocations can get an API response in time (e.g. for apt, which currently basically never succeeds). But given the current assumptions of the library, would this then be working from weeks-old cached data? It may be aware of changes made through its own interface, but will it see external changes made to the pool?

Those are the reasons why ZFS/ZSys is labelled as experimental in the installer.

Ultimately the result impacts zsys, and some changes are going to be needed to progress the experiment. Maybe the library authors can help adapt to this use case, maybe zsys needs to implement something different; that implementation and timing discussion is for you.

On limiting the pools we iterate over: we had requests not to have rpool and bpool hardcoded, which makes sense. However, an option for the user to limit this via configuration also makes sense. I'm happy to review and merge any good patch for it with the same level of tests as the rest of the code.

That's rather unlikely to come from me, alas I'm no go programmer. At this point, I'm mostly using this issue to journal my own investigations, and seek confirmation of findings (which you've been very helpful with, thanks). Hopefully others with similar questions can find them addressed here too.

Along the way, if there are specific recommendations that seem worthwhile and agreeable, I'm happy to raise them as separate smaller feature requests so they're not buried in here, and perhaps other contributors can pick them up. So far, I see two:

  • config to limit which pools zsysd examines (as here)
  • looking at socket messages earlier in startup, before opening pools, in case the message can be handled without opening pools at all (or only opening specific ones)

zsys isn't asking for this data, won't have any use for it (so the cost of fetching it is wasted)
We still need some data from "out of scope" datasets. One example we hit (and which is covered by tests) is someone cloning a dataset, for instance from rpool/USERDATA/@snapshot to rpool/my_clone.
To make a correct assumption when computing garbage collection, we need to know that rpool/my_clone is a dependency, and so pin the dataset.

Aside: one possible work-around for me is to make the idle timeout so large that, basically, once zsysd starts up, it essentially never exits, and thus later invocations can get an API response in time (e.g. for apt, which currently basically never succeeds). But given the current assumptions of the library, would this then be working from weeks-old cached data? It may be aware of changes made through its own interface, but will it see external changes made to the pool?
Yes, this is quite possible: hook into the zed events you want and, from there, call zsysctl service refresh, which will make the daemon refetch the list of datasets.
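
A minimal sketch of such a hook, for anyone following along (the zedlet name and the ZEVENT_* variables follow the pattern of the stock history_event-zfs-list-cacher.sh zedlet; treat the event names as assumptions and verify against your zed version):

#!/bin/sh
# /etc/zfs/zed.d/history_event-zsys-refresh.sh (must be executable)
# ask zsysd to refresh its dataset view after snapshot create/destroy events
case "${ZEVENT_HISTORY_INTERNAL_NAME}" in
    snapshot|destroy)
        zsysctl service refresh
        ;;
esac
exit 0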

We can indeed duplicate this bug to the perf one and open smaller tasks.

config to limit which pools zsysd examines (as here)
This could be a separate bug yeah.

looking at socket messages earlier in startup, before opening pools, in case the message can be handled without opening pools at all (or only opening specific ones)

I'm unsure about that one. I really think we first need to get the bindings addressed and have ZSys hold the whole state. There is not only one client, but many of them, for different users, potentially requesting different things. The daemon isn't there just to answer one request most of the time.

So limiting what it sees would be a source of new bugs, and we would need to refresh/reconcile/ensure clones are still detected (which needs the whole dataset map anyway). I think performance of the bindings is the main blocking point here, and we should be able to gather all the data in one call.

We still need some data fom "out of scope" datasets. Some example that we hit (and are covered by test) is someone cloning a dataset for instance from rpool/USERDATA/@snapshot to rpool/my_clone.
To be able to have a good asumption when computing garbage collection, we need to know that rpool/my_clone is a dep, and so, pin the dataset.

I know it's only one example, but this one can be addressed within the original namespace, by looking at the clones property of the snapshot. That might be enough by itself, or you might need to go and get more info about the clone - but you can look if/when you need it, in this rare case, rather than all the time (once the bindings are changed).
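
For illustration, both sides of that relationship are visible with ordinary property queries (the dataset names here are just hypothetical examples based on the comment above):

# clones that originate from a given snapshot (example names, not real datasets)
zfs get -H -o value clones rpool/USERDATA/user@autozsys_example
# and, from the other side, a clone's origin snapshot
zfs get -H -o value origin rpool/my_clone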

Warning and disclaimer: I'm a total noob when it comes to ZFS; it just happens to be installed on my system. The workaround below will destroy snapshots and possibly even more stuff that I'm not aware of. Read and use at your own risk. It solved the symptoms for me; for the cause and a permanent fix you need a more experienced doctor.


I had the same issue as described above: zsysd constantly consuming 100% CPU.
The logs would print, over and over:

Starting ZSYS daemon service...
zsysd.service: start operation timed out. Terminating.
zsysd.service: Failed with result 'timeout'.
Failed to start ZSYS daemon service.
zsysd.service: Consumed 1min 29.299s CPU time.
Starting ZSYS daemon service...
zsysd.service: start operation timed out. Terminating.
zsysd.service: Failed with result 'timeout'.
Failed to start ZSYS daemon service.
zsysd.service: Consumed 1min 27.824s CPU time.
Starting ZSYS daemon service...

Running zfs list would take up to 5 minutes and return at least 50000 entries (that's the max of my terminal scroll buffer), many of them over 1 GB.
Most of the snapshots printed have an @autozsys suffix. There was an earlier issue with this tool where apt upgrade would take a snapshot between every single package, not just one before each apt upgrade. I guess that's why there are so many of them.
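
A quick way to see how much of the total is zsys-created, matching on that @autozsys naming:

# count zsys-created snapshots vs. the overall total
zfs list -H -o name -t snapshot | grep -c '@autozsys'
zfs list -H -o name -t snapshot | wc -l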

So I decided to go nuclear and run the following command, which removes all snapshots unconditionally:

# uncomment the next line and run it only if you don't care about your data
#zfs list -H -o name -t snapshot | xargs -n1 zfs destroy -R

Immediately after, zsysd seemed to start up OK and CPU usage was no longer pegged.

zsysd.service: Consumed 1min 2.940s CPU time.
Starting ZSYS daemon service...
Started ZSYS daemon service.
zsysd.service: Deactivated successfully.
zsysd.service: Consumed 1min 7.825s CPU time.

After this, running zfs list only takes around 3 seconds and returns only 375 entries. The rest of the system is working fine so far.
UPDATE: Docker build and pull got broken by this. Inside /var/lib/docker/image/zfs/layerdb/sha256 there are pointers to the zfs pools. The docker prune options were not enough. Completely reinstalling docker-ce and wiping /var/lib/docker/ repaired the system, with the caveat that all old images were deleted.

#224 seems to be another example

At this point the issue seems to be well understood by the devs, and needs a rewrite (within the library they're using, or within zsysd to access zfs state more directly) to resolve. In the meantime, minimising the number of snapshots kept is a workaround for the inherent scalability problem of grabbing all state data on startup, rather than incrementally / on-demand as other similar tools do.

Removing the snapshots:

# destroy every snapshot under the zsys-managed namespaces
# (note: -R also destroys any clones that depend on these snapshots)
zfs list -H -o name -t snapshot -r rpool/ROOT | xargs -n1 zfs destroy -R
zfs list -H -o name -t snapshot -r rpool/USERDATA | xargs -n1 zfs destroy -R
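
A somewhat more targeted variant, for anyone who (unlike here) also keeps non-zsys snapshots under these datasets and only wants to drop the zsys-created ones (assuming they all carry the @autozsys naming):

# destroy only snapshots whose names match the zsys pattern
zfs list -H -o name -t snapshot -r rpool/ROOT rpool/USERDATA | grep '@autozsys' | xargs -n1 zfs destroy -R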