bsdlabs/zfs

OpenZFS: As a sysadmin I can tune prefetcher for data/metadata so I can optimize IO to workload


koobs commented

Summary

Add configuration tunable(s) and supporting code to the ZFS prefetcher that allow users to enable/disable prefetching per data class (data, metadata). For workloads where this is suitable, it reduces IO and improves cache utilisation (lower ZFS ARC memory usage for blocks that will never be hit).

Background

ZFS has a "prefetch" subsystem[1] which, when enabled, reads additional data off disk, speculating that those blocks may be required soon and can then be served from the cache. More information on the prefetch mechanism (note this is the file-level prefetch subsystem, not the vdev cache) can be found here:

ZFS currently only provides a global switch for enabling the prefetcher (vfs.zfs.prefetch.disable).

There are, however, subsystem-specific prefetch toggles (again global) for L2ARC, dedup, and scrub:

  • vfs.zfs.l2arc.noprefetch
  • vfs.zfs.dedup.prefetch
  • vfs.zfs.no_scrub_prefetch
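For reference, a sketch of how the existing global and subsystem toggles might be set today via sysctl.conf. Placement in /etc/sysctl.conf assumes the tunables are runtime-settable on your OpenZFS version (verify with `sysctl -d`), and the stated defaults and on/off semantics are my reading of recent OpenZFS and should be double-checked:

```
# /etc/sysctl.conf -- disable all speculative (file-level) prefetch globally
vfs.zfs.prefetch.disable=1

# Existing subsystem-specific toggles (believed defaults shown; note the
# mixed semantics: l2arc.noprefetch and no_scrub_prefetch disable prefetch
# when set to 1, while dedup.prefetch enables it when set to 1)
vfs.zfs.l2arc.noprefetch=1
vfs.zfs.dedup.prefetch=0
vfs.zfs.no_scrub_prefetch=0
```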

These are optimizations already in place for prefetch in particular cases. This feature essentially extends the existing capability to data classes (data, metadata).

Motivation

For some workloads, datasets and system configurations (available memory for ZFS, etc.), the hit rate of prefetch for data (or metadata: unverified) can be so low that it provides no net benefit over the additional IO spent prefetching.

In my testing, at least for my workload (heavy metadata, 8 GB memory, vfs.zfs.arc.meta_limit_percent: 100), arcstats.prefetch_data has a hit rate of close to zero (and always < 5%), even below that of the vdev cache, which has been disabled by default for "not having a benefit in most cases".

Note: While the stats below represent only a short uptime, longer uptime/workload stats are identical.

│                       Total     MFU     MRU    Anon     Hdr   L2Hdr   Other
│     ZFS ARC           3902M   1465M    638M   1612M  18484K       0    152M
│
│                                Rate   Hits Misses | Total Rate   Hits Misses
│     arcstats                  : 96%   2079     83 |        98%  2520k  47136
│     arcstats.demand_data      : 89%    388     44 |        96%   769k  24451
│     arcstats.demand_metadata  : 99%   1645     13 |        99%  1732k  14439
│     arcstats.prefetch_data    :  0%      0      7 |         0%      0   2661
│     arcstats.prefetch_metadata: 70%     46     19 |        77%  19421   5585
│     zfetchstats               : 92%     24      2 |        38%   5348   8576
│     arcstats.l2               :  0%      0      0 |         0%      0      0
│     vdev_cache_stats          : 58%     18     13 |        30%   6120  13714
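The per-class rates above are simply hits / (hits + misses). As a quick sanity check, a minimal shell sketch reproducing the arcstats.prefetch_data total from the table (0 hits, 2661 misses):

```shell
#!/bin/sh
# Hit rate = hits / (hits + misses); figures taken from the
# arcstats.prefetch_data "Total" columns in the table above.
hits=0
misses=2661
awk -v h="$hits" -v m="$misses" 'BEGIN {
    rate = (h + m > 0) ? 100 * h / (h + m) : 0
    printf "prefetch_data hit rate: %.1f%%\n", rate
}'
# Prints: prefetch_data hit rate: 0.0%
```

The same arithmetic on arcstats.demand_metadata (1732k hits, 14439 misses) gives the 99% shown above, which is what motivates keeping metadata prefetch enabled while disabling it for data.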

It would be valuable to be able to limit or disable (to the extent possible) prefetching for 'data' blocks.

Considerations

  • Some prefetcher improvements landed very recently in OpenZFS: openzfs#13452. I have not reviewed the changes in depth, but they appear to be limited to better prefetch 'scaling' and 'performance'. They may however affect the complexity, understanding or design of this feature request, so they are worth noting here.
  • Q: Should there be separate tunables for data and metadata? Having both disabled is tantamount to having prefetch disabled entirely, or is it? Is any other prefetching happening that is neither data nor metadata? If not, and data/metadata covers everything, how do we name and organize the tunable(s) with respect to naming, values, etc.?

Example: prefetch_{data,metadata}: 0|1 vs prefetch_{data,metadata}_disable:1|0. OpenZFS might already have a pattern/policy/guidelines/precedents for this.
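To make the two naming options concrete, a hypothetical sysctl.conf fragment. Neither tunable exists yet; the names and sysctl placement are illustrative only:

```
# Option A: positive-sense enable switches (hypothetical)
vfs.zfs.prefetch.data=0
vfs.zfs.prefetch.metadata=1

# Option B: negative-sense disable switches, matching the existing
# vfs.zfs.prefetch.disable pattern (hypothetical)
vfs.zfs.prefetch.data_disable=1
vfs.zfs.prefetch.metadata_disable=0
```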

  • Q: My case seeks to disable prefetching of data blocks (leaving only metadata prefetch). Is the reverse ability, disabling metadata prefetch (leaving only data), likely to be useful/valuable? Gut feel says yes, if for nothing else than to 'have tunables covering all available data classes', rather than speculating about what may or may not be useful for any possible workload.