OpenZFS: As a sysadmin I can tune prefetcher for data/metadata so I can optimize IO to workload
Summary
Add configuration tunable(s) and supporting code to the ZFS prefetcher so users can enable/disable prefetching per data class (data, metadata). For workloads where the prefetcher is a poor fit, this would reduce IO and improve cache utilisation (less ZFS ARC memory spent on blocks that will never be hit).
Background
ZFS has a "prefetch" subsystem [1] which, when enabled, reads additional data off disk, speculating that those blocks will be required soon and can then be served from cache. More information on the prefetch mechanism (note this is the file-level prefetch subsystem, not the vdev cache) can be found here; a short sketch of the call path follows the references:
- [1] https://docs.oracle.com/cd/E26505_01/html/E37386/chapterzfs-4.html
- [2] https://cuddletech.com/2009/05/understanding-zfs-prefetch/
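For orientation, the sketch below shows in rough outline how the DMU read path drives the prefetcher. It paraphrases OpenZFS internals (module/zfs/dmu.c and module/zfs/dmu_zfetch.c); exact signatures vary between releases, so treat the details as assumptions rather than verbatim source:

```c
/*
 * Illustrative paraphrase of the DMU read path (not verbatim OpenZFS).
 * Every dnode embeds a zfetch_t stream detector.  On each read the DMU
 * reports which block range was touched; when the detector recognizes
 * a sequential stream, it issues speculative reads ahead of the
 * application so that later demand reads hit the ARC instead of disk.
 */
dmu_zfetch(&dn->dn_zfetch, blkid, nblks,
    B_TRUE,	/* fetch_data: also prefetch L0 data blocks, not only
		 * the indirect (metadata) blocks above them */
    B_FALSE);	/* have_lock: caller does not hold the dnode lock */
```

Notably, this internal interface already distinguishes the two classes this request cares about: callers that only want indirect blocks warmed pass fetch_data = B_FALSE. A per-class tunable could plausibly hook in at exactly that point.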
ZFS currently provides only a global switch for the prefetcher (vfs.zfs.prefetch.disable).
There are, however, subsystem-specific prefetch toggles (again global) for L2ARC, dedup, and scrub:
- vfs.zfs.l2arc.noprefetch
- vfs.zfs.dedup.prefetch
- vfs.zfs.no_scrub_prefetch
These are prefetch optimizations already in place for particular cases; this feature request essentially extends that existing capability to data classes (data, metadata).
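For reference, the existing global switch is an ordinary module parameter in module/zfs/dmu_zfetch.c. The sketch below paraphrases how it is declared, gated, and exported; exact wording and signatures differ between OpenZFS releases, so this is illustrative rather than authoritative:

```c
/* Paraphrased from module/zfs/dmu_zfetch.c; illustrative only. */
int zfs_prefetch_disable = B_FALSE;

void
dmu_zfetch(zfetch_t *zf, uint64_t blkid, uint64_t nblks,
    boolean_t fetch_data, boolean_t have_lock)
{
	/* The lone global gate: bail out before any stream detection. */
	if (zfs_prefetch_disable)
		return;
	/* ... sequential-stream detection and speculative issue ... */
}

/* Exported as the zfs_prefetch_disable module parameter on Linux and
 * the vfs.zfs.prefetch.disable sysctl on FreeBSD. */
ZFS_MODULE_PARAM(zfs_prefetch, zfs_prefetch_, disable, INT, ZMOD_RW,
	"Disable all ZFS prefetching");
```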
Motivation
For some workloads, datasets, and system configurations (available memory for ZFS, etc.), the prefetch hit rate for data (and possibly metadata: unverified) can be so low that prefetching provides no net benefit over the additional IO it incurs.
In my testing, at least for my workload (heavy metadata, 8 GB memory, vfs.zfs.arc.meta_limit_percent: 100), arcstats.prefetch_data has a hit rate of close to zero (and always < 5%), even below that of the vdev cache, which has been disabled by default for "not having a benefit in most cases".
Note: while the stats below represent only a short uptime, stats from longer uptimes/workloads are identical.
                            Total   MFU    MRU    Anon    Hdr     L2Hdr  Other
ZFS ARC                     3902M   1465M  638M   1612M   18484K  0      152M

                            Rate  Hits  Misses | Total Rate  Hits   Misses
arcstats                  :  96%  2079      83 |        98%  2520k   47136
arcstats.demand_data      :  89%   388      44 |        96%   769k   24451
arcstats.demand_metadata  :  99%  1645      13 |        99%  1732k   14439
arcstats.prefetch_data    :   0%     0       7 |         0%      0    2661
arcstats.prefetch_metadata:  70%    46      19 |        77%  19421    5585
zfetchstats               :  92%    24       2 |        38%   5348    8576
arcstats.l2               :   0%     0       0 |         0%      0       0
vdev_cache_stats          :  58%    18      13 |        30%   6120   13714
It would be valuable to be able to limit or disable, to the extent possible, prefetching for 'data' blocks.
Considerations
- Some prefetcher improvements landed very recently in OpenZFS: openzfs#13452. I have not reviewed the changes in depth, but they appear to be limited to better prefetch 'scaling' and 'performance'. They may, however, affect the complexity, understanding, or design of this feature request, so they are worth noting here.
- Q: Separate tunables for both data and metadata? Having both disabled would be tantamount to disabling prefetch entirely, or would it? Is there any other prefetching happening that is neither data nor metadata? If data/metadata covers everything, how do we name/organize the tunable(s) with respect to naming, values, etc.? For example, prefetch_{data,metadata}: 0|1 versus prefetch_{data,metadata}_disable: 1|0. OpenZFS may already have a pattern/policy/guidelines/precedent for this (see the sketch after this list).
- Q: My case seeks to disable prefetching of data blocks (leaving only metadata prefetching). Is the reverse ability, disabling metadata prefetching (leaving only data), likely to be useful/valuable? Gut feel says yes, if for nothing else than to 'have tunables covering all available data classes', rather than speculating about what may or may not be useful for any given workload.
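To make the naming question concrete, here is a minimal sketch of what per-class toggles might look like, following the existing zfs_prefetch_disable pattern. The names zfs_prefetch_data_disable / zfs_prefetch_metadata_disable are hypothetical, chosen purely for illustration; whichever polarity and naming convention OpenZFS prefers would apply:

```c
/* Hypothetical per-class toggles (names illustrative only), following
 * the existing zfs_prefetch_disable pattern in module/zfs/dmu_zfetch.c. */
int zfs_prefetch_data_disable = B_FALSE;
int zfs_prefetch_metadata_disable = B_FALSE;

void
dmu_zfetch(zfetch_t *zf, uint64_t blkid, uint64_t nblks,
    boolean_t fetch_data, boolean_t have_lock)
{
	if (zfs_prefetch_disable)
		return;
	/* Demote a data prefetch to metadata-only (indirect blocks)
	 * when data prefetching is disabled... */
	if (fetch_data && zfs_prefetch_data_disable)
		fetch_data = B_FALSE;
	/* ...and skip entirely once metadata prefetching is disabled
	 * too.  In this sketch, setting both toggles is equivalent to
	 * zfs_prefetch_disable=1, answering the question above. */
	if (!fetch_data && zfs_prefetch_metadata_disable)
		return;
	/* ... existing stream detection and speculative issue ... */
}

ZFS_MODULE_PARAM(zfs_prefetch, zfs_prefetch_, data_disable, INT, ZMOD_RW,
	"Disable speculative prefetch of data blocks");
ZFS_MODULE_PARAM(zfs_prefetch, zfs_prefetch_, metadata_disable, INT, ZMOD_RW,
	"Disable speculative prefetch of metadata blocks");
```

On FreeBSD these would presumably surface as vfs.zfs.prefetch.data_disable and vfs.zfs.prefetch.metadata_disable, alongside the existing vfs.zfs.prefetch.disable.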