openzfs/zfs

ZFS Fragmentation: Long-term Solutions

bitshark opened this issue · 9 comments

Hey guys.. This is somewhat of a question, issue, and "viability of a feature request" all rolled into one.

First, ZFS is awesome, and I use it to host VMs. I am looking seriously into ZFS as a solution for enterprise OpenStack deployments.

My question is on fragmentation (I think it is a problem, but I'm no expert in this like you guys). Here are a few links that I've been reading on ZFS fragmentation:

Links to data which may indicate problems with ZFS fragmentation:
http://blog.delphix.com/uday/2013/02/19/78/
https://bartsjerps.wordpress.com/2013/02/26/zfs-ora-database-fragmentation/
http://myverylittletricks.net/code/?page_id=318
http://thomas.gouverneur.name/2011/06/20110609zfs-fragmentation-issue-examining-the-zil/

Is it correct that ZFS fragmentation is a significant issue under certain workloads? What's the best way to avoid it using the latest zfsonlinux codebase?

  • What is the state, and the feasibility, of a long-term fix for ZFS fragmentation?
  • Is there a BPR solution? I noticed references to a rumored fix for ZFS fragmentation based on BPR (block pointer rewrite), i.e. rewriting how block pointers are handled in the filesystem.
  • Is that correct or incorrect? Is BPR a technically viable solution, and a realistic one (e.g. does it break backward compatibility)? Can someone state the problem as a general computer science problem, even if it's impossible to solve? (I'm genuinely curious about this.)

Here are mentions I found of fixing fragmentation with the supposed BPR code...

  • "BPR is still vaporware - no demonstrable code exists, much less a stable implementation. A send-receive cycle is indeed likely to help in getting a defragmented pool, but this will mean downtime for the dataset sent/received. "

http://serverfault.com/questions/511154/zfs-performance-do-i-need-to-keep-free-space-in-a-pool-or-a-file-system

  • "Been fighting this issue degenerate performance issue on every ZFS system I have deployed. Sometimees I long for UFS and raw devices especially for RDBMS.
  • From what I read BPR would solve it but doesn’t seem like anyone will add this maybe it is just too complex. For know copy things to fresh zpool/spindles and recycle the old.
  • The bottom line toss 70% of your storage to limit the performance drop to just “50%” in a high random IO write environment. Your chart although modeling real life if very discouraging."

http://blog.delphix.com/uday/2013/02/19/78/

  • "Fragmentation does remain a long-term problem of ZFS pools. The only real answer at the moment is to move the data around -- eg: zfs send|zfs recv it to another pool, then wipe out the original pool and recreate, then send back.
  • The 'proper' fix for ZFS fragmentation is known -- it is generally referred to as 'block pointer rewrite', or BPR for short. I am not presently aware of anyone actively working on this functionality, I'm afraid.
  • For most pools, especially ones kept under 50-60% utilization that are mostly-read, it could be years before fragmentation becomes a significant issue. Hopefully by then, a new version of ZFS will have come along with BPR in it."

http://nex7.blogspot.com/2013/03/readme1st.html
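For reference, the send/receive workaround described in that quote looks roughly like this. It is only a sketch, assuming a second pool (here called "backup") with enough capacity to hold everything; "tank" is the fragmented pool:

    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs recv -F backup/tank
    # verify the copy, then destroy and recreate the original pool
    zpool destroy tank
    zpool create tank <your vdev layout>
    zfs snapshot -r backup/tank@return
    zfs send -R backup/tank@return | zfs recv -F tank

Because the receive writes everything back out sequentially, the data ends up defragmented, at the cost of the downtime and spare capacity the quotes above mention.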

Apologies for going off-topic: I am also using ZFS to host VMs, but there is scarce guidance on how to optimize ZFS options and VM options for such use. Do you know of/have any resources to direct me to?

BPR answer here: https://www.youtube.com/watch?v=G2vIdPmsnTI#t=44m53s


OK, wow, this guy is awesome. Thanks for the video; I love it, and I've transcribed the relevant part here:

TL;DR: "It's like changing your pants while you're running... deleting snapshots, creating snapshots, while you're changing what those snapshots are trying to reference."

Matt Ahrens on ZFS / "Block Pointer Rewrite project for ZFS fragmentation"
-=-=-=-=-=-

"So BP Rewrite is a project I was working on at Sun. And the idea was.. uh.. very all encompassing. We would be able to take .. .any... block... on disk and be able to modify anyway we need to. Allocate it somewhere else, or change the compression, change the checksum, de-dup the block, or not de-dup it... And change ..uh... keep track of that change.

It was called BP Rewrite because we need to change the block pointer... to point to some new block. ... So uh... the tricky part... Well.. .The straightforward implementation is to traverse the blocks .. the tricky thing about doing this on ZFS...

(1) One is that you can have... there can be many pointers to a block, because of snapshots and clones. One block on disk can be pointed to by ten different clones. It creates this problem where, if you have a bunch of instances of that block pointer, when [we] traverse all the block pointers, I'm going to visit that old block pointer several times. We need to REMEMBER that we changed from this old block pointer to the new block pointer. In other words, we moved this particular block from place A to place B.

So that if I see another pointer to place A, then I know to change it to place B. This creates a performance problem, because you end up having to have a giant hash table that maps from the old location to the new location. If you're familiar with ZFS dedup, then you're aware that it also involves a giant hash table, mapping from a block's checksum to the location on disk where it's stored, and, uh... a ref count. And if you've ever used dedup in practice on very large data sets, then you're probably aware that the performance of that is not very great. So there are similar performance problems with BP Rewrite.

And there's also some additional trickiness because ZFS is very full-featured. The space used by a given block is accounted for in many different places, in many different layers of ZFS. So for example, [space used] is counted in the dnode, so that each file knows how much space it's using. So [that would be] like ls -al or df. That counts up the amount of space. It's also accounted for in the DSL layer in a bunch of different places... so, you know, in zfs list, the space used by each filesystem. Which impacts all the snapshots, and all the parent filesystems, because the space is inherited up the tree, in terms of the space used.

There's a bunch of different places that space accounting needs to happen.. Making sure all those [numbers] get updated accurately when a block changes size is very tricky. So for all those reasons.. as I was working on this... I was very concerned that this would be the last feature ever implemented in ZFS -- because... uhm... As most programmers know, magic does not layer well.

And BP Rewrite was definitely magic, and it definitely broke a lot of the layering in ZFS. It needed code in several different layers to know -- to have intimate knowledge of -- how this all worked.

On top of all that, we were doing it... we wanted to be able to do this live, on a live filesystem. This is very important if [running BP rewrite] is going to take weeks, because of the performance issues. It's like changing your pants while you're running... deleting snapshots, creating snapshots, while you're changing what those snapshots are trying to reference.

So those are some of the issues.. with that. Uhm the uh... because of my concerns about the layering, I'm actually kind of glad that project was not completed. Because I think it would have had some big implications on the difficulty of adding other features after it.

I don't think anyone is attempting a full-on, BP-rewrite-anything kind of implementation. Some people have looked at it from a restricted standpoint... It would depend on the type of implementation.

A separate utility based on libzpool, without adding code to it, would be great. I would very much welcome a separate utility that lets you BP-rewrite your stuff off-line. The issue would be how deeply the utility's fingers would be stuck into that [library] code.

But you still have the issue of the performance of having this giant hash table, and you have to update the accounting at every [ZFS] layer."

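(As an aside, to get a concrete sense of the "giant hash table" Matt compares this to: on a pool that already uses dedup, the existing dedup table can be inspected with the commands below; "tank" is a hypothetical pool name.)

    zpool status -D tank    # histogram of dedup table entries
    zdb -DD tank            # more detailed DDT statistics

Multiplying the entry count by the per-entry size gives an idea of how much memory a comparable old-location-to-new-location remap table would need.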

Well... what about putting that giant hash table (the block-pointer remap table, like the one ZFS dedup uses) in a network hash table?

For example: I have a 250 GB disk that needs BP rewrite. Why can't a standalone utility build the multi-gigabyte table of block mappings and then upload it to the network (to Redis or S3 or something), in 1 GB chunks or so?

Then the standalone utility downloads 1 GB or so of the giant hash table at a time and stores it in memory, and rewrites the blocks it can handle with the portion of the table currently in memory.

Alternatively, it could keep nothing in RAM and just fetch everything it needs over the network. What's the problem with assuming 'unlimited' space for the BP rewrite table by using cloud-based hash tables?

What about:

  1. Find all pointers to the old block
  2. Copy the block's data, update the pointers
  3. Find all new snapshots/clones created after step 1
  4. Update their pointers.

No need for the hash table. Another option would be to do this offline (no need for steps 3 or 4); that would be better than send/receive because it would not require temporarily putting many terabytes of data somewhere else.

An offline de-fragmentation solution is interesting and might be more viable. Is there any discussion on this topic?
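For what it's worth, ZFS already ships offline machinery that walks every block pointer in a pool: zdb's block traversal. A hypothetical offline defragmenter would need a similar walk plus the actual rewrite step. To get a rough sense of the cost of the traversal alone (read-only; "tank" is a hypothetical pool name):

    zdb -bb tank    # traverses all block pointers and prints block statistics by type

On a large pool this alone can take hours, which hints at how long an offline rewrite pass would run.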

bassu commented

I am surprised that nobody mentioned a dedicated SLOG to tackle this problem.
Reference: ZIL and Data Fragmentation
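For anyone landing here: a separate log device mainly helps with fragmentation caused by synchronous writes landing as ZIL blocks inside the main pool (as the referenced article describes); it does not help with fragmentation from ordinary copy-on-write churn. Adding one is a single command. A sketch, with hypothetical device paths; mirroring the log device is advisable:

    zpool add tank log /dev/disk/by-id/nvme-slog-a
    # or mirrored:
    zpool add tank log mirror /dev/disk/by-id/nvme-slog-a /dev/disk/by-id/nvme-slog-b
    zpool status tank    # the new "logs" vdev should show up here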

I'm running 32 TB with 28% fragmentation; note that this FRAG number describes fragmentation of the pool's free space, not of the files or data themselves.

One possible workaround is the shake utility; however, it's best to have a fragmentation report first rather than blindly "shaking" everything.
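At the pool level, the fragmentation figure (which describes free space) can be pulled with standard commands; "tank" is a hypothetical pool name:

    zpool list -o name,size,allocated,free,fragmentation,capacity tank
    zpool get fragmentation,capacity tank

A per-file report is harder to come by, which is why blindly running a tool like shake over everything is risky.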

If I'm understanding correctly, there is still (7 years later) no long-term solution to fragmentation once it occurs. There are suggestions that one should avoid using more than 70% / 80% / 96% of the pool to limit fragmentation; otherwise, the only option is to build a second storage pool of the same size, move the data there, and destroy the old pool. For my current (home) use case that's not an option, so I will have to live with fragmentation for another year or two until it is time for my next NAS upgrade. I got my pool to 95% full recently and am currently sitting at 10% fragmentation; performance is (thus far) unaffected.