openzfs/zfs

Support fallocate(2)

Closed this issue · 50 comments

Observed by xfstests 075, fallocate(2) is not yet supported

"fallocate is used to preallocate blocks to a file. For filesystems which support the fallocate system call, this is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This is much faster than creating a file by filling it with zeros."

QA output created by 075
brevity is wit...

-----------------------------------------------
fsx.0 : -d -N numops -S 0
-----------------------------------------------
fsx: main: filesystem does not support fallocate, disabling
: Operation not supported

-----------------------------------------------
fsx.1 : -d -N numops -S 0 -x
-----------------------------------------------

-----------------------------------------------
fsx.2 : -d -N numops -l filelen -S 0
-----------------------------------------------
fsx: main: filesystem does not support fallocate, disabling
: Operation not supported

-----------------------------------------------
fsx.3 : -d -N numops -l filelen -S 0 -x
-----------------------------------------------                                             

It is potentially difficult to meaningfully implement fallocate() for ZFS, or any true COW filesystem. The intent of fallocate is to pre-allocate/reserve space for later use, but with a COW filesystem the pre-allocated blocks cannot be overwritten without allocating new blocks, writing into the new blocks, and releasing the old blocks (if not pinned by snapshots). In all cases, having fallocated blocks (with some new flag that marks them as zeroed) cannot be any better than simply reserving some blocks out of those available for the pool, and somehow crediting a dnode with the ability to allocate from those reserved blocks.

Exactly. Implementing this correctly would be tricky and perhaps not that valuable, since fallocate(2) is Linux-specific. I would expect most developers to use the more portable posix_fallocate(), which presumably falls back to an alternate approach when fallocate(2) isn't available. I'm not aware of any code which will be too inconvenienced by not having fallocate(2) available... other than xfstests, apparently.

Well, you could in theory do something tricky like just creating a sparse file of the correct size. This would avoid the wasted space of storing the zeroed-out data, which wouldn't be reusable anyway due to COW. It would unfortunately break the contract that you won't get ENOSPC, but you can't give that guarantee with COW anyway, and you would be less likely to hit it after using such an enhanced posix_fallocate() since it wouldn't be wasting space on the zeroed pages. Out of curiosity, would there be any difference in the final on-disk layout of a sparse file that is filled in vs. a file that is first allocated by zero-filling?
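
A minimal sketch of that idea (a hypothetical helper, not an existing ZFS mechanism): the "preallocation" is just a size update, and no data blocks are written until the first real write, which on a COW filesystem is where allocation happens anyway:

#include <unistd.h>

/* Make the file 'size' bytes long without allocating any data blocks. */
int make_sparse(int fd, off_t size)
{
        return ftruncate(fd, size);
}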

I work on MongoDB and we use posix_fallocate to quickly allocate large files that we can then mmap. It seems to be the quickest way to preallocate files and have a high probability of contiguous allocations (which, again, isn't possible due to COW). While I doubt anyone will try to run MongoDB on zfs-linux anytime soon (my interest in the project is for a home server), I just wanted to give feedback from a user-space developer's point of view.
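
As an illustration of that usage pattern (not MongoDB's actual code; the file name and sizes are made up), preallocate-then-mmap looks roughly like this:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const off_t len = 64 * 1024 * 1024;     /* 64 MiB data file */
        int fd = open("data.0", O_RDWR | O_CREAT, 0644);
        if (fd == -1) {
                perror("open");
                return 1;
        }
        /* reserve the whole file up front (falls back to writing zeros
         * on filesystems without native fallocate support) */
        int err = posix_fallocate(fd, 0, len);
        if (err != 0) {
                fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
                return 1;
        }
        void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memcpy(map, "record", 6);               /* write through the mapping */
        munmap(map, len);
        close(fd);
        return 0;
}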

ryao commented

Commit cb2d190 should have closed this.

I was leaving this issue open because the referenced commit only added support for FALLOC_FL_PUNCH_HOLE. There are still other fallocate flags which are not yet handled.

@dechamps this doesn't seem to be working on 3.6.x. Looking at your patch for this, it appears this is expected. Is there an update for recent kernels?

11570 open("holes", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0644) = 3
11570 write(3, "\252\252"..., 4194304) = 4194304
11570 fallocate(3, 03, 65536, 196608)   = -1 EOPNOTSUPP (Operation not supported)

3 = fd
03 = mode, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE

My patch only implements FALLOC_FL_PUNCH_HOLE alone, which is not a valid call to fallocate(). It never worked on any kernel, and will never work until someone implements FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE. Right now it's just a placeholder, basically.
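
For reference, once FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE is supported, the call from the strace above would look like this sketch (same offset and length values):

#define _GNU_SOURCE
#include <fcntl.h>      /* fallocate(), FALLOC_FL_* with _GNU_SOURCE */

/* Deallocate 196608 bytes starting at offset 65536, leaving the file
 * size unchanged; reads of the punched range then return zeros. */
int punch_hole(int fd)
{
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
            65536, 196608);
}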

@dechamps thanks for that clarification

even with FALLOC_FL_PUNCH_HOLE (only):

12260 open("holes", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0644) = 4
12260 write(4, "\252\252"..., 4194304) = 4194304
12260 fallocate(4, 02, 65536, 196608)   = -1 EOPNOTSUPP (Operation not supported)

02 = mode, FALLOC_FL_PUNCH_HOLE

RJVB commented

Apparently fallocate is still not supported on zfs?
But could that be the reason that I cannot seem to use fallocate at all on my systems that have a zfs root, not even on the /boot partition which is on a good ole ext3 slice?

@RJVB fallocate() for ZFS filesystems has not yet been implemented; however, this won't have any impact on an ext3 filesystem.

You can always use:
dd if=/dev/zero of=bigfile bs=1 count=0 seek=100G
Works immediately (it simply creates a sparse file of that size).

As of 0.6.4 the FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE behavior of fallocate(2) is supported. But as noted above, for a variety of reasons implementing a meaningful fallocate(2) to reserve space is problematic for a COW filesystem.

I'm using 0.7.2-1, and I noticed that if you run posix_fallocate on a file with the same size as the length specified, it returns with EBADF. This doesn't happen when I do it on tmpfs.

//usr/bin/env make -s "${0%.*}" && ./"${0%.*}" "$@"; s=$?; rm ./"${0%.*}"; exit $s

#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int main () {
  int fd = open("./randomfile", O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
  if (fd == -1) {
    perror("open()");
    return 1;
  }
  /* ask for the first 100 bytes of the file to be allocated */
  int status = posix_fallocate(fd, 0, 100);
  if (status != 0) {
    printf("%s\n", strerror(status));
  }
  close(fd);
  return 0;
}

Running the above on an empty or non-existent file works fine; as soon as you run it again, it fails with EBADF. This is a bit strange behaviour.

@CMCDragonkai that does seem odd. Can you please open a new issue with the above comment so we can track and fix it?

Is allocating disk space (mode = 0) supported now?
I notice fallocate still returns EOPNOTSUPP.

BTW, will fallocate generate less fragmentation than just truncate in random-write scenarios?

No, because ZFS's copy-on-write semantics just plain don't allow that.

@behlendorf While it is not possible (due to CoW) to have a fully working fallocate, it would be preferable to have at least a partially-working implementation: some applications[1] use fallocate to create very big files, and on filesystems without fallocate support this is a very slow operation. Granted that ZFS and its CoW defeat one of the main fallocate features (ie: to really reserve space in advance), paying for the slow (and SSD-wearing) "fill entire file with 0" behavior is also quite bad.

Would you consider implementing a "fake" fallocate, where fallocate returns success but no real allocation is done? After all, even after a "real" fallocate, reserved space is not guaranteed, as any snapshot can eat into the really available disk space.

[1] One such application is virt-manager: RAW disk images are, by default, fully fallocated. This, depending on disk size, means GBs or TBs of null data (zeroes) written to HDDs/SSDs.

RJVB commented

@RJVB On filesystems supporting fallocate, the filesystem reserves len/blocksize blocks and marks them as uninitialized. This has the following consequences:

  • as blocks are marked as reserved/allocated, the user space application which called fallocate is sure that sufficient space is available to write to all such blocks;

  • as no user data are written (and only some very terse metadata are flushed to disk), fallocate returns almost immediately, enabling very fast file allocations.

Point no. 1 (space reservation) is at odds with ZFS because, as a CoW filesystem, it by its very nature continuously allocates new data blocks while keeping track of past ones via snapshots. This means that you can't really count on fallocate to guarantee sufficient disk space to write all blocks, unless you tap into the reservation and/or quota properties. However, if I remember correctly, these properties only apply to an entire dataset, rather than to a single file.

And here comes point no. 2 - fast file allocation. On platforms where fallocate is not natively supported, both the user space application and the libc function can force a full file allocation by writing zeroes for its entire length. This is very slow, causes unnecessary wear on SSDs and is basically useless on ZFS. Hence my suggestion to always return "true" for fallocate, even when doing nothing (ie: faking a successful fallocate).
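
For context, that userspace zero-writing fallback amounts to something like the following (a simplified illustration of the idea, not glibc's exact algorithm):

#include <errno.h>
#include <unistd.h>

/* Force block allocation by touching one byte per block across the range.
 * Slow, SSD-unfriendly, and pointless on a compressing COW filesystem,
 * where the zeros are simply compressed away again. A real implementation
 * must also preserve any existing data in the range. */
int zero_fill_prealloc(int fd, off_t offset, off_t len, off_t blocksize)
{
        const char zero = 0;

        for (off_t pos = offset; pos < offset + len; pos += blocksize) {
                if (pwrite(fd, &zero, 1, pos) != 1)
                        return (errno != 0 ? errno : EIO);
        }
        return 0;
}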

Opinions are welcomed!

RJVB commented

@RJVB

And as to COW systems being at odds with space reservation: btrfs supports it and AFAIK that's a COW filesystem too

fallocate on BTRFS behaves differently than on non-CoW filesystems: while it really allocates blocks for the selected file, any rewrite (after the first write) triggers a new block allocation. This means that file fragmentation is only slightly reduced, and it can potentially expose some rough corners on a near-full filesystem.

Why is it basically useless? You still get the space reservation, no?

If you tap into the existing quota/reservation system (which, anyway, operates on datasets rather than single files), yes, you'll end up with a working space reservation. But if you only count on the fallocated reserved blocks, any snapshot pinning old data can effectively cause an out-of-space condition even when writing to fallocated files. Something like this:

  • fallocate a file writing all zeroes to it
  • create a snapshot
  • rewrite your fallocated file
  • new space is allocated, which can result in an ENOSPC condition.

My point is that doing the write in the driver might be more efficient. I don't disagree with your suggestion but software that nowadays falls back to the brute-force method because fallocate() fails might start behaving unexpectedly. Maybe a driver parameter that can be controlled at runtime could activate a low-level actual write-zeroes-to-disk implementation?

I really fail to see why a user space application should fail when presented with a sparse file rather than a preallocated one. However, as you suggest, simply let the option be user selectable. In short, while a BTRFS-like fallocate would be the ideal solution, even a fake (user-selectable) implementation would be desirable.

h1z1 commented

Hit this on one of our sites. Turned out a backend process was using rsync

rsync: do_fallocate "/tank/foo": Operation not supported (95)

        if (preallocate_files && fd != -1 && total_size > 0 && (!inplace || total_size > size_r)) {
                /* Try to preallocate enough space for file's eventual length.  Can
                 * reduce fragmentation on filesystems like ext4, xfs, and NTFS. */
                if ((preallocated_len = do_fallocate(fd, 0, total_size)) < 0)
                        rsyserr(FWARNING, errno, "do_fallocate %s", full_fname(fname));
        } else

There is a fallback to posix_fallocate that is, for whatever reason, not enabled in CentOS 7:

#elif defined HAVE_EFFICIENT_POSIX_FALLOCATE
        ret = posix_fallocate(fd, offset, length);
#else

FYI:
Yesterday I ran into the problem of running an ancient (parallel-SCSI ancient) statically linked archiving program that was fallocate'ing (mode=0) and, worse, falling back to posix_fallocate for the archives before filling them up, no questions asked. Because "it's faster" ...on a DEC Alpha.
Usually it uses a separate internal disk, but this time we tried to export the archives directly via NFS by the magic of bootable CDs, hoping NFS 4.2 would support native fallocate nowadays. The goal was to get the archives exported off the machine in one swoop on the cheap.
Well, it tanked. After debugging NFS up and down, I finally remembered that the target storage is running ZFS.

Thinking about possible workarounds, like preloading a library to send posix_fallocate flying (how to do that with a statically linked app?) or dd'ing the entire thing to run it in qemu, I took the following line to heart:

Hence my suggestion to always return "true" for fallocate, even when doing nothing (ie: faking a successful fallocate).

so with an evil chuckle I added the following lines right at the start of zpl_fallocate_common and ran it on a hastily assembled testbed:

if (mode == 0)
    return(0);

That can't possibly work, right? Would you know, that ancient piece of ... actually ran fabulously.
The alternative would have been to format the testbed ext4 and be done with it, but that would've been no fun.

Now I'm waiting for the pool to burn to a crisp and possibly explode in a violent fashion, and of course it wasn't reserving the file's space, but it sure was satisfying to see it work at all.

So in case there won't be a true fallocate solution in the near future, I'd vote for a module parameter to ignore/fake fallocate if you know what you're doing.

Btw. virt-manager also likes to fall back to posix_fallocate on NFS 4.2 over ZFS backends, taking forever to complete if the preallocation setting is accidentally left >0. I also tried it with the above hack, and it actually creates a decent sparse file, which was kinda surprising.

I just hit this on the upcoming Ubuntu Eoan 19.10 while creating a VM using virt-manager. I noticed it was super slow compared to my Bionic 18.04 system, so I checked and noticed that the qemu-img call was now using fallocate. The command line was something like this:

qemu-img create -f qcow2 -o preallocation=falloc with-prealloc-image.qcow2 40G

That on an ext4 system takes half a second. On a system with zfs (tried both 0.7.x and 0.8.x) it takes over a minute.

TL;DR: Make zpl_fallocate_common() return 0 (success) when mode == 0, at least when compression is enabled.

Since this issue was opened, there are several new modes to fallocate(2). FALLOC_FL_COLLAPSE_RANGE and FALLOC_FL_INSERT_RANGE are probably the hardest to implement. They are not the point of my comment today, so I'm not going to discuss those further in this comment.

FALLOC_FL_ZERO_RANGE sounds the same as FALLOC_FL_PUNCH_HOLE except that it leaves a preallocation in place (and FALLOC_FL_KEEP_SIZE is optional). If my proposal is adopted, it is likely that FALLOC_FL_ZERO_RANGE could be handled the same as FALLOC_FL_PUNCH_HOLE (with the caveat about FALLOC_FL_KEEP_SIZE). This is not the main point of my comment, so I won't discuss this further in this comment either.

I'm mainly here to discuss the mode == 0 case for pre-allocating disk space. Currently, ZFS returns EOPNOTSUPP in this case. An application has three choices of how to handle this (sketched in code after the list):

  • FAIL: The application fails, cleanly or uncleanly. Real (i.e. non-test) applications don't do this, as fallocate() is not implemented on anywhere near all filesystems. Therefore, we can ignore this scenario.
  • IGNORE: The application ignores the failure. It is treating fallocate() as a hint.
  • ZERO: The application falls back to pre-allocating the space by writing data. In practice, the data will always be all zeros, because that is what a filesystem would return if the fallocate() was successful. This is the behavior implemented by glibc's posix_fallocate() wrapper.
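
As a rough sketch of these strategies (a hypothetical helper, not taken from any real application or from glibc's internals):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Preallocate 'len' bytes; 'best_effort' selects the IGNORE strategy,
 * otherwise fall back to the ZERO strategy via posix_fallocate(). */
int preallocate(int fd, off_t len, int best_effort)
{
        if (fallocate(fd, 0, 0, len) == 0)
                return 0;
        if (errno != EOPNOTSUPP && errno != ENOSYS)
                return -1;              /* FAIL: a real error occurred */
        if (best_effort)
                return 0;               /* IGNORE: the call was only a hint */
        /* ZERO: glibc emulates the preallocation by writing data */
        return (posix_fallocate(fd, 0, len) == 0 ? 0 : -1);
}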

I'm proposing that ZFS fake the pre-allocation. I know that lying to userspace feels wrong, but...

For applications in the IGNORE case, faking the pre-allocation is irrelevant. They gave a hint and would ignore the failure anyway. So there is absolutely no change in that scenario.

The scenario that changes is the ZERO case. In the ZERO case, the application will then fall back to writing zeros, which is expensive. When ZFS has compression enabled, those zeros will immediately be thrown away, so the application isn't achieving anything at all with the fallback. ZFS performance suffers dramatically relative to other filesystems, leading to complaints e.g. with virtual machine disk allocation.

Thus, at a minimum, there should be zero downside to faking the pre-allocation when compression is turned on. I propose we do at least that.

When compression is off, the pre-allocation is still worthless at best and counterproductive at worst. The pre-allocation is intended to increase performance (by allocating a contiguous range of blocks, or a contiguous extent) and/or guarantee that writes will not fail with ENOSPC. With ZFS, the attempt to allocate contiguous space is pointless, as it will not improve the placement of future (over)writes. Likewise, writing data does not guarantee that (over)writes can succeed, and in fact may be counterproductive, as it can lead to the filesystem running out of space/quota if the zeros are retained in a snapshot.

The only reason I can see for not faking the preallocation in the non-compression case is technical correctness. If we really want to maintain traditional semantics, even though they're useless or even counterproductive, then we can limit the fake preallocation to the compression case. But if it were my call, I'd fake it in both cases. I believe that is the pragmatic solution.

If there is some dream of eventually supporting real preallocation in some way, that's fine. But let's not let perfect and someday be the enemy of good enough right now. We can always replace the fake preallocation with real preallocation if someone figures out a way to do it in ZFS.

@rlaager you make an excellent case. While I'd rather not lie to user space, I do have to agree that it is the pragmatic thing to do for the mode == 0 case. Not supporting this functionality has been causing more harm than good. If someone is interested in authoring a PR for this I'd be happy to help with any design work and code review.

@rlaager I 100% agree with your proposal, which is a more detailed version of what I proposed here. The same dilemma applies: should we lie to userspace? In this specific case (and under the specific constraints you described) I think the answer is "yes".

@behlendorf what do you think about having a zfs property or module option to control the new fallocate-faking behavior?

@rlaager it probably makes sense to go further and do nothing for fallocate(mode=0) regardless of whether compression is enabled or not, with a fallback of a per-fileset tunable (default enabled) to control whether fallocate() will fake the preallocation or return EOPNOTSUPP in case someone finds a strange corner case that is harmed by this behavior.

My reasoning is that even if there were a mechanism to have fallocate() reserve space for that file in the pool for preallocated blocks, this would only work for the first write. After the first write to the file, the "space reservation" aspect would be gone and any subsequent overwrite would be no different than if fallocate() had done nothing at all (beyond modifying the file size if FALLOC_FL_KEEP_SIZE is not specified).

The other main reason to use fallocate(mode=0) is to avoid fragmentation of allocated blocks in the file. However, since ZFS is COW there is again no benefit to having the blocks preallocated, since overwriting them will again result in new block allocations.

It probably makes sense to return ENOSPC if the fallocate(mode=0) size is larger than the amount of free space in the filesystem (or maybe 95%) to avoid cases where someone tries to reserve a 1TB file in a 100GB pool. That is probably the best that could be done, given that even elaborate space reservations in the pool tied to a specific file would fail if one or more snapshots is pinning the reserved space.
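
In userspace terms, the suggested sanity check amounts to something like the sketch below (statvfs() here stands in for whatever in-kernel space accounting ZFS would actually use):

#include <errno.h>
#include <sys/statvfs.h>
#include <sys/types.h>

/* Reject a mode-0 preallocation that obviously cannot fit in the
 * currently available space (snapshots can still invalidate this later). */
int check_prealloc_space(const char *path, off_t len)
{
        struct statvfs st;

        if (statvfs(path, &st) != 0)
                return errno;
        unsigned long long avail = (unsigned long long)st.f_bavail * st.f_frsize;
        if ((unsigned long long)len > avail)
                return ENOSPC;
        return 0;
}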

I'd also agree that FALLOC_FL_ZERO_RANGE is not really different from PUNCH_HOLE for ZFS, so it could easily be added. I don't know of any users of INSERT_RANGE and COLLAPSE_RANGE, and they are complex to implement, so they probably do not deserve much attention.

with a fallback of a per-fileset tunable (default enabled) to control whether fallocate() will fake the preallocation or return EOPNOTSUPP in case someone finds a strange corner case that is harmed by this behavior.

I recommend against a tunable to turn it off. If we add a tunable, it's really, really hard to know that nobody is using it, and thus really, really hard to remove it. If we wait until a problem presents itself, we can solve it then, by adding a tunable or something else (depending on the problem). Also, a tunable requires more work and more test cases (to ensure it actually works).

If we do add a tunable, I'd suggest a module parameter, as those present less of a user-visible compatibility concern, at least in my mind, than a filesystem parameter.

It probably makes sense to return ENOSPC if the fallocate(mode=0) size is larger than the amount of free space in the filesystem (or maybe 95%) to avoid cases where someone tries to reserve a 1TB file in a 100GB pool.

Yes, that makes sense. I hadn't considered that. Thanks!

CAFxX commented

My reasoning is that even if there were a mechanism to have fallocate() reserve space for that file in the pool for preallocated blocks, this would only work for the first write. After the first write to the file, the "space reservation" aspect would be gone and any subsequent overwrite would be no different than if fallocate() had done nothing at all (beyond modifying the file size if FALLOC_FL_KEEP_SIZE is not specified).

Well, unless userspace calls fallocate() before subsequent writes. I think there's an argument to be made about actually reserving free space for the first write, so as to not lie to userspace.

@CAFxX I think it would be great to leave the implementation as simple as possible. On ZFS, even reserving space for the first write will be practically useless, as snapshots can "eat" into the available space without notice.

@rlaager what is the general logic for a module option vs. a zfs property? As it stands now, it seems to me that we have some confusion between the two (eg: caching is controlled via a property, while prefetch via a module option).

CAFxX commented

On ZFS, even reserving space for the first write will be practically useless, as snapshots can "eat" into the available space without notice.

This is definitely the case right now, but I would expect that a "real" reservation of free space - if properly implemented - wouldn't allow snapshots to consume the reserved space.

I understand the desire to keep things simple, at least at the beginning, but lying to userspace is a pretty slippery slope, especially if the argument goes "we're already lying now in other cases, we may as well lie in this one".

Moreover, the argument that "we can add it later if someone figures out how to do it in ZFS" is pretty bogus, as at that point the API contract is broken and some user most likely will have come to depend on the broken contract, so you will likely need to bolt on the correct behavior via a different API altogether. To be pragmatic, this wouldn't apply if the "broken" behavior was non-default, behind some configuration.

@CAFxX I understand your concerns.

However, due to snapshots/subvolumes, the API contract is already broken on CoW filesystems (even on BTRFS, which actually honors preallocation for the first write): posix_fallocate has a "write-all-zeroes" fallback, which is not sufficient to guarantee that free space will always exist for that very same preallocation (especially when compression is on).

So, at the moment, we have the worst of both worlds:

  • preallocating is very slow (and can cause unnecessary wear on SSD if compression=off);
  • but no preallocation is really done!

On ZFS, if one really wants to preallocate some space (ie: to be sure there will always be enough free space to honor the preallocation), they must use the quota and/or reservation/refreservation properties.

CAFxX commented

I understand and I'm not saying we should not do this. I am wholeheartedly in favor of having this as the non-default behavior, so we get the best of all worlds:

  • a workaround now (by opting-in to the broken behavior) for users that need it
  • a reliable path forward (by not further tainting public APIs with incompatible behaviors)

the argument that "we can add it later if someone figures out how to do it in ZFS" is pretty bogus, as at that point the API contract is broken and some user most likely will have come to depend on the broken contract

That's a risk with API contracts in general, but not likely in this case.

I'm not sure how an application would come to rely on ZFS faking fallocate(2) pre-allocation, especially in a world where other (far more popular) filesystems actually implement pre-allocation. If an application would break once ZFS truly implemented fallocate(2) preallocation, it would already be broken on ext4 and XFS today, and those are the default filesystems in major distros.

IMHO ZFS should collapse full-block zero writes into a hole anyway, always, regardless of whether compression is on or not, simply because I see no point in storing null data as an on-disk block in a CoW system.

For fallocate to reserve space, it would need to be turned into an increase of the filesystem's reservation/refreservation properties. This is a can of worms that I wouldn't like to open, as it implies an interesting decision tree revolving around what magic needs to be attached to that file to reduce the reservation in case of truncation, deletion (or whatever else might make sense in this context). While magic might be useful at times... I would guess adding this would just increase support requests without adding any benefit to the user.

Failing the fallocate in case there isn't enough zfs AVAIL sounds like a reasonable plan.

RJVB commented

Interesting.

If I may: consider what happens in the case of another filesystem (ie ext4) on a zvol, whether directly or indirectly (in an HVM); I would consider the semantics of doing a native fallocate in that case to be correct.

@von-copec We are only talking about fallocate() in the ZFS POSIX layer (ZPL) filesystems. If someone puts ext4 on top of a zvol or a file, then ext4 still behaves exactly as it always has.

@von-copec We are only talking about fallocate() in the ZFS POSIX layer (ZPL) filesystems. If someone puts ext4 on top of a zvol or a file, then ext4 still behaves exactly as it always has.

I understand; I was attempting to emphasize that the behavior of another filesystem layered on top of a (sparse) ZVOL would be considered correct, and so the ZPL doing the same thing would have the "same amount of correctness".

An update on this topic. In the course of implementing fallocate(mode=0) for Lustre-on-ZFS (https://review.whamcloud.com/36506), the dmu_prealloc() function was being used to implement the preallocation. While this patch isn't working yet, it is informative on this topic. The dmu_prealloc() function has been in ZFS for a long time, but is only used on Illumos to preallocate space in a ZVOL for a core dump, so that the core dump can later be written directly into the ZVOL blocks without invoking ZFS, in case ZFS is itself broken by the crash:

zvol_dump_init->zvol_prealloc->dmu_prealloc->dmu_buf_will_not_fill()

This sets DB_NOFILL on every dbuf. The interesting thing is that dmu_prealloc() actually preallocates the blocks on disk, and the DB_NOFILL->WP_NOFILL results in the leaf (data) blocks being marked in dmu_write_policy() with ZIO_COMPRESS_OFF and ZIO_CHECKSUM_OFF. This is essentially what fallocate(mode=0) wants, namely to have reserved space that is not compressed and can be overwritten (at least once, anyway) without running out of space.

Several open questions exist, since there is absolutely no documentation anywhere about this code:

  • what does dmu_prealloc() do to blocks that were previously allocated? fallocate(mode=0) must not modify existing blocks, only allocate new blocks.
  • the DB_NOFILL flag appears to make reads of these buffers return zero, which seems correct, so long as it is cleared when the blocks are overwritten; otherwise normal writes could not be read back, which would not be good.
  • can the DB_NOFILL buffers be overwritten by normal DMU writes, clearing the DB_NOFILL state?
  • does the ZIO_CHECKSUM_OFF flag persist if the block is overwritten via normal DMU IO? That would be unfortunate, as it means dmu_prealloc() blocks would not be safe for user data, but could be fixed with a new WP_* flag.

This seems to be a path toward implementing fully-featured ZFS fallocate(mode=0), possibly with some digging in the guts of the code if the semantics are not quite as needed.

If this doesn't work out, it still seems practical to go the easy route, for which I've made a simple patch that implements what was previously described here and could hopefully be landed with a minimum of fuss. I don't have any idea how long it would take the dmu_prealloc() approach to finish, but it would need the changes in my patch anyway.

RJVB commented

Several open questions exist, since there is absolutely no documentation anywhere about this code:
Probably an open door, but have you tried to answer your questions by poking around in an Illumos implementation?

Yes, the Illumos implementation references this function exactly once, in the code path referenced above, but no actual comments exist in the code that describe these functions.

RJVB commented

@adilger Am I right that this preallocation would use the preallocated blocks for the first write only? If so, this seems somewhat similar to the BTRFS approach, and I am missing why an application (Lustre, in this case) should expect fallocate to be really honored, considering that:

  • a snapshot can easily eat into the pool free space, leading to ENOSPC even if prellocation was successful

  • rewriting the file will cause ongoing fragmentation, negating (over time) any performance benefit from the previous allocation

Disabling compression and checksums seems way too high a price to pay for the very limited benefit (if any) which can be obtained by "true" preallocation on ZFS.

Considering how posix_fallocate simply writes zeroes on a filesystem not supporting fallocate, and that these zeroes would be converted to a sparse file if compression is enabled, I would simply suggest to create a module option/pool property/flag "faking" true fallocate (returning 0 but ignoring the operation entirely). I know this sounds bad because it breaks the "contract" of the fallocate API itself; however, no such guarantee exists for a compressing, CoW filesystem (which seems similar to what you are proposing here, right? #10408)

@shodanshok, I understand and agree that all of those issues exist.

Lustre is a distributed parallel filesystem that layers on top of ZFS, so it isn't the thing that is generating the fallocate() request. It is merely passing on the fallocate() request from a higher-level application down to ZFS, after possibly remapping the arguments appropriately.

"I would simply suggest to create a module option/pool property/flag
"faking" true fallocate (returning 0 but ignoring the operation entirely)"

I've essentially done exactly that with my PR#10408. However, while this probably works fine for a large majority of use cases, it would fail if, e.g., an application is trying to fallocate multiple files in advance of writing, or in parallel, but there is not actually enough free space in the filesystem. In that case, each individual fallocate() call would verify that enough space is available, but the aggregate of those calls is not available. Fixing this would need "write once" semantics for reserved blocks (similar to what dmu_prealloc() provides), or at least an in-memory grant that reserves space from statfs() for the dnode and is released as dbufs are written to the dnode. That would at least avoid obvious multi-file issues, but not prevent other writers from consuming this space. It would also get tricky with fallocate() over a non-sparse file, with overlapping writes, etc., so it would not be the preferred solution IMHO.

Closing. As discussed above basic fallocate() support was added by #10408.

This bug is fixed in MariaDB 10.1.48, 10.2.35, 10.3.26, 10.4.16, and 10.5.7 by MariaDB Pull Request #1658, i.e. by adding fallback logic for the EOPNOTSUPP error code.