openzfs/zfs

Direct IO

behlendorf opened this issue · 59 comments

The direct IO handlers have not yet been implemented. Supporting direct IO would have been a problem a few years back because of how ZFS copies everything into the ARC. However, ZFS recently started supporting a zero-copy interface which we may be able to leverage for direct IO support.

Hmm. Why not do it this way: let O_DIRECT always succeed? Does it matter that ZFS copies everything into the ARC? Let's fake it a bit for the OS. It shouldn't hurt that much... Oh, and that is just my freak idea.

Unable to start mysqld with InnoDB databases living in a ZFS dataset. Is this related to this issue?

Using ppa:zfs-native/stable on Precise using Quantal kernel.

Here is the info for the system and dataset, followed by a log snippet from /var/log/syslog

root@HumanFish:/# uname -a
Linux HumanFish.net 3.5.0-17-generic #28-Ubuntu SMP Tue Oct 9 19:31:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

root@HumanFish:/# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.1 LTS
Release: 12.04
Codename: precise

root@HumanFish:/# zfs get all zpool/mysql
NAME PROPERTY VALUE SOURCE
zpool/mysql type filesystem -
zpool/mysql creation Mon Oct 29 9:44 2012 -
zpool/mysql used 71M -
zpool/mysql available 204G -
zpool/mysql referenced 71M -
zpool/mysql compressratio 3.38x -
zpool/mysql mounted yes -
zpool/mysql quota none default
zpool/mysql reservation none default
zpool/mysql recordsize 128K default
zpool/mysql mountpoint /var/lib/mysql local
zpool/mysql sharenfs off default
zpool/mysql checksum on default
zpool/mysql compression lzjb local
zpool/mysql atime on default
zpool/mysql devices on default
zpool/mysql exec on default
zpool/mysql setuid on default
zpool/mysql readonly off default
zpool/mysql zoned off default
zpool/mysql snapdir hidden default
zpool/mysql aclinherit restricted default
zpool/mysql canmount on default
zpool/mysql xattr on default
zpool/mysql copies 2 local
zpool/mysql version 5 -
zpool/mysql utf8only off -
zpool/mysql normalization none -
zpool/mysql casesensitivity sensitive -
zpool/mysql vscan off default
zpool/mysql nbmand off default
zpool/mysql sharesmb off default
zpool/mysql refquota none default
zpool/mysql refreservation none default
zpool/mysql primarycache all default
zpool/mysql secondarycache all default
zpool/mysql usedbysnapshots 0 -
zpool/mysql usedbydataset 71M -
zpool/mysql usedbychildren 0 -
zpool/mysql usedbyrefreservation 0 -
zpool/mysql logbias latency default
zpool/mysql dedup off default
zpool/mysql mlslabel none default
zpool/mysql sync standard default
zpool/mysql refcompressratio 3.38x -
zpool/mysql written 71M -

Oct 29 09:45:37 HumanFish mysqld_safe: Starting mysqld daemon with databases from /var/lib/mysql
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: The InnoDB memory heap is disabled
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Mutexes and rw_locks use GCC atomic builtins
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Compressed tables use zlib 1.2.3.4
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Using Linux native AIO
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Initializing buffer pool, size = 256.0M
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Completed initialization of buffer pool
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Failed to set O_DIRECT on file ./ibdata1: OPEN: Invalid argument, continuing anyway
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: O_DIRECT is known to result in 'Invalid argument' on Linux on tmpfs, see MySQL Bug#26662
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Failed to set O_DIRECT on file ./ibdata1: OPEN: Invalid argument, continuing anyway
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: O_DIRECT is known to result in 'Invalid argument' on Linux on tmpfs, see MySQL Bug#26662
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: highest supported file format is Barracuda.
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Operating system error number 22 in a file operation.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Error number 22 means 'Invalid argument'.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Some operating system error numbers are described at
Oct 29 09:45:37 HumanFish mysqld: InnoDB: http://dev.mysql.com/doc/refman/5.5/en/operating-system-error-codes.html
Oct 29 09:45:37 HumanFish mysqld: InnoDB: File name ./ib_logfile0
Oct 29 09:45:37 HumanFish mysqld: InnoDB: File operation call: 'aio write'.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Cannot continue operation.
Oct 29 09:45:37 HumanFish mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: 0 processes alive and '/usr/bin/mysqladmin --defaults-file=/etc/mysql/debian.cnf ping' resulted in
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: #7/usr/bin/mysqladmin: connect to server at 'localhost' failed
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: error: 'Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)'
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: Check that mysqld is running and that the socket: '/var/run/mysqld/mysqld.sock' exists!

@uejji I'm no MySQL expert, but this is more related to #223. We don't yet support AIO; most applications in this situation fall back to the normal I/O syscalls.

@behlendorf I see. The errors about O_DIRECT in the log led me here through a Google search. I'll watch that issue in the meantime.

Thanks.

@behlendorf The innodb_use_native_aio option didn't exist by default in my.cnf, but adding it manually worked fine.

Thanks for locating the workaround for me. I guess the eventual goal will be that it's no longer necessary.

Any news about O_DIRECT support?

It can't be honored as ZFS double-buffers. O_DIRECT makes no sense anyway. O_SYNC is a better way.

Well, then a flag which allows 'ignoring' O_DIRECT requests (without failing) could be a plus in some situations.

I know this can be dangerous in some situations, but there are others where it can be accepted, and non-advanced users can be notified by emitting some kind of warning when such a flag is set.

Another option would be providing a flag with three options (ignore, dsync, sync), which would mean:

  • ignore => Simply ignore O_DIRECT flag and just perform as with standard reqs.
  • dsync => Assume O_DIRECT == O_DSYNC.
  • sync => Assume O_DIRECT == O_SYNC.
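
For illustration, the three-way mapping proposed above can be sketched in userspace as a pure function on open(2) flags. This is only a sketch of the idea, not anything ZFS implements: the enum and the function name `remap_o_direct` are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* O_DIRECT is only exposed with _GNU_SOURCE on glibc */
#endif
#include <fcntl.h>

/* Hypothetical policy names mirroring the proposal; not an existing ZFS property. */
enum odirect_policy { ODIRECT_IGNORE, ODIRECT_DSYNC, ODIRECT_SYNC };

/* Return the flags to actually pass to open(2) under the given policy. */
static int remap_o_direct(int flags, enum odirect_policy policy)
{
        if (!(flags & O_DIRECT))
                return flags;           /* nothing to remap */

        flags &= ~O_DIRECT;             /* never forward O_DIRECT itself */
        switch (policy) {
        case ODIRECT_DSYNC:
                flags |= O_DSYNC;       /* treat O_DIRECT as O_DSYNC */
                break;
        case ODIRECT_SYNC:
                flags |= O_SYNC;        /* treat O_DIRECT as O_SYNC */
                break;
        case ODIRECT_IGNORE:
                break;                  /* plain buffered I/O */
        }
        return flags;
}
```

With `ODIRECT_IGNORE`, a request such as `O_RDWR | O_DIRECT` becomes a plain `O_RDWR` open; the other two policies add the corresponding sync flag in place of O_DIRECT.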

Greets

Making the behavior of O_DIRECT configurable with a property sounds like it may be a reasonable approach. However, we should be careful not to muddle the meaning of O_DIRECT.

The O_DIRECT flag only indicates that all the kernel caching should be bypassed. Data should be transferred directly to or from the user space process to the physical device. Unlike O_SYNC it makes no guarantees about the durability of the data on disk.

Given those requirements I could see a property which allows the following behavior:

  • disable => Reject O_DIRECT, since it is not strictly supported.
  • ignore => Simply ignore O_DIRECT flag and just perform as with standard reqs.
  • enable => Never cache these blocks in the ARC. We can't avoid copies which might be made in the pipeline but we can disable the caching.

That sounds pretty neat, and would allow some scenarios not supported right now, even with their own tradeoffs. ;)

Newer versions of virt-manager want to use cache=none as the default for qemu virtual images, which in turn means qemu tries to use O_DIRECT and libvirt throws errors.
The error messages will confuse most users who are not aware that ZoL doesn't support O_DIRECT yet.
+1 for any kind of solution

๐Ÿ‘ For this.. I've been experimenting with oVirt as a virtualization manager and I'd love to use ZFS for it's data stores, but as far as I understand, I can't add it as a local data store due to this issue.

It's rather better than "silent ignore O_DIRECT".

After investigating what it will take to support this I'm bumping this functionality from the 0.6.4 tag. To add this functionality we must implement the address_space_operations.direct_IO callback for the ZPL. This will allow us to pin in memory the pages for IO which have been passed by the application. IO can then be performed directly to those pages. This will require us to add an additional interface to the DMU which accepts a struct iov_iter. While this work isn't particularly difficult, it's also not critical functionality and we don't want it to hold up the next release.

ryao commented

@behlendorf We can not just pin the user pages. We also need to mark them CoW so that userland cannot modify them as they are being read. Otherwise, we risk writing incorrect checksums. In the case of compression, userland modification of the pages while the compression algorithm is run would result in undefined behavior and might pose a security risk.

That said, I have a commit that implements O_DIRECT by mapping it to userspace here:

a08c76a

It was written after a user asked for the patch and it is not meant to be merged, but the commit message has a discussion of what O_DIRECT actually means that I will reproduce below:

DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS because there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encounter another hurdle. In
specific, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Mac OS X does not
implement O_DIRECT, but it does implement F_NOCACHE, which is similar
to #2 in that it prevents new data from being cached. AIX relaxes #3 by
only committing the file data to disk. Metadata updates required should
the operations make the file larger are asynchronous unless O_DSYNC is
specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need to be
finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html
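
To make point 1 of the quoted commit message concrete, here is a small sketch of what "aligned to the sector size" means for a caller on a conventional filesystem: the buffer address, the file offset, and the length all have to be multiples of the sector size. The helper names are hypothetical, and the sector sizes used with them (512 or 4096) are assumptions; a real program would query the device.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Return 1 if buffer address, offset and length are all multiples of `sector`. */
static int io_is_aligned(const void *buf, off_t offset, size_t len, size_t sector)
{
        return ((uintptr_t)buf % sector == 0) &&
            ((uint64_t)offset % sector == 0) &&
            (len % sector == 0);
}

/*
 * Allocate `len` bytes aligned to `sector`. posix_memalign() requires
 * `sector` to be a power of two and a multiple of sizeof(void *).
 * Returns NULL on failure; release the buffer with free().
 */
static void *alloc_aligned(size_t sector, size_t len)
{
        void *p = NULL;

        if (posix_memalign(&p, sector, len) != 0)
                return NULL;
        return p;
}
```

A buffer obtained from plain malloc() is not guaranteed to pass this check, which is one reason O_DIRECT I/O returns EINVAL on filesystems that enforce the requirement.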

@ryao thanks for the writeup. I use ZFS zvols as backing storage for qemu, and am also vaguely familiar with how databases perform IO. Mapping direct_IO to AIO is definitely a good first step, but it would be great if both 2. and 3. (eventually) received attention as well.

Regarding 2. COW seems definitely like a good direction. I would also expect lower memory utilisation and possibly other performance gains from O_DIRECT if (and only if) compression is not enabled.

Regarding 3. that's an interesting one. One can rather trivially (although not cheaply) increase IO subsystem performance by attaching an NVMe PCIe-backed SLOG device; it would be great if the ZIL could be used (if configured so, which would need an extra option) as primary backing storage mapped to O_DIRECT rather than indirect logging. This would help to preserve the benefits of a fast SLOG device, i.e. very low latency of synchronous writes, while at the same time guaranteeing data safety and low memory utilisation (the primary goals of O_DIRECT in the scenarios I am familiar with).

As a preliminary yet generic relief I have written an interposer to either map O_DIRECT to O_DSYNC or just ignore it. As most of the source infrastructure required was already there, I integrated it into https://code.uplex.de/liblongpath - ignoring, for the time being, that the main purpose of the liblongpath project was quite different.
The relevant commit is https://code.uplex.de/liblongpath/liblongpath/commit/2e46a921ce2b6b1caa56d39cbd58be85c5988bd0
The commit message contains basic usage info, I have not (yet) added any other documentation for this feature.

ryao commented

@nigoroll Most Linux filesystem drivers, including ZoL, treat O_SYNC and O_DSYNC the same, so that will not make much difference here.

You can get the indirect logging on all I/Os (O_DIRECT or not) that I mentioned by setting logbias=throughput on the dataset.

@ryao, I fail to understand how your comment relates to the interposer I have written. Its purpose is to provide relief where O_DIRECT cannot be avoided and open() calls returning with EINVAL break applications (which may even be closed source). Basically this implements what @pruiz suggested, but on the level of an interposer library.

ryao commented

@nigoroll I was thinking of this from the perspective of performance, where software using O_DIRECT almost always uses O_SYNC, so it does not improve things over the patch to ZoL to ignore O_DIRECT by mapping it to AIO. It makes more sense when thinking about software compatibility.

Thanks for writing that library.

@ryao, what would you recommend as a best practice for now:

comment or remove the innodb_flush_method variable in /etc/mysql/my.cnf?

On MariaDB this would use fsync() to flush data and logs.

Or should O_DSYNC be used?

https://mariadb.com/kb/en/mariadb/xtradbinnodb-server-system-variables/#innodb_flush_method
O_DSYNC - O_DSYNC is used to open and flush logs, and fsync() to flush the data files.

Values for this setting include:

  • O_DSYNC
  • O_DIRECT
  • fdatasync
  • O_DIRECT_NO_FSYNC

ryao commented

@azeemism fdatasync is the best option for MariaDB on ZFS right now. O_DSYNC is the equivalent of calling fdatasync after each and every write operation, while neither O_DIRECT nor O_DIRECT_NO_FSYNC should work on ZFS unless you patch it to implement the ->direct_IO address space operation. At best, patching ZFS to add it would have no benefit in production over using fdatasync; at worst, it would render MariaDB crash-unsafe.

@azeemism @ryao MariaDB 10.1 throws "[ERROR] InnoDB: Unrecognized value fdatasync for innodb_flush_method" so no more fdatasync. Interestingly the documentation still mentions it: https://mariadb.com/kb/en/mariadb/xtradbinnodb-server-system-variables/#innodb_flush_method

@ryao How much work would it take (on top of your commit earlier) to make O_DIRECT imply primarycache=metadata semantics?

Can anyone comment on the first post in this issue?
It looked into the double buffering matter and hinted it might not be a problem anymore after 2011. Like maci0 said, the other ways seem really crude.
And I'd rather have it disabled like it is now than mapped to fdatasync.

Application: "please don't buffer this at all, I'm trying to optimize here while keeping data safe"
FS: "Sure, I'll not buffer this at all. I'll not slow you down and I'll not lie to you about when the data is on disk"
FS turns around and says "yeah, we'll just flush when application does flushes, heh, it can't be worse than ext3, right?!"

@FlorianHeigl many comments here relate to the initial note; just take the time to read them. Things are not that easy. If you want to understand why, I'd recommend @ryao's comment from 23 Jul 2015.

+1. libvirt won't run with those flags:

<disk type='file' device='disk'>
  <driver name='qemu' type='raw' cache='none' io='native'/>

@mcr-ksh although that combination of cache and io is optimal, you can still get libvirt to use volumes hosted on zfs by either not defining cache and io, or by selecting io='threads' and an appropriate cache policy

It would be really great if we could get this simple sounding fix (create a shim to AIO) of @ryao:

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If this is only a simple hack that does not imply the drawbacks he mentioned in 1-4, could it be implemented and e.g. activated by a filesystem property of some sort?

I also have problems adding the cache=none parameter to libvirt XML.
Please add direct IO support.

@kpande I can confirm it works, although performance is not great compared to cache=writeback io=threads. However, I thought that cache=directsync io=native is the one to use for direct IO?

@kpande Yes, but I'm not using ZVOL. It's just the RAW file which is stored in the dataset.

@Bronek According to the docs, directsync is described as follows: This mode causes qemu-kvm to interact with the disk image file or block device with both O_DSYNC and O_DIRECT semantics, where writes are reported as completed only when the data has been committed to the storage device, and when it is also desirable to bypass the host page cache. Like cache=writethrough, it is helpful to guests that do not send flushes when needed. It was the last cache mode added, completing the possible combinations of caching and direct access semantics.

I think you're right. Seems like it is very similar to writeback, except for the performance impact.

@Vringe Don't store RAW images as files on ZFS. There is no benefit and it is not as fast as it could be. A ZVOL is perfect for that: you can snapshot one VM at a time and you have a constant (vol)block size.

@lnxbil Each VM has its own dataset, so I can easily create snapshots. During my performance tests, I found that ZVOLs do not really perform better than datasets. There were also some really weird problems I had with ZVOLs. The environment is running well. I just want to use swap on the guest machines instead of using the host's cache.

@Vringe Yet you cannot use TRIM, and therefore thin provisioning is less efficient. You can also get bad write amplification with a dataset if you have not set the recordsize properly or if your emulation layer has a different access pattern. A zvol will always use the volblocksize.

That is why I have not been able to use oVirt with a Local Storage configuration for years: it wants DIRECT_IO.

use zvol, O_DIRECT works fine there.

Do you use oVirt?

Any news on real O_DIRECT support now that the scatter-gather list work is merged? Just curious, thanks.

@jumbi77 I think we have to consider what O_DIRECT actually means for ZFS ZPL.

Historically, for other filesystems it more or less meant that the data was transferred from the disks directly into the userspace buffers without any intermediate buffering, but with newer storage stacks that mapping isn't possible (this is true for non-ZFS filesystems in many cases too).

@behlendorf I suggest O_DIRECT really means O_SYNC with "as little buffering as possible".

I think it means "as little buffering as possible". If I want O_SYNC, I'll say O_SYNC (instead of or in addition to O_DIRECT). The open(2) man page on Linux explicitly says that it doesn't guarantee the same semantics as O_SYNC and you need to pass both if you want both.

We are using ZFS on all of our production machines (mostly Solaris and Linux) and our backup strategy is based upon ZFS snapshots.

So far we only used Oracle Databases on Solaris machines and Oracle runs just fine on ZFS. There's even a Sun-whitepaper with information about optimal ZFS-configuration for Oracle databases.

Unfortunately Oracle fails under Linux if Archive Log files are stored within a ZFS volume and I would not mention this here if Direct IO wasn't the culprit. Here's a single line from strace-output:

open("/var/oracle/diag/rdbms/b1/B1/metadata/ADR_INTERNAL.mif", O_RDONLY|O_DIRECT) = -1 EINVAL (Invalid argument)

I'm aware of the following 3 possible solutions:

  • adding pseudo direct IO support to ZFS
  • ignoring O_DIRECT within ZFS
  • ignoring O_DIRECT outside of ZFS

I tried to ignore O_DIRECT within ZFS first and my idea was to find the line of source where ZFS refuses to open a file with O_DIRECT-flag. But searching within the ZFS source code for O_DIRECT resulted in almost nothing. Seems like the ZFS-software does not reject open() calls with O_DIRECT but the VFS layer knows that ZFS is lacking DirectIO-support and therefore rejects open() calls with O_DIRECT.

Is that correct? Can I patch my kernel such that VFS does ignore O_DIRECT for ZFS filesystems?
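
For anyone who wants to check a given mount without running strace, a small probe along these lines reproduces the failing call. It is only a sketch: the function name `probe_o_direct` is made up for this example, and it reports what open(2) returns without saying which layer rejected the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* O_DIRECT is only exposed with _GNU_SOURCE on glibc */
#endif
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Try to open an existing file read-only with O_DIRECT.
 * Returns 1 if the open succeeds, 0 if it fails with EINVAL
 * (the error seen in the strace line above), -1 on any other error.
 */
static int probe_o_direct(const char *path)
{
        int fd = open(path, O_RDONLY | O_DIRECT);

        if (fd >= 0) {
                close(fd);
                return 1;
        }
        return (errno == EINVAL) ? 0 : -1;
}
```

Calling `probe_o_direct()` on a file in the dataset in question should return 0 on a system showing the behavior described here, and 1 on a filesystem that accepts O_DIRECT.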

So I tried to add direct IO support to ZFS. Have a look at this patch from 2015. It does not work with current ZFS and I doubt that it has worked with earlier versions (there are unbalanced #if-/#endif-lines and int rw = iov_iter_rw(iter); uses the undeclared variable iter).

But the idea should work: Adding a zpl_direct_IO() routine to zpl_file.c and adding this routine to the zpl_address_space_operations structure.

I tried that but I did not bother with the config macros that detect what kind of VFS API is in use with my 4.4.113 kernel. Here's what I added to zpl_file.c:

static size_t
zpl_direct_IO(struct kiocb *kiocb, struct iov_iter *from, loff_t offset)
{
        if(iov_iter_rw(from)== WRITE){
                return (zpl_iter_write_common(kiocb, from->iov, from->nr_segs, kiocb->ki_nbytes));
        }
        return (zpl_iter_read_common(kiocb, iovp, nr_segs, kiocb->ki_nbytes));
}

const struct address_space_operations zpl_address_space_operations = {
        .readpages      = zpl_readpages,
        .readpage       = zpl_readpage,
        .writepage      = zpl_writepage,
        .writepages     = zpl_writepages,
        .direct_IO      = zpl_direct_IO
};

This does not work. Calling zpl_iter_write_common() with 4 parameters was copy&pasted from the 2015 patch, but zpl_iter_write_common() has 6 parameters right now. To make this work I need some expert advice.

How do I add the missing parameters to zpl_iter_write_common() and zpl_iter_read_common()? Does that make sense at all?

Since adding direct IO support into my kernel with the above hack failed I decided to ignore O_DIRECT outside of ZFS. LD_PRELOAD is your friend if you try to replace a system call with something else. In my case I created the library libOpenWithoutDirectIO.so from the following source code:

/* compile this with
   gcc -Wall -fPIC -shared -o libOpenWithoutDirectIO.so thisfile.c -ldl
   and use
   export LD_PRELOAD=/path/to/libOpenWithoutDirectIO.so
*/
#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>

int open(const char *path, int flags, ...){
        static int (*func)(const char *path, int flags, ...);

        if(!func) func=dlsym(RTLD_NEXT,"open");
        flags &= ~O_DIRECT;
        if(flags & O_CREAT){
                va_list a; mode_t mode;
                va_start(a,flags); mode=va_arg(a,mode_t); va_end(a);
                return func(path, flags, mode);
        }
        return func(path, flags);
}

int open64(const char *path, int flags, ...){
        static int (*func)(const char *path, int flags, ...);

        if(!func) func=dlsym(RTLD_NEXT,"open64");
        flags &= ~O_DIRECT;
        if(flags & O_CREAT){
                va_list a; mode_t mode;
                va_start(a,flags); mode=va_arg(a,mode_t); va_end(a);
                return func(path, flags, mode);
        }
        return func(path, flags);
}

This will remove O_DIRECT on every open()/open64() system call. I don't like this because O_DIRECT should be removed if and only if path is pointing to a file that is located within a ZFS-volume. With the above hack O_DIRECT is removed from every open()/open64() call.
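
One possible way to narrow the hack to ZFS paths is to ask statfs(2) for the filesystem type before stripping the flag. The sketch below assumes that ZFS on Linux reports the magic number 0x2fc12fc1 in `f_type`; that constant and both function names are assumptions of this example and should be checked against the ZFS headers on the target system.

```c
#include <sys/vfs.h>

/* Assumed value of ZFS_SUPER_MAGIC; verify against your ZFS headers. */
#define ASSUMED_ZFS_SUPER_MAGIC 0x2fc12fc1UL

/* Return 1 if a statfs f_type value matches the assumed ZFS magic. */
static int fs_type_is_zfs(unsigned long f_type)
{
        return f_type == ASSUMED_ZFS_SUPER_MAGIC;
}

/*
 * Return 1 if `path` lives on ZFS, 0 if it lives elsewhere,
 * -1 if statfs() fails (for example, the path does not exist yet).
 */
static int path_is_on_zfs(const char *path)
{
        struct statfs sfs;

        if (statfs(path, &sfs) != 0)
                return -1;
        return fs_type_is_zfs((unsigned long)sfs.f_type);
}
```

Inside the interposed open(), the flag would then be cleared only when `path_is_on_zfs(path) == 1`. For an O_CREAT open of a file that does not exist yet, the check would have to be made on the parent directory instead.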

But Oracle is running now on top of ZFS.

Any comments?

Peter

Hi Peter,

This will remove O_DIRECT on every open()/open64() system call. I don't like this because O_DIRECT should be removed if and only if path is pointing to a file that is located within a ZFS-volume. With the above hack O_DIRECT is removed from every open()/open64() call.

I tried a similar thing over a year ago and my listener was not able to work in xml mode, only plaintext mode. I could not patch that, yet the database was working. I have to search the writeup of my work at home and post it here. It's so funny to see that we both tried to solve the problem similarly. I went also one step further and tried to patch glibc to just use it everywhere, transparently, yet that did not work in the limited time I had. I finished my investigation after spending over 10h on that topic.

@lnxbil I can tell you that my interposer does the job.

@behlendorf acf0ade seems unrelated to Direct IO... did you mean to close this one?

@au-phiware whoops, no I did not. It was accidentally caused by merging the SPL and its history into the ZFS repository. See PR #7556, we'll probably have a few more of these.

What is the progress on this? I tried installing oVirt, but, as oVirt needs direct IO, the installation failed.

My workaround is to use a ZVOL with XFS in it.

@behlendorf any update on the matter? I agree that O_DIRECT can be implemented by simply treating it as a hint ZFS can ignore. As another step, it would be great if O_DIRECT requests pollute the ARC as little as possible (basically what you suggested in your comment on 22 Apr 2014).

No one I'm aware of is working on this. If someone would like to take a crack at it I'm happy to help with the design, which could initially be the basic one described above, and review the changes.

@behlendorf Well, I just tried a small C program with O_DIRECT [1] on FreeBSD 11.x and it really seems O_DIRECT is ignored: writes go into the ARC and are served from it when data is read. ZFS compression for the dataset is off.

This does not surprise me: O_DIRECT implies zero-memory-copy and/or DMA from main memory to the disks themselves. While this should be possible with a standard filesystem, with CoW+checksum (and anything which transforms data as it flows, i.e. compression) this becomes very difficult.

# Before running the test program:
ARC Size:                               0.09%   1.14    MiB
        Target Size: (Adaptive)         100.00% 1.20    GiB
        Min Size (Hard Limit):          12.50%  153.30  MiB
        Max Size (High Water):          8:1     1.20    GiB

# After running it:
ARC Size:                               48.65%  596.61  MiB
        Target Size: (Adaptive)         100.00% 1.20    GiB
        Min Size (Hard Limit):          12.50%  153.30  MiB
        Max Size (High Water):          8:1     1.20    GiB

# Reading the just-written file shows data are served by the ARC (i.e. too fast to be coming from the disk)
root@freebsd:~ # dd if=/tank/test.img of=/dev/null bs=1M
512+0 records in
512+0 records out
536870912 bytes transferred in 0.188852 secs (2842809718 bytes/sec)

[1] Test program:

root@freebsd:~ # cat test.c
#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
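#define BLOCKSIZE (128*1024)

int main()
{
        void *buffer;
        int i = 0;
        int w = 0;
        buffer = malloc(BLOCKSIZE);
        if (buffer == NULL)
                return 1;
        buffer = memset(buffer, 48, BLOCKSIZE);
        /* O_CREAT requires an explicit mode argument */
        int f = open("/tank/test.img", O_CREAT|O_TRUNC|O_WRONLY|O_DIRECT, 0644);
        if (f < 0) {
                perror("open");
                free(buffer);
                return 1;
        }
        /* 512*8 writes of 128 KiB each = 512 MiB */
        for (i=0; i<512*8; i++) {
                w = write(f, buffer, BLOCKSIZE);
                if (w < 0) {
                        perror("write");
                        break;
                }
        }
        close(f);
        free(buffer);
        return 0;
}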

Am I correct? Would a simple "ignore O_DIRECT" policy be acceptable in current ZoL?
Thanks.

@ryao If I understand it correctly, your old patch a08c76a basically ignores O_DIRECT by serving reads/writes using normal AIO functions.

Any chances to update it for newer ZFS / kernel releases?

Would a simple "ignore O_DIRECT" policy be acceptable in current ZoL?

Something very similar would be acceptable. The approach taken by @ryao is an excellent start but we need to incorporate two additional changes to stay consistent with the intent of O_DIRECT. Otherwise we risk breaking existing applications which depend on this behavior.

  • O_DIRECT should behave as if O_SYNC were set.
    [edit] Requirement removed, the open(2) man page explicitly says this is not guaranteed.

  • As @rlaager suggested O_DIRECT IO's should imply primarycache=metadata for those blocks.

Both of these should be relatively easy to implement since all of the needed functionality already exists.

I disagree with the idea that O_DIRECT should imply O_SYNC. My interest here is with cache=none for qemu. As was already mentioned, qemu has cache=none (O_DIRECT) and cache=directsync (both O_DSYNC and O_DIRECT). If someone wants both O_DIRECT and O_SYNC, they can ask for both.
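
To spell out the distinction being drawn here, the sketch below maps qemu cache modes to the open(2) flags they are commonly described as using. The entries for `none` and `directsync` follow what is stated in this thread; the entries for `writethrough` and `writeback` are my understanding of qemu's documented behavior and should be treated as assumptions. The function name is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* O_DIRECT is only exposed with _GNU_SOURCE on glibc */
#endif
#include <fcntl.h>
#include <string.h>

/*
 * Map a qemu cache mode name to the extra open(2) flags it implies.
 * Returns -1 for a mode this sketch does not know about.
 */
static int cache_mode_to_open_flags(const char *mode)
{
        if (strcmp(mode, "none") == 0)
                return O_DIRECT;                /* bypass host cache, no sync */
        if (strcmp(mode, "directsync") == 0)
                return O_DIRECT | O_DSYNC;      /* bypass host cache and sync */
        if (strcmp(mode, "writethrough") == 0)
                return O_DSYNC;                 /* assumption: host cache plus sync */
        if (strcmp(mode, "writeback") == 0)
                return 0;                       /* assumption: host cache, no sync */
        return -1;
}
```

Only the two modes that include O_DIRECT are affected by this issue; the caller that wants durability as well asks for the sync flag separately.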

As @rlaager suggested, O_DIRECT should not imply O_SYNC. From open() man page:

O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.

In other words, O_DIRECT is somewhat similar to O_DSYNC but without the necessary I/O barriers (ie: fsync() and ATA FLUSH/FUA) to really immediately commit data to stable storage.

For a first implementation even simply ignoring O_DIRECT (similar to FreeBSD) should be better than current behavior (where open() with O_DIRECT fails with an error). If we can wire O_DIRECT with primarycache=metadata this would be great, however.

Thanks for explicitly calling out what the man page has to say about this. Given that, I agree we just want the minimal caching behavior.

I've opened #7823 with an updated version of @ryao's original patch. It does not implement the primarycache=metadata suggestion and instead behaves the same way as Illumos and FreeBSD. After some investigation I decided there were additional complexities which needed more analysis and would be better tackled at a later date. See the PR for additional details.