cirrostratus: A C repository from bysshe

GG's AoE target
---------------

ggaoed is an AoE target implementation for Linux, distributed under the terms
of the GNU GPL, version 2 or later. ggaoed has the following features:

- A single process can handle any number of devices and any number of
  network interfaces
- Uses kernel AIO to avoid blocking on I/O
- Request merging: read/write requests for adjacent data blocks can
  be submitted as a single I/O request
- Request batching: multiple I/O requests can be submitted with a
  single system call
- Supports hotplugging/unplugging of network interfaces
- Uses eventfd for receiving notifications about I/O completion
- Uses epoll for handling event notifications
- Uses memory mapped packets to lower system call overhead when receiving and
  sending data
- Devices to export can be identified either by path or by UUID (using the
  libblkid library)
- Delayed I/O submission utilizing timerfd (experimental)

Motivation
----------

I wanted to play with Linux AIO, eventfd, memory mapped packets etc., and I
needed a project that can utilize all of these. The de facto standard solution
("write yet another small web server") seemed boring, so ggaoed was born.

This also means ggaoed is not meant to be portable.

Requirements
------------

The following software is needed in order to build ggaoed:

- glibc 2.8 (built on Linux kernel 2.6.27 or later)
- libaio 0.3.107
- libatomic_ops 1.2
- glib 2.12
- libblkid-dev
- docbook2x 0.8 (for building the man pages)

If you are using a recent enough major Linux distribution chances are high that
you can find the above already packaged, just install the suitable devel
packages.

Running ggaoed requires Linux kernel 2.6.27 or later. Using the memory mapped
packet interface for sending data requires kernel 2.6.31 or later.

Building and installing
-----------------------

Just run
	
	./configure
	make
	make install

You have to copy ggaoed.conf.dist to $(sysconfdir) and edit it according to
your needs.

Performance issues
------------------

io_submit() can block if you're not using direct I/O and the required data is
not in the page cache. That means that if you have one device using buffered
I/O, that device may block the processing of requests of other devices. Using
jumbo frames can reduce the impact if the client generally submits page aligned
I/O requests.

Even when using direct I/O, mapping the offset in the request to physical
location on the disk still happens synchronously. Example: if you're exporting
a file rather than a block device, then the location of the file must be read
from the disk synchronously before the requested I/O operation can be queued
to execute asynchronously. LVM also has a negative impact on I/O submission
latency but much smaller than a file system.

Performance wise the best is to export raw disk devices or partitions. The
ease of administration provided by LVM may well justify the slight performance
impact it brings. Export regular files only if you do not care about
performance.

io_submit() can block if the kernel runs out of block layer request slots. You
can set the limit in /sys/block/<disk>/queue/nr_requests.

There is a system-wide limit on the number of AIO requests. You can set the
limit in /proc/sys/fs/aio-max-nr.

Use jumbo frames if you want performance. The recommended MTU size is 9000 as
this is the most common size supported by most gigabit network equipment. You
can use larger MTU sizes but make sure all components (initiator, target, and
any switches between them) support the MTU size you want to use.

Use ggaoectl to monitor the performance of the daemon. If there are dropped
packets, increase the ring buffer size. A ring buffer that is too small can
cause severe performance drop because the client will have to re-transmit
often.

Use manageable switches and monitor packets dropped by the switch. If the
target ggaoed has higher bandwidth than the initiators (i.e. the target
is on a 10GigE link while the initiators has only 1GigE), then large queue
sizes can cause excessive packet drops which in turn kill performance.
Avoid ethernet flow control, you can get better performance by reducing the
queue lengths.

About the memory mapped ring buffers
------------------------------------

ggaoed uses a ring buffer mapped in memory to transfer packets to and from
the kernel to reduce the number of system calls. There are two rings: one
for receiving and one for sending data (the latter requires kernel version
2.6.31 or later). Each ring consists of a number of blocks, and every
block contains one or more full frames (packets). ring-buffer-size in the
config file tells the combined size of the two ring buffers.

The memory allocated for a single block must be physically contiguous.
Allocating a large physycally contiguous memory block can be hard especially
when the machine is running for some time and the memory gets fragmented.
You can try to increase /proc/sys/vm/min_free_kbytes if fragmentation is
a problem, but that is not a real solution either.

ggaoed uses a block size of 64 KiB by default, which means 7 jumbo frames fit
inside a single block. If allocating the ring buffer fails, ggaoed tries
smaller block sizes, but that means more memory will be wasted (since a block
can contain only full frames) and less frames (packets) can be placed inside
the ring buffer.

Here is a single table showing the relation between MTU, block size and the
proportion of memory wasted:

	MTU	Block size	Frames/block	Wasted
	----------------------------------------------
	9000	64k		7		 3.3%
	9000	32k		3		17.1%
	9000	16k		1		44.7%
	4608	64k		14		 0.6%
	4608	32k		7		 0.6%
	4608	16k		3		14.8%
	4608	 8k		1		43.2%

Due to kernel internals, the number of blocks must not exceed the number
of pointers that a single page can hold, that is, getpagesize() /
sizeof(void *). Putting it together, the maximum size of a single ring
buffer is 32 MiB on 64-bit architectures, and 64 MiB on 32-bit architectures,
provided that ggaoed can use the default 64 KiB block size. If the block
size has to be reduced to 32 KiB due to memory fregmentation, then the
maximum ring buffer size is halved as well etc.

Acknowledgements
----------------

Some ideas have been incorporated from vblade (http://aoetools.sf.net) and
qaoed (http://code.google.com/p/qaoed).

License
-------

GPL v2 or later. See COPYING for the details.

Gábor Gombás <gombasg@digikabel.hu>
bysshe/cirrostratus