STEC EnhanceIO SSD Caching Software 27th March, 2018 0. THIS IS elmystico/EnhaceIO VERSION Code is ment to be used with initramfs/udev. The goal is to start as much as it is possible before local filesystems get mounted to allow root / partition and swaps to be cached. This code will generate: - udev rules to be put /etc/udev/rules - by initramfs hook script put /etc/initramfs-tools for automatic initrd generation for: - eio_cli binary to be used inside initramfs- this needs PyInstaller. This way one can cache root tilesystem and swap partitions as well! (but no write-back cache for root (boot) partition) Some patches from others from github for kernel part included. This is currently tested up to kernel 4.15 but only Debian distro. I will gradually add some description here. I'm looking for some help here! Please contact me on github if you want to do something good with me or want something to be implemented by me or have some constructive thoughts. THANKS! 1. WHAT IS ENHANCEIO? EnhanceIO driver is based on EnhanceIO SSD caching software product developed by STEC Inc. EnhanceIO was derived from Facebook's open source Flashcache project. EnhanceIO uses SSDs as cache devices for traditional rotating hard disk drives (referred to as source volumes throughout this document). EnhanceIO can work with any block device, be it an entire physical disk, an individual disk partition, a RAIDed DAS device, a SAN volume, a device mapper volume or a software RAID (md) device. The source volume to SSD mapping is a set-associative mapping based on the source volume sector number with a default set size (aka associativity) of 512 blocks and a default block size of 4 KB. Partial cache blocks are not used. The default value of 4 KB is chosen because it is the common I/O block size of most storage systems. With these default values, each cache set is 2 MB (512 * 4 KB). Therefore, a 400 GB SSD will have a little less than 200,000 cache sets because a little space is used for storing the meta data on the SSD. EnhanceIO supports three caching modes: read-only, write-through, and write-back and three cache replacement policies: random, FIFO, and LRU. Read-only caching mode causes EnhanceIO to direct write IO requests only to HDD. Read IO requests are issued to HDD and the data read from HDD is stored on SSD. Subsequent Read requests for the same blocks are carried out from SSD, thus reducing their latency by a substantial amount. In Write-through mode - reads are handled similar to Read-only mode. Write-through mode causes EnhanceIO to write application data to both HDD and SSD. Subsequent reads of the same data benefit because they can be served from SSD. Write-back improves write latency by writing application requested data only to SSD. This data, referred to as dirty data, is copied later to HDD asynchronously. Reads are handled similar to Read-only and Write-through modes. 2. WHAT HAS ENHANCEIO CHANGED TO FLASHCACHE? 2.1. A new write-back engine The write-back engine in EnhanceiO has been designed from scratch. Several optimizations have been done. IO completion guarantees have been improved. We have defined limits to let a user control the amount of dirty data in a cache. Clean-up of dirty data is stopped by default under a high load; this can be overridden if required. A user can control the extent to which a single cache set can be filled with dirty data. A background thread cleans-up dirty data at regular intervals. Clean-up is also done at regular intevals by identifying cache sets which have been written least recently. 2.2. Transparent cache EnhanceIO does not use device mapper. This enables creation and deletion of caches while a source volume is being used. It's possible to either create or delete cache while a partition is mounted. EnhanceIO also supports creation of a cache for a device which contains partitions. With this feature it's possible to create a cache without worrying about having to create several SSD partitions and many separate caches. 2.3. Large I/O Support Unlike Flashcache, EnhanceIO does not cause source volume I/O requests to be split into cache block size pieces. For the typical SSD cache block size of 4 KB, this means that a write I/O request size of, say, 64 KB to the source volume is not split into 16 individual requests of 4 KB each. This is a performance improvement over Flashcache. IO codepaths have been substantially modified for this improvement. 2.4. Small Memory Footprint Through a special compression algorithm, the meta data RAM usage has been reduced to only 4 bytes for each SSD cache block (versus 16 bytes in Flashcache). Since the most typical SSD cache block size is 4 KB, this means that RAM usage is 0.1% (1/1000) of SSD capacity. For example, for a 400 GB SSD, EnhanceIO will need only 400 MB to keep all meta data in RAM. For an SSD cache block size of 8 KB, RAM usage is 0.05% (1/2000) of SSD capacity. The compression algorithm needs at least 32,768 cache sets (i.e., 16 bits to encode the set number). If the SSD capacity is small and there are not at least 32,768 cache sets, EnhanceIO uses 8 bytes of RAM for each SSD cache block. In this case, RAM usage is 0.2% (2/1000) of SSD capacity for a cache block size of 4K. 2.5. Loadable Replacement Policies Since the SSD cache size is typically 10%-20% of the source volume size, the set-associative nature of EnhanceIO necessitates cache block replacement. The main EnhanceIO kernel module that implements the caching engine uses a random (actually, almost like round-robin) replacement policy that does not require any additional RAM and has the least CPU overhead. However, there are two additional kernel modules that implement FIFO and LRU replacement policies. FIFO is the default cache replacement policy because it uses less RAM than LRU. The FIFO and LRU kernel modules are independent of each other and do not have to be loaded if they are not needed. Since the replacement policy modules do not consume much RAM when not used, both modules are typically loaded after the main caching engine is loaded. RAM is used only after a cache has been instantiated to use either the FIFO or the LRU replacement policy. Please note that the RAM used for replacement policies is in addition to the RAM used for meta data (mentioned in Section 2.1). The table below shows how much RAM each cache replacement policy uses: POLICY RAM USAGE ------ --------- Random 0 FIFO 4 bytes per cache set LRU 4 bytes per cache set + 4 bytes per cache block 2.6. Optimal Alignment of Data Blocks on SSD EnhanceIO writes all meta data and data blocks on 4K-aligned blocks on the SSD. This minimizes write amplification and flash wear. It also improves performance. 2.7. Improved device failure handling Failure of an SSD device in read-only and write-through modes is handled gracefully by allowing I/O to continue to/from the source volume. An application may notice a drop in performance but it will not receive any I/O errors. Failure of an SSD device in write-back mode obviously results in the loss of dirty blocks in the cache. To guard against this data loss, two SSD devices can be mirrored via RAID 1. EnhanceIO identifies device failures based on error codes. Depending on whether the failure is likely to be intermittent or permanent, it takes the best suited action. 2.8. Coding optimizations Several coding optizations have been done to reduce CPU usage. These include removing queues which are not required for write-through and read-only cache modes, splitting of a single large spinlock, and more. Most of the code paths in flashcache have been substantially restructured. 2.9 Sequential I/O bypass EnhanceIO has removed the bypass of sequential IO available in flashcache. The sequential detection logic has a limited use case, espescially in a reasonably multithreaded scenario. 3. EnhanceIO usage 3.1. Cache creation, deletion and editing properties eio_cli utility is used for creating and deleting caches and editing their properties. Manpage for this utility eio_cli(8) provides more information. 3.2. Making a cache configuration persistent It's essential that a cache be resumed before any applications or a filesystem use the source volume during a bootup. If a cache is enabled after a source volume is written to, stale data may be present in the cache. It may cause data corruption. The document Persistent.txt describes how to enable a cache during bootup using udev scripts. In case an SSD does not come up during a bootup, it's ok to allow read and write access to HDD only in the case of a Write-through or a read-only cache. A cache should be created again when SSD becomes available. If a previous cache configuration is resumed, it may cause stale data to be read. 3.3. Using a Write-back cache It's absolutely necessary to make a Write-back cache configuration persistent. This is required particularly in the case of an OS crash or a power failure. A Write-back cache may contain dirty blocks which haven't been written to HDD yet. Reading the source volume without enabling the cache will cause incorrect data to be read. In case an SSD does not come up during a bootup, access to HDD should stopped. It should be enabled only after SSD comes-up and a cache is enabled. Write-back cache needs to perform clean operation in order to flush the dirty data to the source device(HDD). The clean can be either trigerred by the user or automatically initiated, based on preconfigured thresholds. These thresholds are described below. They can be set using sysctl calls. a) Dirty high threshold (%) : The upper limit on percentage of dirty blocks in the entire cache. b) Dirty low threshold (%) : The lower limit on percentage of dirty blocks in the entire cache. c) Dirty set high threshold (%) : The upper limit on percentage of dirty blocks in a set. d) Dirty set low threshold (%) : The lower limit on percentage of dirty blocks in a set. e) Automatic clean-up threshold : An automatic clean-up of the cache will occur only if the number of outstanding I/O requests from the HDD is below the threshold. f) Time based clean-up interval (minutes) : This option allows you to specify an interval between each clean-up process. Clean is trigerred when one of the upper thresholds or time based clean threshold is met and stops when all the lower thresholds are met. 4. ACKNOWLEDGEMENTS STEC acknowledges Facebook and in particular Mohan Srinivasan for the design, development, and release of Flashcache as an open source project. Flashcache, in turn, is based on DM-Cache by Ming Zhao.