Performance Tuning
dereks-igor opened this issue · 8 comments
Hello! Thank you for publishing Dhara.
I have been playing with it and wanted to share some performance numbers. I am using a NAND Flash chip on an embedded system. When I use my raw nand driver, I get the following numbers writing and reading full pages. (This output is from my UART shell:)
# nand_write_read
..............................
Write 3840.0 KB at 986.988 KB/s
..............................
Read 3840.0 KB at 1998.049 KB/s
Done. Marked 0 this run, 0 total
#
However, when I run a Dhara write/read loop on a freshly erased NAND (with a brand new map/journal), I get these numbers:
# dhara_write_read
................................
Write 4000.0 KB at 620.230 KB/s
................................
Read 4000.0 KB at 271.348 KB/s
#
I was surprised that reads are slower than writes. Dhara sector reads are 86% slower than raw NAND page reads. Dhara sector writes are 37% slower than raw NAND page writes.
Through logging, I can see that this is because of two reasons:
- Dhara walks the Radix tree from the root for every sector lookup
- Dhara does a partial page read each time it walks a node of the tree
Dhara walks the Radix tree from the root for every sector lookup
Here is the comment at the top of trace_path()
:
/* Trace the path from the root to the given sector, emitting
* alt-pointers and alt-full bits in the given metadata buffer.
Since the vast majority of sector accesses are sequential, meaning the next target sector will be sector+1 (since file data is read or written in big streams), it seems like there could be an easy optimization to not search the entire tree every time. Perhaps it could remember the last sector found, and then start searching at the sibling node of the Radix tree. I looked at the loop but did not see an obvious way to implement that.
Dhara does a partial page read each time it walks a node of the tree
Here are some logs showing Dhara's partial page reads of metadata as it walks the tree. Notice that it does ~10 small metadata reads of len 132 bytes each, before it has the page address and can read the full page of len 2048 bytes (from offset 0) that the user is seeking.
E (11422) nand: nand_read_page: blk 18, pg 15, pg_addr 0x48F, pg_offset 548, p_data 0x20009354, len 132
E (11423) nand: nand_read_page: blk 9, pg 47, pg_addr 0x26F, pg_offset 284, p_data 0x20009354, len 132
E (11424) nand: nand_read_page: blk 5, pg 31, pg_addr 0x15F, pg_offset 152, p_data 0x20009354, len 132
E (11425) nand: nand_read_page: blk 3, pg 15, pg_addr 0xCF, pg_offset 1076, p_data 0x20009354, len 132
E (11426) nand: nand_read_page: blk 2, pg 15, pg_addr 0x8F, pg_offset 548, p_data 0x20009354, len 132
E (11427) nand: nand_read_page: blk 1, pg 47, pg_addr 0x6F, pg_offset 284, p_data 0x20009354, len 132
E (11428) nand: nand_read_page: blk 1, pg 31, pg_addr 0x5F, pg_offset 152, p_data 0x20009354, len 132
E (11429) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 1076, p_data 0x20009354, len 132
E (11430) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 548, p_data 0x20009354, len 132
E (11431) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 284, p_data 0x20009354, len 132
E (11432) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 152, p_data 0x20009354, len 132
E (11433) nand: nand_read_page: blk 1, pg 1, pg_addr 0x41, pg_offset 0, p_data 0x2000d1c0, len 2048
.E (11435) nand: nand_read_page: blk 18, pg 15, pg_addr 0x48F, pg_offset 548, p_data 0x20009354, len 132
E (11436) nand: nand_read_page: blk 9, pg 47, pg_addr 0x26F, pg_offset 284, p_data 0x20009354, len 132
E (11437) nand: nand_read_page: blk 5, pg 31, pg_addr 0x15F, pg_offset 152, p_data 0x20009354, len 132
E (11438) nand: nand_read_page: blk 3, pg 15, pg_addr 0xCF, pg_offset 1076, p_data 0x20009354, len 132
E (11439) nand: nand_read_page: blk 2, pg 15, pg_addr 0x8F, pg_offset 548, p_data 0x20009354, len 132
E (11440) nand: nand_read_page: blk 1, pg 47, pg_addr 0x6F, pg_offset 284, p_data 0x20009354, len 132
E (11441) nand: nand_read_page: blk 1, pg 31, pg_addr 0x5F, pg_offset 152, p_data 0x20009354, len 132
E (11442) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 1076, p_data 0x20009354, len 132
E (11443) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 548, p_data 0x20009354, len 132
E (11444) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 284, p_data 0x20009354, len 132
E (11445) nand: nand_read_page: blk 1, pg 2, pg_addr 0x42, pg_offset 0, p_data 0x2000d1c0, len 2048
...etc...
The large number of semi-random accesses to different blocks and pages is the reason for Dhara's read slowness on my system. There is a considerable amount of overhead to kick off a transaction -- a 132 byte metadata read takes just about as long as a full 2048 byte full page read, due to bus and DMA setup and kickoff.
It seems like there is an opportunity to cache and avoid all these NAND I/O calls. Either by caching ten metadata entries (since it seems to follow mostly the same tree path in sequential block reads, and they are only 132 bytes each), or by having one or two full-page caches at the driver level (so that, for example, all those reads to block 1, page 15 could be reduced to just one full-page read, and then the various offset reads for metadata could come from RAM).
This is just for your information only. Dhara's performance already meets my system requirements, so I don't plan to implement any optimizations. But it seems like there is some low-hanging fruit that could result in significant performance improvements.
Thanks again!
Daniel,
Thank you for your response.
cache holding a few full pages at the driver level
I agree that it would be preferable to cache things at the NAND driver level. One of the most attractive features of Dhara is that it is clean and minimalist, just ~2000 lines of simple C code.
But looking at the above example, there are 11 (for page 1, because it's odd) or 10 (for page 2, because it's even) lookups spread across 8 different pages:
Block.Page.Offset
-----------------
18.15.548
9.47.284
5.31.152
3.15.1076
2.15.548
1.47.284
1.31.152
1.15.1076
1.15.548
1.15.284
To achieve a 100% cache hit success between page 1 and page 2 it would take
8 x 2048 B = 16384 B
of page buffers to cache the full lookup.
However, if those metadata chunks were cached individually, it would only take
10 x 132 B = 1320 B
of metadata buffers to cache the full lookup. That's 92% less RAM to get the same performance increase.
I think it would still be possible to implement the cache at the NAND driver level, rather than in Dhara directly. For any read of size that just happens to be of DHARA_META_SIZE (132 B), add it to the cache, indexed by Block.Page.Offset.
Quick update: I wrote a quick, simplistic metadata cache that caches the 11 most recently requested 132 B metadata chunks. It's just an array, and it uses simple brute-force for loops to search for the requested page address and offset. It's 252% faster. Not bad for 100 lines of code!
# dhara_write_read
................................
Write 4000.0 KB at 625.439 KB/s
................................
Read 4000.0 KB at 683.122 KB/s
#
I would definitively love this improvement!
@dereks-igor Appreciate this has been open for a while. Any chance you could share a patch, even if its rough for your caching implementation? I'm likely about to implement something similar. Would be good to have a more or less working starting point.
Following as well!
I was thinking, would journaling features still be guaranteed when implementing caching at the driver level?
In theory yes, as we only cache reads?