martijnvanbrummelen/nwipe

Use ISAAC-64 on 64-bit systems

chkboom opened this issue · 16 comments

The performance of ISAAC on 64-bit systems may be improved by using the 64-bit version: https://www.burtleburtle.net/bob/rand/isaacafa.html

Perhaps a few #ifdefs could be used to select it at compile time; it obviously doesn't make sense to select it at run time.
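Something along these lines, as a hypothetical sketch only (the header and type names follow Bob Jenkins' reference code; the actual nwipe integration would differ):

```c
/* Hypothetical sketch of compile-time selection; header and type names
 * follow Bob Jenkins' reference code, not nwipe's actual layout. */
#include <stdint.h>

#if UINTPTR_MAX == 0xffffffffffffffffULL   /* building for a 64-bit target */
  #include "isaac64.h"                     /* 64-bit ISAAC variant */
  typedef uint64_t isaac_word_t;
#else
  #include "rand.h"                        /* original 32-bit ISAAC */
  typedef uint32_t isaac_word_t;
#endif
```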

Closed by #398

@chkboom @Firminator

I've built 64 bit and 32 bit versions of ShredOS and run a number of PRNG wipes using both Isaac32 and Isaac64: no verification, no blanking, so just a single PRNG pass and nothing else.

All these tests were run on the same 64bit hardware on the same 2TB hard drive. The results were not quite what I was expecting.

Firstly, for what does make sense: Isaac64 was always faster than Isaac32 when using ShredOS 64 bit, so no issue there. When using ShredOS 32 bit, Isaac32 and Isaac64 are identical in speed. Now for what doesn't make any sense: when running a PRNG wipe, ShredOS 32 bit is considerably faster than ShredOS 64 bit, knocking almost an hour off the wipe. This is the opposite of what I expected; I expected ShredOS 64 bit to be faster than ShredOS 32 bit when both are run on a 64 bit processor. See graph below.

There are processor variants that I build for, and the x86_64 version uses 'nocona'. I don't think this has anything to do with it, but maybe I'll look at that a little closer. Very odd that Isaac 32 & 64 code built for the very first Pentium (i586) runs faster on a 64 bit machine than the 64 bit code does.

I've looked at the output of both fast 32 bit wipes using hexedit and it's certainly producing random data.

So running a PRNG wipe using ShredOS 32 bit on a 64 bit processor knocks 56 minutes off a 2TB Isaac32 wipe compared to the same wipe with ShredOS 64 bit.

And running ShredOS 32 bit on a 64 bit processor knocks 38 minutes off a 2TB Isaac64 wipe compared to the same wipe with ShredOS 64 bit.

So on this i7-3770 3.4GHz processor it's faster to use the 32 bit version of ShredOS than the 64 bit version!? Very odd.

[Graph: zero fill vs Isaac64 vs Isaac32 vs Mersenne speed comparison]

That is strange. What speeds do you get with zero fills? Also what are the compile flags? Maybe some optimizations are being enabled differently for each one?

Also did you wait for some time between the 32-bit and 64-bit tests? If you performed the 64-bit test immediately after the 32-bit test, you're dealing with heated components, so there may have been some throttling effects.

@PartialVolume I'm not sure on this one. I think chkboom's suggestion right there to check the compile flags sounds about right. And the zero fill speed, yes, that would be a good comparison, and might help to home in on just the ISAAC code rather than compile options or all the other code in nwipe.

What speeds do you get with zero fills?

Just running the zerofills now, I'll update the graph later today.

Maybe some optimizations are being enabled differently for each one?
Yes, maybe so. The 32 bit build is a copy of the 64 bit build with two changes to the buildroot configuration: I change the config from x86_64 to i386 and change the processor variant from nocona to i586. So the entire code is being built for the lowest common denominator in terms of processor design, so that ShredOS will run on the broadest range of hardware. In the case of ShredOS 32 bit that's all processors from the first 32 bit Pentium, and in the case of x86_64 that's the first 64 bit Pentium 4 from about 2004. I'm not 100% sure about this, but the nocona variant refers to a processor design that also came about in 2004 on the Xeon processors. Don't quote me on any of that though, as my knowledge is a bit flaky in that area.
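For reference, the two changes correspond to Buildroot options along these lines (symbol names as found in Buildroot's x86 architecture config; treat the exact spelling as an assumption):

```
# 64-bit build          # 32-bit build
BR2_x86_64=y            BR2_i386=y
BR2_x86_nocona=y        BR2_x86_i586=y
```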

Also did you wait for some time between the 32-bit and 64-bit tests?

There was probably up to an hour between wipes, although I don't think it's got anything to do with thermal throttling. The ambient room temperature is 17 deg.C, the 2TB drive is outside the cabinet and reaches a maximum temperature of 33 deg.C during the wipe, and the CPU core is showing nwipe using 2% CPU.

I don't think it has anything to do with the hardware for the following reason. When you watch the disk activity LED, the LED is extinguished while nwipe is executing the write block command. On ShredOS 64 bit the LED is extinguished for about 1 second, while on ShredOS 32 bit the LED is extinguished for less than 0.5 seconds. The cycle is 4 seconds on, 1 second off, and is consistently the same throughout the entire wipe. As is typical of a traditional drive, the MB/s is faster at the start of the wipe and has progressively slowed slightly by the time the end of the disc is reached, because the outer cylinders contain more data per rotation than the inner cylinders.

If I didn't know better I'd say the block that's being written is twice as large on a 64 bit system. I don't actually know that, but the behaviour of the disc write LED is an interesting observation. When the core that's running the PRNG hits 100% for a second, does activity to the disc stop? But then wouldn't that also be the case for ShredOS 32?

All a bit of a puzzle, however if the root cause can be figured out it would certainly be nice to see the same sort of performance in 64 bit that I'm seeing in 32 bit.

I'm rebuilding all four ShredOS variants at the moment: 64 bit .img, 64 bit .iso, 32 bit .img and 32 bit .iso. I will release them shortly, maybe later today; it takes about 10 hours to build all four.

Perhaps I should also add the Mersenne Twister PRNG to the graph to see if there is any difference in speed between 32 and 64 bit.

As for troubleshooting this further, I'll probably install Debian 64 bit and 32 bit on this hardware and rerun the tests to see if the problem exists on that distro, in order to isolate it.

I've updated the graph above with zero fill. While the MB/s is similar irrespective of whether I use ShredOS 32 or 64, ShredOS 32 is actually marginally quicker for a zero fill than ShredOS 64 for the same wipe.

What's completely amazing is that any Isaac PRNG running on ShredOS 32 is only 2MB/s slower than a zero fill on ShredOS 64. At least on this processor.

Maybe it's to be expected, at least according to this thread: Do-32-bit-operating-systems-run-faster-on-64-bit-computers-than-64-bit-operating-systems

Do 32-bit operating systems run faster on 64-bit computers than 64-bit operating systems?

Short answer, yes. In general any 32 bit program runs slightly faster than a 64 bit program on a 64 bit platform, given the same CPU.
This is measurable and undeniable. For best results every executable should be recompiled for the target CPU. Yes there may be some opcodes that are only for 64 bit, but in general the substitution for 32 bit will not be much of a penalty.
You will have less utility, but that may not bother you. You may even need to run legacy 16 bit software. Which may not be possible on a 64 bit OS.

On ShredOS 64bit the LED is extinguished for about 1 second while on ShredOS 32 the LED is extinguished for less than 0.5 seconds. The LED on vs off period is 4 seconds on 1 second off and is consistently the same throughout the entire wipe.

Did you run the test with --sync 0?

You can replace the u32/u64_to_buffer() functions with memcpy() if you use fixed-width C99 stdint.h types (uint32_t and uint64_t) in the ISAAC PRNG instead of assuming long int fits the bill.
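A sketch of that replacement (the signature below is assumed for illustration, not copied from nwipe):

```c
#include <stdint.h>
#include <string.h>

/* Shift-based helper in the style of u64_to_buffer(); the real nwipe
 * signature may differ. */
static inline void u64_to_buffer_shift(unsigned char *buf, uint64_t val)
{
    for (int i = 0; i < 8; i++)
        buf[i] = (unsigned char)(val >> (i * 8));
}

/* memcpy() version: with a fixed-width uint64_t the compiler typically
 * lowers this to a single 8-byte store on little-endian targets. Byte
 * order differs on big-endian machines, which matters for reproducible
 * streams but not for wiping. */
static inline void u64_to_buffer_memcpy(unsigned char *buf, uint64_t val)
{
    memcpy(buf, &val, sizeof(val));
}
```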

I've updated the graph above with results for the Mersenne PRNG.

Did you run the test with --sync 0?

No, I used the default settings.

You can replace the u32/u64_to_buffer() functions with memcpy() if you use fixed-width C99 stdint.h types (uint32_t and uint64_t) in the ISAAC PRNG instead of assuming long int fits the bill.

Yes, I was also looking at the same functions to improve their efficiency. I was also wondering about the use of 'register' in the main functions; there seems to be anecdotal information that it can sometimes cause a performance hit, while other examples show large performance increases.

I am thinking about creating an experimental build that targets i7 processors specifically to see if that improves the 64 bit performance.

Another possibility is compiling nwipe with -m32 on the 64 bit build and then seeing if the 64 and 32 builds have the same performance.

Of course there is another way to theoretically make any PRNG wipe as fast as a zero fill on a multicore processor.

Each wipe creates its own PRNG thread. A given wipe's PRNG thread calculates two blocks of PRNG data ahead of the write, i.e. the PRNG is calculating during I/O to disc. As it's currently implemented, the PRNG causes a delay in writing to disc while the random data block is generated, because it's a sequential process rather than a concurrent one.

I've mentioned this in previous issues
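A minimal sketch of that double-buffering idea, assuming hypothetical names (this is not nwipe's code): one producer thread keeps the next PRNG block ready while the wipe thread writes the current one, so generation overlaps disk I/O:

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE (1024 * 1024)

/* Stand-in for the real generator, e.g. one ISAAC step per word. */
extern void prng_fill(uint8_t *buf, size_t len);

typedef struct {
    uint8_t         block[2][BLOCK_SIZE];
    int             ready[2];  /* 1 when block[i] holds fresh PRNG data */
    int             done;
    pthread_mutex_t lock;
    pthread_cond_t  cond;
} prng_pipe_t;

/* Producer thread: always fills the buffer the writer isn't using. */
static void *prng_producer(void *arg)
{
    prng_pipe_t *p = arg;
    int i = 0;

    pthread_mutex_lock(&p->lock);
    while (!p->done) {
        while (p->ready[i] && !p->done)      /* wait for writer to drain */
            pthread_cond_wait(&p->cond, &p->lock);
        if (p->done)
            break;
        pthread_mutex_unlock(&p->lock);
        prng_fill(p->block[i], BLOCK_SIZE);  /* generate outside the lock */
        pthread_mutex_lock(&p->lock);
        p->ready[i] = 1;
        pthread_cond_broadcast(&p->cond);
        i ^= 1;                              /* alternate between buffers */
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Writer side: block until the next buffer is ready... */
static const uint8_t *next_block(prng_pipe_t *p, int idx)
{
    pthread_mutex_lock(&p->lock);
    while (!p->ready[idx])
        pthread_cond_wait(&p->cond, &p->lock);
    pthread_mutex_unlock(&p->lock);
    return p->block[idx];
}

/* ...and after writing it to disk, mark it empty for the producer. */
static void release_block(prng_pipe_t *p, int idx)
{
    pthread_mutex_lock(&p->lock);
    p->ready[idx] = 0;
    pthread_cond_broadcast(&p->cond);
    pthread_mutex_unlock(&p->lock);
}
```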

I haven't seen that code, but doesn't nwipe use one thread per drive? I can have two hard drives (one really slow HDD and another really fast HDD) and they don't appear to interfere with one another's speed.
Another trick is to use --sync 0; you take advantage of the OS's cache. Since it's all flushed at the end anyway, you can be sure the wipe is actually written (at least to the HDD cache buffer anyway).
The OS will steadily write to the drive while you fill up the cache with randoms. I found that instead of the pulsing HDD light, I get a solid HDD light and much higher speed. It's especially noticeable on an older system.

I haven't seen that code, but doesn't nwipe use one thread per drive?

Yes, that's correct, so each wipe is concurrent relative to other wipes, but the write and PRNG generation within each wipe is a sequential process. This is of little consequence when using sync 0, as the bottleneck is normally I/O. Although I've not checked a sync 0 PRNG pass, which I will do next week, I'd expect it to be very fast, as there is no fdatasync, which is the only way to confirm the write was successful (in the current version of nwipe).

The issue I have with sync 0 is that no error checking is performed during the write to disc, so when you have a faulty drive the drive's MB/s will drop away to zero but not fail with an error. However, bearing in mind that sync 0 allows a big I/O performance boost, I started thinking about how I could use sync 0 but capture the error by some other means.

One way, which would involve minor changes to the code, would be to force a failure when zero bytes have been written to disc over a 10 second period. This would capture a faulty drive and force a failure message very quickly. The code to implement such a feature is much simpler than using direct I/O writes combined with concurrent PRNG generation.
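A minimal sketch of that timeout, with hypothetical names (not the code that went into nwipe):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define STALL_TIMEOUT_SECS 10

typedef struct {
    uint64_t bytes_written;  /* incremented by the wipe thread per write() */
    uint64_t last_seen;      /* snapshot taken by the monitor */
    time_t   last_progress;  /* wall clock time of last observed progress */
} stall_monitor_t;

/* Call periodically, e.g. once a second from the status update loop.
 * Returns -1 when no bytes have reached the drive for 10 seconds. */
static int check_for_stall(stall_monitor_t *m)
{
    time_t now = time(NULL);

    if (m->bytes_written != m->last_seen) {
        m->last_seen = m->bytes_written;
        m->last_progress = now;
        return 0;            /* still making progress */
    }
    if (now - m->last_progress >= STALL_TIMEOUT_SECS) {
        fprintf(stderr, "wipe stalled: 0 bytes written in %d seconds\n",
                STALL_TIMEOUT_SECS);
        return -1;           /* treat as a drive failure */
    }
    return 0;
}
```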

If the above change worked well on my pile of faulty drives, I would be more than happy to change the default sync to 0. That way we would have the best of both worlds: no delays during the write, and a fast response in terms of identifying a failing drive.

How does that sound?

Maybe instead of syncing, consider using a raw device? I know how to do it with BSD but I've never done this with Linux; I hear it's quite convoluted. In BSD it's there alongside the regular block device (e.g. /dev/rsd0c instead of /dev/sd0c).

Apparently raw devices are removed in BSD and deprecated in Linux, as Direct I/O does the same thing. Using Direct I/O would still involve creating a wipe thread that isn't slowed by PRNG creation, which would mean the PRNG block is buffered and available without any delay; this is doable but more complicated. I think a disk throughput timeout is a simpler mechanism, while retaining the OS provided disk cache.
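For comparison, a minimal sketch of Linux Direct I/O (the function name and error handling here are illustrative assumptions); O_DIRECT bypasses the page cache but requires aligned buffers:

```c
#define _GNU_SOURCE            /* exposes O_DIRECT on Linux */
#include <fcntl.h>
#include <linux/fs.h>          /* BLKSSZGET */
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Open a block device for Direct I/O and allocate a suitably aligned
 * buffer; the caller then writes multiples of the logical block size. */
int open_direct(const char *dev, void **buf, size_t bufsize)
{
    int fd = open(dev, O_WRONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    int sector = 512;
    ioctl(fd, BLKSSZGET, &sector);   /* logical block size of the device */

    /* O_DIRECT needs buffer, offset and length aligned; posix_memalign
     * satisfies the buffer alignment requirement. */
    if (posix_memalign(buf, (size_t)sector, bufsize) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```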

I'm going to release the v22 version of ShredOS first then run the above tests again with sync 0 and the new timeout code to see how the performance compares.

Apparently raw devices are removed in BSD and deprecated in Linux, as Direct I/O does the same thing. Using Direct I/O would still involve creating a wipe thread that isn't slowed by PRNG creation, which would mean the PRNG block is buffered and available without any delay; this is doable but more complicated. I think a disk throughput timeout is a simpler mechanism, while retaining the OS provided disk cache.

NetBSD, OpenBSD, DragonFlyBSD and FreeBSD all support raw devices; no sign of them being removed. Non-raw devices were removed from FreeBSD, and now all device files are raw. Not sure about Linux, but I would imagine direct I/O achieves a similar effect.

I think before even worrying about PRNG performance, focus on real world performance. In reality, with some changes (e.g. using uint32_t and uint64_t instead of the custom buffer copy functions), the only thing that becomes an issue is the I/O bottleneck. The PRNG stream might account for 1% of the time taken and I/O might account for 99%, in which case there's no point optimising, as you'll then have to deal with thread sync issues etc.

I'm going to release the v22 version of ShredOS first then run the above tests again with sync 0 and the new timeout code to see how the performance compares.

Any idea when this will come out (32 and 64)? It's not in the shredos github yet.

Any idea when this will come out (32 and 64)? It's not in the shredos github yet.

I'm creating a draft release on GitHub which is not publicly visible. The 64 bit .img has been completed; the 64 bit .iso and the 32 bit .img/.iso are being built. I'm hoping to complete this week, so depending on other commitments it should be later this week.