Suggestion for additional comparison benchmark
jondegenhardt opened this issue · 3 comments
Hi,
First, I just want to say that I appreciate your work towards faster single-machine shuffling of large data sets. This is often useful.
Second, I wanted to point out some experiments I did to see how fast disk-based shuffling could be done with existing unix command line tools. The writeup is here: Shuffling large files, in the eBay tsv-utils github repo.
The approach used does make use of a tool in the tsv-utils repo, tsv-sample, so it's not strictly traditional unix tools, but it's close. The tsv-sample tool is very much like GNU shuf, but with an additional capability needed for the disk-based sampling algorithm. When used for shuffling standalone, tsv-sample reads all data into memory, like shuf.
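For reference, standalone in-memory shuffling with either tool is a single command (file names are just placeholders):

shuf < big_file.tsv > shuffled.tsv
tsv-sample < big_file.tsv > shuffled.tsv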
The disk-based shuffling approach used by terashuf should be able to beat the disk-based shuffling approach I tested. It might make an interesting comparison point should you run your benchmarks again.
Hi Jon,
Sorry for the delay in getting back to you.
I ran a quick experiment and I'm not seeing any gains from sort's numeric sorting:
#!/bin/sh
set -x
FILE=random_numbers.txt
# Keep sort/terashuf temp files in the current directory.
export TMPDIR=.
# Test data: 100M shuffled integers, one per line (~848 MB).
seq 1 100000000 | shuf > $FILE
ls -lh $FILE
# Plain lexicographic sort vs. numeric sort, each limited to 100 MB of sort memory.
time sort -S 100M < $FILE > /dev/null
time sort -k1,1nr -S 100M < $FILE > /dev/null
# In-memory shuffle with GNU shuf.
time shuf < $FILE > /dev/null
# Disk-based shuffle with terashuf, limited to 0.1 GB of memory.
time MEMORY=.1 ./terashuf < $FILE > /dev/null
OUTPUT:
$ ./benchmark_sort.sh
+ FILE=random_numbers.txt
+ export TMPDIR=.
+ TMPDIR=.
+ seq 1 100000000
+ shuf
+ ls -lh random_numbers.txt
-rw-r--r-- 1 user user 848M Apr 15 13:10 random_numbers.txt
+ sort -S 100M
real 0m46.497s
user 1m13.298s
sys 0m2.605s
+ sort -k1,1nr -S 100M
real 1m14.157s
user 2m6.514s
sys 0m2.791s
+ shuf
real 0m31.794s
user 0m30.848s
sys 0m0.915s
+ MEMORY=.1
+ ./terashuf
trying to allocate 107374184 bytes
starting read
mean line-length is 7.89, estimated memory usage is 1.90 * 0.10 GB = 0.19 GB
Tip: If you would like use exactly 0.10 GB of memory, use MEMORY=0.0526 ./terashuf ...
lines read: 100000000, gb read: 0
Read 100000000 lines, 888888898 bytes, have 9 tmp files
starting write to output
lines written: 100000000, gb written: 0
done
real 0m19.837s
user 0m18.514s
sys 0m1.137s
Can you benchmark terashuf in your own scripts to confirm that it is in fact faster than sort? Given that it is O(N) vs. O(N log N), I would be very surprised if it were slower.
Thanks for getting back to me. It'll be a couple of days before I can try terashuf in my environment, but I will take a look at it.
I need to take a closer look at your script to see whether it is measuring the same thing. I understand the goal, which is to find a test that doesn't take multiple hours to run, but I want to check that the script as written is making the same comparison.
One specific point - the idea behind the numeric sort is not that numeric sort runs faster than plain sort; it's that it runs faster than sort --random-sort. (sort --random-sort is one of the common comparison points for disk-based shuffling.) Using numeric sort as the second stage of the pipeline is a necessity in this scheme because of the way random numeric values are attached to the file being shuffled. The approach I used could be done with Awk as well as with tsv-sample; I should add that as an example and benchmark. That would be good because then there would be a pure Unix tool comparison point.
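For concreteness, a rough sketch of what the Awk variant of that pipeline could look like, next to the sort --random-sort baseline (the file name and the -S memory setting are just placeholders; the actual writeup uses tsv-sample to generate the random keys):

# Baseline commonly used for disk-based shuffle comparisons: GNU sort's --random-sort.
# (It sorts by a random hash of each line, so duplicate input lines end up adjacent.)
sort --random-sort -S 100M input.txt > shuffled.txt

# Decorate / numeric-sort / strip: prepend a random key with awk,
# numeric-sort on that key (the second-stage sort mentioned above), then cut it off.
awk 'BEGIN { srand() } { printf "%.17f\t%s\n", rand(), $0 }' input.txt \
    | sort -k1,1nr -S 100M \
    | cut -f 2- \
    > shuffled.txt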
But I do expect Terashuf to be faster.
Let me know how the test goes!
I just pushed a minor modification that lets you set the SKIP env var, which tells terashuf to skip shuffling the first SKIP lines of a file. Set it to 1 to skip the header of a TSV file.
This can be done with some clever bash scripting as well, but I thought it wouldn't hurt to have it built in.
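Rough usage sketch (file names are placeholders):

# Pass the first line (e.g. a TSV header) through unshuffled, shuffle the rest.
SKIP=1 ./terashuf < data.tsv > shuffled.tsv

# Roughly equivalent shell-only version:
{ head -n 1 data.tsv; tail -n +2 data.tsv | ./terashuf; } > shuffled.tsv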