alexandres/terashuf

Suggestion for additional comparison benchmark

jondegenhardt opened this issue · 3 comments

Hi,

First, I just want to say that I appreciate your work towards faster single-machine shuffling of large data sets. This is often useful.

Second, I wanted to point out some experiments I did to see how fast disk-based shuffling could be done with existing unix command line tools. The writeup is here: Shuffling large files, in the eBay tsv-utils github repo.

The approach does make use of a tool from the tsv-utils repo, tsv-sample, so it's not strictly traditional Unix tools, but it's close. The tsv-sample tool is very much like GNU shuf, but with an additional capability needed for the disk-based sampling algorithm. When used standalone for shuffling, tsv-sample reads all data into memory, like shuf.

The disk-based shuffling approach used by terashuf should be able to beat the disk-based shuffling approach I tested. It might make an interesting comparison point should you run your benchmarks again.

Hi Jon,

Sorry for the delay in getting back to you.

I ran a quick experiment and I'm not seeing any gains from sort's numeric sorting:

#!/bin/sh
set -x
FILE=random_numbers.txt
export TMPDIR=.
seq 1 100000000 | shuf > $FILE
ls -lh $FILE
time sort -S 100M < $FILE > /dev/null
time sort -k1,1nr -S 100M < $FILE > /dev/null
time shuf < $FILE > /dev/null
time MEMORY=.1  ./terashuf < $FILE > /dev/null

OUTPUT:

$ ./benchmark_sort.sh
+ FILE=random_numbers.txt
+ export TMPDIR=.
+ TMPDIR=.
+ seq 1 100000000
+ shuf
+ ls -lh random_numbers.txt
-rw-r--r-- 1 user user 848M Apr 15 13:10 random_numbers.txt
+ sort -S 100M

real    0m46.497s
user    1m13.298s
sys     0m2.605s
+ sort -k1,1nr -S 100M

real    1m14.157s
user    2m6.514s
sys     0m2.791s
+ shuf

real    0m31.794s
user    0m30.848s
sys     0m0.915s
+ MEMORY=.1
+ ./terashuf
trying to allocate 107374184 bytes

starting read
mean line-length is 7.89, estimated memory usage is 1.90 * 0.10 GB = 0.19 GB
Tip: If you would like use exactly 0.10 GB of memory, use MEMORY=0.0526 ./terashuf ...
lines read: 100000000, gb read: 0
Read 100000000 lines, 888888898 bytes, have 9 tmp files

starting write to output
lines written: 100000000, gb written: 0
done

real    0m19.837s
user    0m18.514s
sys     0m1.137s

Can you benchmark terashuf in your own scripts to confirm that it is in fact faster than sort? Given that it is O(N) vs. O(N log N), I would be very surprised if it were slower.
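(For intuition on the O(N) claim: a single pass that scatters lines into random temp files, followed by an in-memory shuffle of each file, already gives a uniform, linear-time disk-based shuffle. The sketch below illustrates that idea with standard tools; it is not terashuf's actual implementation, and the file names and bucket count are placeholders.)

# Pass 1: scatter each line into one of n temp buckets chosen uniformly at random.
awk -v n=16 'BEGIN{srand()} {print > ("bucket_" int(rand()*n))}' big_input.txt

# Pass 2: shuffle each bucket in memory and concatenate.
# A uniform scatter followed by per-bucket shuffles yields a uniform permutation,
# provided each bucket fits in memory.
for f in bucket_*; do shuf "$f"; done > shuffled.txt
rm bucket_*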

Thanks for getting back to me. It'll be a couple of days before I can try Terashuf in my environment, but I will take a look at it.

I need to take a closer look at your scripts to see whether they are measuring the same thing. I understand what you are trying to do: find a test that doesn't take multiple hours to run. But I want to check whether the scripts as written make the same comparison as my writeup.

One specific point: the idea behind the numeric sort is not that numeric sort runs faster than plain sort. It's that the pipeline runs faster than sort --random-sort. (sort --random-sort is one of the common comparison points for disk-based shuffling.) Using a numeric sort as the second stage of the pipeline is a necessity in the scheme because of the way random numeric values are attached to the file being shuffled. The approach I used could be done with Awk as well as with tsv-sample; I should add that as an example and benchmark. That'd be good because then there would be a pure Unix-tool comparison point.
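For reference, an Awk-based variant of that scheme looks roughly like the sketch below (this is only a sketch, not the exact pipeline from the writeup; the file names and the sort buffer size are placeholders):

# Attach a random numeric key to each line, numeric-sort on the key, drop the key.
awk 'BEGIN{srand()} {printf "%.17g\t%s\n", rand(), $0}' big_input.txt \
  | sort -k1,1n -S 2G \
  | cut -f 2- > shuffled.txt
# Note: duplicate keys fall back to sort's whole-line comparison, a minor
# source of bias for very large inputs.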

But I do expect Terashuf to be faster.

Let me know how the test goes!

I just pushed a minor modification that lets you set a SKIP env var telling terashuf to skip shuffling the first SKIP lines of a file. Set it to 1 to skip the header of a TSV file.

This can be done with some clever bash scripting as well, but I thought it wouldn't hurt to have it built in.
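For example (the file name and MEMORY value below are placeholders, and the pure-shell variant is only a rough equivalent):

# Using the new env var: the first SKIP lines are not shuffled
# (here: the TSV header).
SKIP=1 MEMORY=4 ./terashuf < data.tsv > shuffled.tsv

# Roughly the same with plain shell: emit the header, then shuffle the rest.
{ head -n 1 data.tsv; tail -n +2 data.tsv | ./terashuf; } > shuffled.tsv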