improve filesort()

Question

Opened this issue 7 years ago · 6 comments

clarkfitzg commented 7 years ago

I'm taking a look at the file-based sort now, and recording observations and design notes here as I go.

Current implementation

Based on bucket sort:

1. The manager reads the first nsamp=1000 rows of the file (or a sample of the distributed data frame).
2. The manager uses quantiles of the sample to choose approximately equal-sized bins.
3. Each worker reads all of the data and keeps only the rows that belong to its chunk.
4. Each worker sorts its chunk and makes it a global variable outdfnm.
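The four steps of the current implementation can be sketched as follows. This is a language-neutral illustration in Python, not partools code: the function names `choose_breaks` and `worker_sort`, the worker count `k = 4`, and the in-memory list standing in for the file are all assumptions made for the example.

```python
import bisect
import random


def choose_breaks(sample, k):
    """Pick k-1 cut points at evenly spaced quantiles of the sample."""
    s = sorted(sample)
    return [s[(i * len(s)) // k] for i in range(1, k)]


def worker_sort(all_rows, breaks, worker_id):
    """Scan every row, keep those in this worker's bin, then sort them."""
    mine = [x for x in all_rows if bisect.bisect_right(breaks, x) == worker_id]
    return sorted(mine)


random.seed(0)
data = [random.random() for _ in range(10_000)]  # stands in for the file
k = 4          # number of workers (hypothetical)
nsamp = 1000   # sample size, as in the description above

breaks = choose_breaks(data[:nsamp], k)
# Each worker scans the full data, so the data is read k times in total.
chunks = [worker_sort(data, breaks, w) for w in range(k)]

# Concatenating the chunks in worker order gives the fully sorted data.
merged = [x for chunk in chunks for x in chunk]
```

Because the bins are ordered ranges, no merge step is needed at the end: worker 1's chunk precedes worker 2's, and so on.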

Notes

Data must fit in memory on the cluster
The entire data is read k times, once for each of the k workers.

Alternative 1: Single read

data.table has high performance reading and sorting using multithreading. If the data will fit in memory then it may well be faster to read and sort in the manager, then send to the workers using distribsplit().
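A minimal sketch of the single-read idea, again in Python rather than R: sort once in the manager, then cut the sorted rows into contiguous pieces, one per worker. The helper `split_contiguous` is hypothetical, and the assumption that the split hands out contiguous row ranges is mine; it is not a statement about what distribsplit() does.

```python
def split_contiguous(sorted_rows, k):
    """Cut an already-sorted list into k contiguous, nearly equal pieces."""
    n = len(sorted_rows)
    bounds = [(i * n) // k for i in range(k + 1)]
    return [sorted_rows[bounds[i]:bounds[i + 1]] for i in range(k)]


rows = [7, 3, 9, 1, 4, 8, 2, 6, 5, 0]
# One read and one sort in the manager, then one piece per worker.
pieces = split_contiguous(sorted(rows), 3)
```

With contiguous pieces, each worker's chunk is sorted and the chunks are ordered across workers, which is the same end state as the bucket sort.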

Alternative 2: Temporary file based

Use steps 1 and 2 in the current implementation to determine bins. Then:

Assign each worker a subset of the distributed files to read and split
Each worker reads these files in chunks, splitting each chunk and appending the pieces to its own private set of bins. This results in the following temporary directory structure:

- worker1
    file_bin1
    file_bin2
    ...
- worker2
    file_bin1
    file_bin2
    ...
etc.

Finally, each worker is responsible for reading and sorting a subset of the bins. If there are more bins than workers then we can write the sorted files to disk without requiring that the data fit in memory. This approach requires 2 reads and 1 write, instead of k reads.
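The two phases might look roughly like the sketch below. It is written in Python for illustration only and runs the "workers" sequentially. The names `split_phase` and `sort_phase`, the one-number-per-line file format, and the hard-coded break points are assumptions; in practice the breaks would come from the sample quantiles.

```python
import bisect
import os
import tempfile
from itertools import islice


def split_phase(input_files, breaks, worker_dir, chunk_rows=1000):
    """Read the assigned files in chunks, appending rows to this worker's bins."""
    os.makedirs(worker_dir, exist_ok=True)
    for path in input_files:
        with open(path) as f:
            while True:
                chunk = list(islice(f, chunk_rows))
                if not chunk:
                    break
                by_bin = {}
                for line in chunk:
                    b = bisect.bisect_right(breaks, float(line))
                    by_bin.setdefault(b, []).append(line)
                for b, lines in by_bin.items():
                    bin_path = os.path.join(worker_dir, f"file_bin{b + 1}")
                    with open(bin_path, "a") as out:
                        out.writelines(lines)


def sort_phase(bin_id, worker_dirs, out_path):
    """Gather one bin from every worker directory, sort it, and write it out."""
    values = []
    for d in worker_dirs:
        p = os.path.join(d, f"file_bin{bin_id + 1}")
        if os.path.exists(p):
            with open(p) as f:
                values.extend(float(line) for line in f)
    values.sort()
    with open(out_path, "w") as out:
        out.writelines(f"{v}\n" for v in values)
    return values


def demo():
    with tempfile.TemporaryDirectory() as root:
        inputs = []
        for i, vals in enumerate([[5.0, 1.0, 9.0, 3.0], [8.0, 2.0, 7.0, 4.0]]):
            p = os.path.join(root, f"input{i + 1}.txt")
            with open(p, "w") as f:
                f.writelines(f"{v}\n" for v in vals)
            inputs.append(p)
        breaks = [3.5, 6.5]  # three bins; hard-coded for the example
        worker_dirs = [os.path.join(root, f"worker{i + 1}") for i in range(2)]
        for path, d in zip(inputs, worker_dirs):
            split_phase([path], breaks, d, chunk_rows=2)
        return [
            sort_phase(b, worker_dirs, os.path.join(root, f"sorted_bin{b + 1}.txt"))
            for b in range(3)
        ]


sorted_bins = demo()
```

Only one chunk (in the split phase) or one bin (in the sort phase) is held in memory at a time, which is why using more bins than workers relaxes the memory requirement.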

Answer 1 · 2017-04-29T18:34:17.000Z

Alternative 1 can be done right now. Alternative 2 offers a new capability, and is something I could use in the project I'm working on, so I'm leaning towards that one.

Answer 2 · 2017-04-29T22:02:16.000Z

As I've mentioned, there is a vast literature on disk sort, both classical and for Hadoop and Spark (where the sort is called a "shuffle"). The reason I haven't done it before is that it is quite daunting. Here are some aspects:

1. R vs. C/C++. It certainly would be nice to stick with the former, which may work well if we leverage "hidden" C/C++ (see point 2).
2. If R, then straight use of the 'parallel' library vs., say, Rmpi.
3. Where is the original data: a straight R file vs. SQL?
4. How does the fact that we want the process to produce a distributed file in sorted order impact any of this? Are the sorts ("shuffles") in Hadoop and Spark of relevance here?
5. What open source software can we leverage?

I've always been a fan of the elegance of Hyperquicksort. My ultimate plan was to try this in conjunction with Rmpi. I just now did a quick Web search, and found https://www.cs.utah.edu/~hari/files/pubs/sc13.pdf. You might take a look at it to get an idea of the issues involved.

Norm

Answer 3 · 2017-05-01T16:53:56.000Z

Regarding leveraging existing stuff, I think using base R and data.table for the in memory parts of the sort will get us pretty far. Thanks for that paper link.

Answer 4 · 2017-05-17T00:11:51.000Z

Started working on this today. To monitor progress I'd like to compare implementations here using the sort benchmarks http://sortbenchmark.org/. This in turn has required passing in more arguments for read.table() and write.table() because of the messy strings. So I'm working on that.

Answer 5 · 2017-05-17T02:14:35.000Z

On Tue, May 16, 2017 at 05:11:51PM -0700, Clark Fitzgerald wrote:

> Started working on this today. To monitor progress I'd like to compare implementations here using the sort benchmarks http://sortbenchmark.org/.

Worth a try, but that might not be fully relevant. For instance, they allow use of disk striping, which of course we cannot do.

Norm

Answer 6 · 2017-05-17T14:52:08.000Z

The sortbenchmark is more of a convenient source of data to be sorted. Then we can get a baseline on the current performance of partools for comparison as we work on it.