h2oai/datatable

Implement sorting of columns with >2B rows


Currently the sort() function works with int32 indices only, which means we cannot sort columns that have more than 2B rows. This is a major obstacle to supporting really large datasets.

@st-pasha @oleksiyskononenko

Groupby sets a hard limit like this: XAssert(nrows <= Column::MAX_ARR32_SIZE) in src/core/groupby.cc:58. Sorry if this is a naive question, but is the scope of this task to change this limit?

@pradkrish Yes, that's correct.

@st-pasha Thanks, may I work on this issue? Where should the new limit be set?

@pradkrish You are certainly welcome to try, but don't expect this to be an easy challenge.

The limit of 2B exists because the RowIndex that is created by sort() is of type int32; in order to allow sorting larger arrays we would need to template-ize sort so that it can create int64 outputs as well.
A sibling issue to this one is #2335 (sort virtual columns).

I've already done some work on this in PR #2356, but I don't quite remember how far from the finish line it was.
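
To make the templatization idea a bit more concrete, here is a minimal sketch in C++. make_ordering and sort_column are made-up names, not datatable's actual API: the ordering buffer is parametrized on its index type, and a dispatcher picks int32 or int64 based on the row count.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sketch only: make_ordering() and sort_column() are illustrative
// names, not datatable's real API. The point is that the ordering vector is
// templated on its index type, so the same code can produce either int32 or
// int64 output indices.
template <typename TIndex>
std::vector<TIndex> make_ordering(const int32_t* data, std::size_t nrows) {
  std::vector<TIndex> order(nrows);
  for (std::size_t i = 0; i < nrows; ++i) {
    order[i] = static_cast<TIndex>(i);
  }
  std::sort(order.begin(), order.end(),
            [data](TIndex a, TIndex b) { return data[a] < data[b]; });
  return order;
}

void sort_column(const int32_t* data, std::size_t nrows) {
  constexpr std::size_t MAX32 =
      static_cast<std::size_t>(std::numeric_limits<int32_t>::max());
  if (nrows <= MAX32) {
    auto order = make_ordering<int32_t>(data, nrows);
    // ... the real code would wrap `order` into an int32-backed (ARR32) RowIndex ...
    (void) order;
  } else {
    auto order = make_ordering<int64_t>(data, nrows);
    // ... and here into an int64-backed (ARR64) RowIndex ...
    (void) order;
  }
}

The real change would presumably also have to flow through RowIndex, Groupby, and every place that currently asserts MAX_ARR32_SIZE, such as the check in src/core/groupby.cc quoted above.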

@st-pasha I will be happy to try (with some guidance). Even if I don't succeed, it will help me to learn a bit more about the design of the system.

This is what I have done so far, without any changes to the code. I fread a jay file with 3B (B for billion) rows and 1 column from my disk, as follows:

from datatable import dt, f
DT = dt.fread("data.jay")
DT.sort(0)

The file was read successfully, but the sort failed with the following error: MemoryError: Unable to allocate memory of size 12000000000 [errno 12] Cannot allocate memory. The stype of my column is int32 and there are 3B elements, hence the 12GB.

Back-tracing, the function call that results in this error is container_o.ensure_size(n * sizeof(int32_t)) in src/core/sort.cc:486. This happens during the construction of the sort context, even before the call to start_sort(..). Correct me if I am wrong, but is this related to what you said above? ensure_size is trying to allocate a 12GB block of memory, and it fails because that exceeds my RAM size. Any feedback? Thanks.

So, in PR #2356 a replacement sort function is introduced (located in src/sort/*), but I think it may not work for all column types just yet. In order to enable this function you'd set the global option dt.options.sort.new = True.

As for the memory error, the issue here is that the sorter needs RAM to create the output ordering vector, plus it also needs additional memory for intermediate structures. If the overall RAM size is small then this could present a problem... In theory, we could try to create and mmap a temporary file in this case, but it hasn't been implemented yet.
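
For what it's worth, here is a bare-bones sketch of what that temporary-file fallback could look like on a POSIX system; alloc_file_backed and free_file_backed are made-up helpers, not datatable's Buffer API.

#include <cstddef>
#include <cstdlib>     // mkstemp
#include <sys/mman.h>  // mmap, munmap
#include <unistd.h>    // ftruncate, unlink, close

// Rough sketch of the "spill to disk" idea: back the ordering buffer with a
// temporary file instead of an anonymous allocation, so the OS can page it in
// and out. This is not datatable's implementation, just an illustration of
// the fallback described above.
void* alloc_file_backed(std::size_t nbytes, int* out_fd) {
  char path[] = "/tmp/dt_sort_XXXXXX";
  int fd = mkstemp(path);        // create a unique temporary file
  if (fd < 0) return nullptr;
  unlink(path);                  // file is deleted automatically once fd is closed
  if (ftruncate(fd, static_cast<off_t>(nbytes)) != 0) {
    close(fd);
    return nullptr;
  }
  void* ptr = mmap(nullptr, nbytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
  if (ptr == MAP_FAILED) {
    close(fd);
    return nullptr;
  }
  *out_fd = fd;
  return ptr;
}

void free_file_backed(void* ptr, std::size_t nbytes, int fd) {
  munmap(ptr, nbytes);
  close(fd);
}

The trade-off would be that sorting becomes bound by disk speed, but at least it would not fail outright with a MemoryError on small-RAM machines.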

@st-pasha After setting the global option, it failed again with the same error while trying to set a buffer for row indices at src/core/sort/sorter.h:70. My RAM size is 8GB.

It appears to me that we may have to create a subtask to address the memory-allocation problem for row indices, and that subtask needs to be solved before we can address sorting >2B rows. What do you think?