Dask and CUDA support for first and last reductions
Closed this issue · 1 comments
first
and last
reductions are not implemented for dask or CUDA. This is because both work in parallel and we have no control over the order of running, hence up until now we have had no way of determining what is first
and last
at the point when we need to combine data from different dask partitions or different CUDA threads.
However, there is a way forward by reusing the virtual row index of the recently added where
reduction (#1164). This allows each row of a DataFrame
to know its row index, so if this information carries through the pipeline until the point at which we, for example, combine the results from separate dask partitions then first
and last
can be calculated correctly.
In pseudocode, consider that
ds.first("some_column")
is equivalent to
ds.where(ds.min(virtual_row_index), "some_column")
It isn't quite as simple as that as we need to use the where
reduction the other way round from normal, i.e. using the virtual row index to determine what values of "some_column"
to return when in fact where
so far can only use the values of "some_column"
to determine what virtual row index to return. But the existing machinery can be generalised for reuse here.
This will need #1177 implemented first for CUDA support.
Dask support is provided in PR #1214.