partition_copy vs copy_if/remove_copy_if
etiennemlb opened this issue · 8 comments
Hi,
I was looking that the performance of the thrust::partition_copy and found it quite slow compared to what I believe is a similar operation, that is, thrust::copy_if and thrust::remove_copy_if. In fact, partition_copy seems x2 slower.
Hardware | description |
---|---|
GPU | MI100:amdgcn-amd-amdhsa--gfx908:sramecc+:xnack- |
CPU | AMD EPYC 7542 |
Software | version |
---|---|
Distribution | Redhat |
ROCm | 4.5.0 to 5.0.2 |
The reproducer is a bit verbose but the code is straight forward:
#include <assert.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/host_vector.h>
#include <thrust/random.h>
#include <thrust/sort.h>
#include <ctime>
#include <iterator>
template <typename Duration = std::chrono::nanoseconds>
class StopWatch {
public:
/// Nanoseconds by default
///
using duration = Duration;
using time_point = std::chrono::steady_clock::time_point;
static_assert(std::chrono::steady_clock::is_steady, "Only steady clocks (the ones that only go forward) !");
public:
/// Start the StopWatch
///
StopWatch();
/// Return the Return time difference (duration) between the last Lap() or Reset() or object construction.
/// Do not call Elapsed() and then Reset(), just call Lap() instead.
///
duration Elapsed() const;
/// Same as lap but do not return the time difference (duration).
/// Do not call Elapsed() and then Reset(), just call Lap() instead.
///
void Reset();
protected:
time_point the_last_point_in_time_;
};
////////////////////////////////////////////////////////////////////////
// StopWatch methods definition
////////////////////////////////////////////////////////////////////////
template <typename Duration>
StopWatch<Duration>::StopWatch()
: the_last_point_in_time_{std::chrono::steady_clock::now()} {
// EMPTY
}
template <typename Duration>
typename StopWatch<Duration>::duration
StopWatch<Duration>::Elapsed() const {
// The cast is a noop if std::chrono::steady_clock::duration "is same" duration. For other duration, I dunno.
return std::chrono::duration_cast<duration>(std::chrono::steady_clock::now() - the_last_point_in_time_);
}
template <typename Duration>
void StopWatch<Duration>::Reset() {
the_last_point_in_time_ = std::chrono::steady_clock::now();
}
struct KeyInfZero {
template <typename Tuple>
__host__ __device__ constexpr bool
operator()(const Tuple& a_tuple) const {
static_cast<void>(a_tuple);
return thrust::get<0>(a_tuple) < 0;
}
};
template <typename T>
void initialize_keys(thrust::device_vector<T>& keys) {
thrust::default_random_engine rng;
thrust::uniform_int_distribution<int> dist(0, keys.size());
thrust::host_vector<T> h_keys(keys.size());
for(size_t i = 0; i < h_keys.size(); i++) {
h_keys[i] = dist(rng);
}
keys = h_keys;
}
int main(void) {
static constexpr size_t N = 10'000'000; // / 10;
{
thrust::device_vector<short> values0(N);
thrust::device_vector<double> x(N);
thrust::device_vector<double> y(N);
thrust::device_vector<double> z(N);
thrust::device_vector<double> mx(N);
thrust::device_vector<double> my(N);
thrust::device_vector<double> mz(N);
thrust::device_vector<double> c(N);
thrust::device_vector<double> w(N);
thrust::device_vector<short> _values0(N);
thrust::device_vector<double> _x(N);
thrust::device_vector<double> _y(N);
thrust::device_vector<double> _z(N);
thrust::device_vector<double> _mx(N);
thrust::device_vector<double> _my(N);
thrust::device_vector<double> _mz(N);
thrust::device_vector<double> _c(N);
thrust::device_vector<double> _w(N);
initialize_keys(values0);
// thrust::sort(std::begin(values0), std::end(values0)); // Sorting the keys does not improve the partitioning speed
auto input_iterator = thrust::make_zip_iterator(thrust::make_tuple(std::begin(values0),
std::begin(x),
std::begin(y),
std::begin(z),
std::begin(mx),
std::begin(my),
std::begin(mz),
std::begin(c),
std::begin(w)));
auto output_iterator_true = thrust::make_zip_iterator(thrust::make_tuple(std::begin(_values0),
std::begin(_x),
std::begin(_y),
std::begin(_z),
std::begin(_mx),
std::begin(_my),
std::begin(_mz),
std::begin(_c),
std::begin(_w)));
// reverse iterators have no overhead in this case
auto output_iterator_false = thrust::make_zip_iterator(thrust::make_tuple(thrust::make_reverse_iterator(std::end(_values0)),
thrust::make_reverse_iterator(std::end(_x)),
thrust::make_reverse_iterator(std::end(_y)),
thrust::make_reverse_iterator(std::end(_z)),
thrust::make_reverse_iterator(std::end(_mx)),
thrust::make_reverse_iterator(std::end(_my)),
thrust::make_reverse_iterator(std::end(_mz)),
thrust::make_reverse_iterator(std::end(_c)),
thrust::make_reverse_iterator(std::end(_w))));
StopWatch<> t;
thrust::partition_copy(input_iterator, input_iterator + N, output_iterator_true, output_iterator_false, KeyInfZero{});
auto duration = 1e-6 * t.Elapsed().count();
std::cout << "partition_copy: " << duration << " ms" << std::endl;
assert(thrust::is_partitioned(output_iterator_true, output_iterator_true + N, KeyInfZero{}));
}
{
thrust::device_vector<short> values0(N);
thrust::device_vector<double> x(N);
thrust::device_vector<double> y(N);
thrust::device_vector<double> z(N);
thrust::device_vector<double> mx(N);
thrust::device_vector<double> my(N);
thrust::device_vector<double> mz(N);
thrust::device_vector<double> c(N);
thrust::device_vector<double> w(N);
thrust::device_vector<short> _values0(N);
thrust::device_vector<double> _x(N);
thrust::device_vector<double> _y(N);
thrust::device_vector<double> _z(N);
thrust::device_vector<double> _mx(N);
thrust::device_vector<double> _my(N);
thrust::device_vector<double> _mz(N);
thrust::device_vector<double> _c(N);
thrust::device_vector<double> _w(N);
initialize_keys(values0);
// thrust::sort(std::begin(values0), std::end(values0)); // Sorting the keys does not improve the partitioning speed
auto input_iterator = thrust::make_zip_iterator(thrust::make_tuple(std::begin(values0),
std::begin(x),
std::begin(y),
std::begin(z),
std::begin(mx),
std::begin(my),
std::begin(mz),
std::begin(c),
std::begin(w)));
auto output_iterator_true = thrust::make_zip_iterator(thrust::make_tuple(std::begin(_values0),
std::begin(_x),
std::begin(_y),
std::begin(_z),
std::begin(_mx),
std::begin(_my),
std::begin(_mz),
std::begin(_c),
std::begin(_w)));
// reverse iterators have no overhead in this case
auto output_iterator_false = thrust::make_zip_iterator(thrust::make_tuple(thrust::make_reverse_iterator(std::end(_values0)),
thrust::make_reverse_iterator(std::end(_x)),
thrust::make_reverse_iterator(std::end(_y)),
thrust::make_reverse_iterator(std::end(_z)),
thrust::make_reverse_iterator(std::end(_mx)),
thrust::make_reverse_iterator(std::end(_my)),
thrust::make_reverse_iterator(std::end(_mz)),
thrust::make_reverse_iterator(std::end(_c)),
thrust::make_reverse_iterator(std::end(_w))));
StopWatch<> t;
thrust::copy_if(input_iterator, input_iterator + N, output_iterator_true, KeyInfZero{});
thrust::remove_copy_if(input_iterator, input_iterator + N, output_iterator_false, KeyInfZero{});
auto duration = 1e-6 * t.Elapsed().count();
std::cout << "partition_copy with copy_if: " << duration << " ms" << std::endl;
assert(thrust::is_partitioned(output_iterator_true, output_iterator_true + N, KeyInfZero{}));
}
return 0;
}
One could even do a stable_partition_copy by using a count_if to get the last iterator of the "true" partition and not using a reverse iterator.
Hi,
You are right that these are very similar operations, so much so that they all share the same implementation in the hip backend.
Partition uses rocPRIM's partition
under the hood, copy_if
and remove_if
use rocprim::select
. rocprim::select
and rocprim::partition
are both implemented with the same parameterized kernel partition_impl
in rocPRIM (along with other things it gets quite hairy)
Interestingly, part of the problem is that rocprim::partition
has the same interface as you're emulating here with reverse iterators, i.e. it compacts the unselected items to the end of the (single) output range in reverse order. Because of this rocThrust has an additional copy in partition
to copy the range to the two iterators and reverse the unselected range.
I'm fairly sure this is what you're seeing here, it also explains the 2x performance difference, especially since the predicate is very trivial compared to the cost of copying ~70 bytes per item.
The interface that is expected by rocThrust could be made in rocPRIM to resolve this, all the pieces are already there, I think long term that is what should be done.
Until that is resolved you could use CUB / hipCUB DeviceSelect::If
or rocprim::partition
Thanks, the details are much appreciated.
Regarding rocThrust's interface, I do not understand why the reversing of the out_false range is necessary. The postcondition is just a partition, not stable partition.
@Maetveis Can you please review this issue?
Regarding rocThrust's interface, I do not understand why the reversing of the out_false range is necessary. The postcondition is just a partition, not stable partition.
It's because that's the behavior of Thrust and due to Hyrum's Law it was probably best to match that instead of strictly interpreting what the documentation actually promises.
I was not yet involved with this project when this function was implemented, but some tests could also depend on this (overly constrained) behavior by mistake.
Reversing the range does not add a considerable overhead over the extra copy anyway, so it makes some sense to emulate Thrust's behavior.
@doctorcolinsmith, should we track this possible performance improvement here, or create an issue over at rocPRIM for the interface?
We are still considering this recommendation.
On rocprim's side: partition_two_way
has been merged. I hope to get the PR for rocThrust out sometime this week or early next week.