question about performance of `at(...)`
Hi, I'm trying out this library for a project where I have a performance bottleneck on a piece of code that uses a std::unordered_map to renumber some indices:
    template < typename Map >
    void renumber(const std::vector< uint64_t > & vertex_ids,
                  std::vector< std::array< uint64_t, 4 > > elements,
                  int num_threads) {
      bool supports_parallel_insertion =
          !std::is_same< Map, std::unordered_map<uint64_t, uint64_t> >::value;
      Map new_ids;
      std::atomic< uint64_t > new_id = 0;
      threadpool pool((supports_parallel_insertion) ? num_threads : 1);

      // insert vertex ids into a map, and give them a new compact index
      pool.parallel_for(vertex_ids.size(), [&](uint64_t i){
        auto id = new_id++;
        new_ids[vertex_ids[i]] = id;
      });

      // update connectivity lists to use the new indices
      pool.parallel_for(elements.size(), [&](uint64_t i){
        auto & elem = elements[i];
        elem[0] = new_ids.at(elem[0]);
        elem[1] = new_ids.at(elem[1]);
        elem[2] = new_ids.at(elem[2]);
        elem[3] = new_ids.at(elem[3]);
      });
    }
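For reference, threadpool here is a small utility from the benchmark, not something phmap provides: the constructor takes a thread count, and parallel_for(count, f) calls f(i) for every i in [0, count) across the pool's threads. A minimal sketch with that interface (not the actual implementation from the benchmark) would be:

    #include <algorithm>
    #include <cstdint>
    #include <functional>
    #include <thread>
    #include <vector>

    // minimal sketch of a thread pool with the interface used above: split
    // [0, count) into contiguous chunks, run each chunk on its own thread, join
    struct threadpool {
      explicit threadpool(int num_threads) : n(num_threads > 0 ? num_threads : 1) {}

      void parallel_for(uint64_t count, const std::function< void(uint64_t) > & f) {
        std::vector< std::thread > workers;
        uint64_t chunk = (count + n - 1) / n;
        for (uint64_t t = 0; t < uint64_t(n); ++t) {
          uint64_t begin = t * chunk;
          uint64_t end   = std::min(count, begin + chunk);
          if (begin >= end) break;
          workers.emplace_back([begin, end, &f]() {
            for (uint64_t i = begin; i < end; ++i) { f(i); }
          });
        }
        for (auto & w : workers) { w.join(); }
      }

      int n;
    };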
Of course std::unordered_map doesn't let you insert things in parallel, so every other part of the code benefits from 32 threads, but this one becomes a bottleneck because of the serial insertion step. When I time these two operations, I get (first timing is insertion, second is updating indices):
std::unordered_map, (single threaded insertion, single threaded update): 1609.12ms 6296.07ms
std::unordered_map, (single threaded insertion, multithreaded update): 1815.36ms 608.119ms
So, the insertion can't benefit from multiple threads, but the update step can (~10x speedup).
When I try using

    template < int n >
    using pmap = phmap::parallel_node_hash_map<
        uint64_t,
        uint64_t,
        std::hash<uint64_t>,
        std::equal_to<uint64_t>,
        std::allocator<std::pair<const uint64_t, uint64_t>>,
        n,            // log2 of the number of submaps
        std::mutex >;

as a drop-in replacement for std::unordered_map<uint64_t, uint64_t>, I get better insertion times
pmap<4>, 1 thread: 709.521ms 13461.7ms
pmap<4>, 32 threads: 617.853ms 5174.52ms
pmap<6>, 1 thread: 1216.25ms 13500.4ms
pmap<6>, 32 threads: 442.394ms 2610.72ms
pmap<9>, 32 threads: 194.461ms 1679.37ms
but considerably worse access times in the update step.
Is this expected, or am I perhaps using the wrong kind of container (or configuring them with the wrong parameters)?
The complete source file for this benchmark is available here.
Thank you
Hi @samuelpmish, thank you for using phmap.
I have a fix for you. You'll be happy with the speed:
- use phmap::parallel_flat_hash_map, it is faster.
- create another type alias (I called it pmap_nullmutex) which has phmap::NullMutex as the mutex type (sketched after the code below).
- call the renumber function with this extra type, as in: renumber< pmap<4>, pmap_nullmutex<4> >(vertex_ids, elements, 1);
- before the update part, swap the pmap that has the std::mutex with the one that has the NullMutex, and then use the new map with no mutex locking (which is unneeded, as we access the map only for reads), as in:
    stopwatch.start();
    Map_nomutex new_ids_nc;
    new_ids_nc.swap(new_ids);
    pool.parallel_for(elements.size(), [&](uint64_t i){
      auto & elem = elements[i];
      elem[0] = new_ids_nc.at(elem[0]);
      elem[1] = new_ids_nc.at(elem[1]);
      elem[2] = new_ids_nc.at(elem[2]);
      elem[3] = new_ids_nc.at(elem[3]);
    });
    stopwatch.stop();
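For completeness, the extra pieces implied above would look roughly like this. This is only a sketch: pmap is switched to parallel_flat_hash_map as suggested, pmap_nullmutex differs only in the mutex type, and renumber gains a second template parameter matching the call renumber< pmap<4>, pmap_nullmutex<4> >(...) shown above.

    // sketch of the suggested aliases: same parameters as before, but using
    // parallel_flat_hash_map, plus a second alias with phmap::NullMutex so that
    // lookups in the update step do no locking at all
    template < int n >
    using pmap = phmap::parallel_flat_hash_map<
        uint64_t,
        uint64_t,
        std::hash<uint64_t>,
        std::equal_to<uint64_t>,
        std::allocator<std::pair<const uint64_t, uint64_t>>,
        n,
        std::mutex >;

    template < int n >
    using pmap_nullmutex = phmap::parallel_flat_hash_map<
        uint64_t,
        uint64_t,
        std::hash<uint64_t>,
        std::equal_to<uint64_t>,
        std::allocator<std::pair<const uint64_t, uint64_t>>,
        n,
        phmap::NullMutex >;

    // renumber takes the no-mutex map type as an extra template parameter and
    // uses it (as Map_nomutex above) for the read-only update step
    template < typename Map, typename Map_nomutex >
    void renumber(const std::vector< uint64_t > & vertex_ids,
                  std::vector< std::array< uint64_t, 4 > > elements,
                  int num_threads);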
Here are the results I got:
generating dataset .. done
std::unordered_map, 1 thread: 1073.95ms 4651.66ms
std::unordered_map, 32 threads (single threaded insertion): 1253.99ms 464.183ms
pmap4, 1 thread: 649.7ms 2574.57ms
pmap4, 32 threads: 353.275ms 217.594ms
pmap6, 1 thread: 365.098ms 2615.59ms
pmap6, 32 threads: 204.003ms 219.373ms
Here is your updated benchmark
PS: would you mind if I add it as an example in phmap?
PS2: your benchmark feels very familiar. I used to develop a graphics library for finite element analysis at Altair Engineering :-).
thanks for the update, that was what I was missing!
would you mind if I add it as an example in phmap?
sounds good to me
your benchmark feels very familiar. I used to develop a graphics library for finite element analysis at Altair Engineering :-)
you've figured me out, it's taken from a finite element project -- thanks again for the help!
Oh, I just realized that your bench would be a bit faster if you reserve the size in your map up front:
new_ids.reserve(vertex_ids.size() * 110 / 100);
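That call would go near the top of renumber, e.g. right after constructing the map and before the insertion loop:

    Map new_ids;
    // pre-size for ~10% more entries than there are vertex ids,
    // so the parallel insertion loop doesn't trigger rehashing
    new_ids.reserve(vertex_ids.size() * 110 / 100);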