tromp/cuckoo

Additional CPU threads occupy significant memory space

kachind opened this issue · 13 comments

When running grin-miner with the c31 plugin, 1 thread uses ~11 GB of memory, as expected. However, running 64 threads consumes 15.5 GB, and going much beyond that causes a memory allocation crash.

This system is a Xeon Phi 7210, which has 256 logical cores and 16 GB of high-bandwidth memory. The extra (15.5 − 11) GB spread over the 63 additional threads works out to roughly 70 MB each. Is a ~70 MB allocation per thread actually necessary? Are there any mitigating actions that could be taken here?

tromp commented

Each thread needs an additional 42 MB, which is mostly sizeof(yzbucket).
I don't know why yours would have 70 MB per thread.
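For intuition, here is a rough sketch of where a fixed per-thread cost like that comes from. The struct layout and constants below are illustrative, not the actual mean_miner.hpp definitions (they are chosen so the total lands near 42 MB): each thread owns a private two-dimensional array of edge buckets so it can sort siphash outputs by their top bits without locking, and that private array is the sizeof(yzbucket)-scale overhead.

```cpp
#include <cstdint>
#include <cstdio>

constexpr size_t NX        = 128;  // assumed number of x-buckets
constexpr size_t NY        = 128;  // assumed number of y-buckets
constexpr size_t BUCKETSZ  = 168;  // assumed slots per bucket
constexpr size_t SLOTBYTES = 16;   // assumed bytes per edge slot

struct zbucket {
  uint8_t bytes[BUCKETSZ * SLOTBYTES];
};
using yzbucket = zbucket[NY];

struct thread_ctx {
  yzbucket buckets[NX];  // private to one thread: no locks on the hot path
};

int main() {
  std::printf("per-thread buffer: %zu MB\n",
              sizeof(thread_ctx) >> 20);        // 42 MB with these constants
  std::printf("64 threads add:    %zu MB\n",
              64 * sizeof(thread_ctx) >> 20);   // ~2.6 GB on top of the base
}
```

With these (made-up) constants, 64 threads add about 2.6 GB on top of the ~11 GB base, i.e. roughly 13.7 GB total, which is why the observed 15.5 GB looks high.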

Thanks, that's helpful. How do CUDA / OpenCL implementations avoid multiple 42 MB allocations?

I've noticed some other bizarre behavior at high thread counts, particularly that around 32 threads the GPS (graphs per second) stops increasing. If I start two 32-thread instances, their combined GPS is roughly double that of one 64-thread instance. I can open a new issue for that if you like.

tromp commented

You have limited memory bandwidth; once it's exhausted, additional threads just slow things down by increasing contention for it.

That doesn't explain why two separate instances break the bottleneck, though; it's the same total number of threads.

The Xeon Phi has approximately 500 GiB/sec of memory bandwidth to the MCDRAM, so I'm skeptical that's the issue.

tromp commented

That's sequential bandwidth. But your accesses are not all that sequential, causing a lot of memory latency as well. So I think of it as exhausting the latency bandwidth.
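A quick way to see the distinction (a hypothetical test harness, not anything from this repo): stream through a large buffer once, then walk the same buffer in a shuffled order. Both passes read the same bytes, but the shuffled walk touches one fresh cache line per access, so its throughput is bounded by how many misses the cores can keep in flight rather than by the quoted bus bandwidth.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
  const size_t N = size_t(1) << 26;           // 512 MiB of uint64_t
  std::vector<uint64_t> a(N, 1);

  auto secs = [](auto &&f) {
    auto t0 = std::chrono::steady_clock::now();
    f();
    return std::chrono::duration<double>(
               std::chrono::steady_clock::now() - t0).count();
  };

  // Sequential pass: hardware prefetchers keep full cache lines streaming.
  uint64_t sum = 0;
  double seq = secs([&] {
    sum = std::accumulate(a.begin(), a.end(), uint64_t(0));
  });

  // Random pass: nearly every access misses to a fresh line.
  std::vector<uint32_t> idx(N);
  std::iota(idx.begin(), idx.end(), 0u);
  std::shuffle(idx.begin(), idx.end(), std::mt19937{42});
  double rnd = secs([&] { for (uint32_t i : idx) sum += a[i]; });

  std::printf("sequential: %.2fs  random: %.2fs  (checksum %llu)\n",
              seq, rnd, (unsigned long long)sum);
}
```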

Does this AVX implementation produce much less sequential memory access than the OpenCL/CUDA implementations do? Otherwise I would expect GPUs to suffer from the same bandwidth issues, but they clearly achieve significantly better performance.

Sorry for all the questions, I'm not familiar with the details of this algorithm.

> 500 GiB/sec of memory bandwidth

Bandwidth numbers are completely misleading unless you adjust for CAS latency. You can only get 500 GiB/s writing straight down DRAM, with zero random access. If the implementation of Cuckoo Cycle caches partition writes into blocks of 16 edges at a time, then typical DDR4-3000 will slow down by more than 50%. Using MCDRAM instead of DDR DIMMs does not help with latency at all.
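To make "caching partition writes into blocks of 16 edges" concrete, here is a sketch of the write-combining pattern (my own illustration with assumed constants, not the repo's actual code): edges destined for each partition are staged in a small cache-resident buffer and flushed to DRAM sixteen at a time, so the memory system sees full-line bursts instead of isolated random stores.

```cpp
#include <cstdint>
#include <cstring>

constexpr int NBUCKETS = 256;  // assumed number of partitions
constexpr int BATCH    = 16;   // edges staged per partition before a flush

struct Edge { uint32_t u, v; };  // 8 bytes; 16 edges = two cache lines

struct Scatter {
  Edge  stage[NBUCKETS][BATCH];  // small enough to stay cache-resident
  int   fill[NBUCKETS] = {};
  Edge *cursor[NBUCKETS];        // per-partition write position in DRAM
                                 // (caller points these at bucket bases)

  void push(int b, Edge e) {
    stage[b][fill[b]++] = e;
    if (fill[b] == BATCH) {
      // One mostly-sequential 128-byte burst instead of 16 random stores.
      std::memcpy(cursor[b], stage[b], sizeof stage[b]);
      cursor[b] += BATCH;
      fill[b]    = 0;
    }
  }
};
```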

That being said, I find it dubious that running two solvers simultaneously would help this situation, unless they are writing to the same pages in RAM. If you allocate separate memory for each solver, that would only exacerbate the latency problem, so something else must be going on.

> Otherwise I would expect GPUs to suffer from the same bandwidth issues

They absolutely do suffer from this problem. Nvidia GPUs can only batch up to 32 words of 32 bits each. GDDR5 has a much higher transfer rate than DDR4, however, and GPUs compute siphash far faster than a CPU: 2^29 siphashes on a 1080 Ti take about 30-35 ms. I don't have a Xeon Phi, but I'm guessing it takes seconds.
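For reference, the hash being counted there is the keyed siphash-2-4 that Cuckoo Cycle uses to map edge indices to endpoints, with the four key words derived from the block header. A compact scalar rendition looks roughly like this (my own sketch, so treat the details as approximate rather than the solver's exact code):

```cpp
#include <cstdint>

static inline uint64_t rotl(uint64_t x, int b) {
  return (x << b) | (x >> (64 - b));
}

// siphash-2-4 of a 64-bit nonce; k[0..3] come from the block header.
uint64_t siphash24(const uint64_t k[4], uint64_t nonce) {
  uint64_t v0 = k[0], v1 = k[1], v2 = k[2], v3 = k[3] ^ nonce;
  auto round = [&] {  // one standard SipRound
    v0 += v1; v2 += v3; v1 = rotl(v1, 13); v3 = rotl(v3, 16);
    v1 ^= v0; v3 ^= v2; v0 = rotl(v0, 32);
    v2 += v1; v0 += v3; v1 = rotl(v1, 17); v3 = rotl(v3, 21);
    v1 ^= v2; v3 ^= v0; v2 = rotl(v2, 32);
  };
  round(); round();                    // 2 compression rounds
  v0 ^= nonce; v2 ^= 0xff;
  round(); round(); round(); round();  // 4 finalization rounds
  return v0 ^ v1 ^ v2 ^ v3;
}
```

It is pure 64-bit add/rotate/xor arithmetic, which is why wide SIMD (or a GPU warp) can batch many independent nonces at once.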

tromp commented

Siphashes on a Xeon Phi would benefit from AVX512 instructions, but those are not supported in my solvers, and they would still not make the Phi competitive with GPUs.

Yeah, here are the test details. I have to use the c29 algorithm, since otherwise I run out of memory.

(figures are plus or minus a couple of percent, since the solve time varies slightly between graphs)


```toml
[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 64
```

GPS: 0.24


```toml
[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 32
device = 0

[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 32
device = 1
```

Device 1 GPS: 0.18
Device 2 GPS: 0.18


I was hoping for a simple solution but it sounds like that's unlikely.

In your opinion, if the CUDA implementation were ported to OpenMP or similar, would the performance still be poor? The Xeon Phi KNL architecture has L1 and L2 caches, which I had hoped might mitigate the MCDRAM access latency, but if all of the accesses are totally unpredictable/random then it's obviously hopeless, and there's no practical motivation to fix this issue.

As an aside... KNL does support a family of unique 512-bit gather-prefetch instructions as part of AVX512-PF, and its cores are out-of-order with 4 threads per physical core.

https://www.felixcloutier.com/x86/vgatherpf1dps:vgatherpf1qps:vgatherpf1dpd:vgatherpf1qpd

This would permit a thread to assemble a full cache line from non-sequential data, provided it's all located in the same 4 KiB page, and the processor would not stall since it could move to a different hyperthread. Is that relevant here? Is data generally dispersed within pages, between pages, or both?
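For concreteness, here is roughly how those instructions could be deployed (a hypothetical loop; only the intrinsics themselves are real, and it needs -mavx512f -mavx512pf, so KNL only): issue a gather-prefetch for the next batch of scattered slots so they are on their way to L2 while the current batch is gathered and processed.

```cpp
#include <immintrin.h>
#include <cstdint>

// Hypothetical inner loop: 'table' stands in for a bucket array that is
// read at 16 scattered 32-bit slots per iteration.
void process_batch(const float *table, const int32_t idx[16],
                   const int32_t next_idx[16], float *out) {
  // Start pulling the NEXT batch's 16 scattered words toward L2
  // (vgatherpf1dps); this is only a hint, so it never stalls.
  __m512i vnext = _mm512_loadu_si512(next_idx);
  _mm512_prefetch_i32gather_ps(vnext, table, sizeof(float), _MM_HINT_T1);

  // Gather and use the CURRENT batch while those prefetches are in flight.
  __m512i vidx = _mm512_loadu_si512(idx);
  __m512  vals = _mm512_i32gather_ps(vidx, table, sizeof(float));
  _mm512_storeu_ps(out, vals);
}
```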

tromp commented

So the Phi performs worse than an i7 on the AVX2 solver, and if a Phi expert devoted significant effort to optimizing the solver, they might quadruple the performance; it would still fall far short of a GPU...

Fair enough, closing issue.