nviennot/core-to-core-latency

[Result] AMD Ryzen 3900x

zommiommy opened this issue · 10 comments

Results on my Ryzen 3900x: output.csv

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 23
model           : 113
model name      : AMD Ryzen 9 3900X 12-Core Processor
stepping        : 0
microcode       : 0x8701013
cpu MHz         : 2200.000
cache size      : 512 KB
physical id     : 0
siblings        : 24
core id         : 0
cpu cores       : 12
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 16
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
bugs            : sysret_ss_attrs spectre_v1 spectre_v2 spec_store_bypass
bogomips        : 8003.65
TLB size        : 3072 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 43 bits physical, 48 bits virtual
power management: ts ttp tm hwpstate cpb eff_freq_ro [13] [14]
Glavo commented

I noticed that the results you submitted are strange: the core-to-core latency within the same CCX reaches more than 40ns, which should not happen. This problem also occurs on my 3700x machine:

[image: 3700x core-to-core latency matrix]

In the 3960x results that have already been submitted, the latency within the same CCX looks normal:

https://github.com/nviennot/core-to-core-latency#amd-ryzen-threadripper-3960x-380ghz-24-cores-zen-2-3rd-gen-2019-q4

@nviennot Can you take a look at this question?

nviennot commented

What's the question?

Glavo commented

What's the question?

@nviennot

My guess is that running core-to-core-latency on some Zen 2 machines produces unreasonable results.

For my 3700x machine (and the 3900x machine of the author of this issue), the core-to-core latency within the same CCX is over 40ns:

[images: latency matrices measured on the 3700x and the 3900x]

I think this is odd because Zen 2 uses a fully connected interconnect inside a CCX. For reference, I ran core-to-core-latency on my 5800x (ring-bus based), and all core-to-core latencies came out under 20ns.

As another reference, the 3960x result provided by this project shows intra-CCX latency that does not exceed 30ns:
https://github.com/nviennot/core-to-core-latency#amd-ryzen-threadripper-3960x-380ghz-24-cores-zen-2-3rd-gen-2019-q4

[image: 3960x reference latency matrix]

Finally, AnandTech measured the core-to-core latency of the 3950x with different software, and the latency within a CCX was also significantly lower than what we measured:

[image: AnandTech 3950x core-to-core latency measurement]

Based on these facts, I suspect that core-to-core-latency cannot measure the latency of some Zen 2 CPUs correctly. What do you think of this conjecture?

nviennot commented

The tool measures the latency correctly.

On your graphic, it says 2.2GHz, which is suspiciously low. That would explain these high numbers.

Glavo commented

The tool measures the latency correctly.

On your graphic, it says 2.2GHz, which is suspiciously low. That would explain these high numbers.

@nviennot

The 2.2GHz frequency shown for the 3900x is just placeholder data I filled in, because I don't know what frequency the issue author's CPU was actually running at.

The 3700x running at 4GHz is real data; that is the frequency I set it to. In other words, the measured intra-CCX core-to-core latency exceeds 40ns even with the cores at 4GHz.

I've also tried lowering the 3700x to 2.8GHz, which results in measured intra-CCX latency of over 60ns and cross-CCX latency of over 100ns.

nviennot commented

I don't know why your CPU doesn't behave the way you'd like, but I believe the tool is correct.

Glavo commented

I don't know why your CPU doesn't behave the way you'd like, but I believe the tool is correct.

@nviennot

I continued to investigate the issue, and a strange observation led me to conclude that core-to-core-latency has some Zen 2 specific issue:

The debug build of core-to-core-latency gets lower latency than the release build on Zen 2 machines.

This anomaly is only seen on Zen 2 machines. On other machines, the release build always has lower latency.

I have confirmed this again and again; it happens on both Windows and Linux.

[images: debug vs. release build latency results]

Should I open a new issue for this situation?

nviennot commented

You can try adding a PAUSE instruction in the spin loops.
Specifically, add it on

while state.flag.compare_exchange(PING, PONG, Ordering::Relaxed, Ordering::Relaxed).is_err() {}
and Line 56.

Do this by calling std::hint::spin_loop() within the currently empty while body.
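For example, a minimal sketch of that change, assuming the PING/PONG constants and the state variable from the bench source are in scope (not the exact code in the repo):

use std::sync::atomic::Ordering;

// Spin until the flag flips from PING to PONG. std::hint::spin_loop()
// compiles to a PAUSE instruction on x86_64, which relaxes the core
// while it busy-waits on the cache line.
while state.flag.compare_exchange(PING, PONG, Ordering::Relaxed, Ordering::Relaxed).is_err() {
    std::hint::spin_loop();
}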

The CPU does weird things, it's the reality of it

Glavo commented

You can try adding a PAUSE instruction in the spin loops. Specifically, add it on

while state.flag.compare_exchange(PING, PONG, Ordering::Relaxed, Ordering::Relaxed).is_err() {}

and Line 56.
Do this by calling std::hint::spin_loop() within the currently empty while body.

The CPU does weird things, it's the reality of it

I've applied the modification you suggested and re-run.

After making the modification, the latency of the release build went down. However, the release build's latency is still higher than the debug build's. The difference is small (about 1ns in mean latency), but it is stable and reproducible.

The latency of the debug build is unchanged from before the modification.

nviennot commented

That's an interesting problem. You can inspect the assembly to understand the difference between debug and release, then make the release build look like the debug build little by little and see what changes the performance.

To be extra safe, you can also wrap the AtomicBool on

flag: AtomicBool,
in CachePadded, like
owned_by_ping: CachePadded<AtomicBool>,
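
A minimal sketch of that change, assuming CachePadded here is crossbeam_utils::CachePadded and that the bench state struct roughly matches the snippets above (field names are taken from them, the rest is illustrative):

use crossbeam_utils::CachePadded;
use std::sync::atomic::AtomicBool;

struct State {
    // Padding the flag onto its own cache line avoids false sharing with
    // neighbouring fields, matching how owned_by_ping is already declared.
    flag: CachePadded<AtomicBool>,
    owned_by_ping: CachePadded<AtomicBool>,
}

Since CachePadded implements Deref, calls like state.flag.compare_exchange(...) keep working unchanged.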