Mellanox/sockperf

How to read sockperf output

LuisRodriguezMSFT opened this issue · 4 comments

Hi, this may be a basic question, but I don't see a clear answer in the docs.

I am measuring the latency between two VMs. Given my network infrastructure, I would expect around 2 ms, but the percentiles show something different:

_sockperf ping-pong -i 10.12.0.4 --tcp -m 350 -t 200 -p 12345 --full-rtt
sockperf: == version #3.8-0.git31ee322aa82a ==
sockperf[CLIENT] send on:sockperf: using recvfrom() to block on socket(s)

[ 0] IP = 10.12.0.4 PORT = 12345 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: [Total Run] RunTime=199.999 sec; Warm up time=400 msec; SentMessages=105665; ReceivedMessages=105664
sockperf: ========= Printing statistics for Server No: 0
sockperf: [Valid Duration] RunTime=199.548 sec; SentMessages=105413; ReceivedMessages=105413
sockperf: ====> avg-rtt=1892.600 (std-dev=422.741, mean-ad=240.040, median-ad=243.961, siqr=169.591, cv=0.223, std-error=1.302, 99.0% ci=[1889.246, 1895.954])
sockperf: # dropped messages = 0; # duplicated messages = 0; # out-of-order messages = 0
sockperf: Summary: Round trip is 1892.600 usec
sockperf: Total 105413 observations; each percentile contains 1054.13 observations
sockperf: ---> MAX observation = 11541.068
sockperf: ---> percentile 99.999 = 11533.498
sockperf: ---> percentile 99.990 = 8954.485
sockperf: ---> percentile 99.900 = 7703.087
sockperf: ---> percentile 99.000 = 3031.973
sockperf: ---> percentile 90.000 = 2252.356
sockperf: ---> percentile 75.000 = 2013.136
sockperf: ---> percentile 50.000 = 1821.435
sockperf: ---> percentile 25.000 = 1673.954
sockperf: ---> MIN observation = 1270.749_

As you can see, the round trip is 1892.600 usec, which is OK (about 1.89 ms, if I am not mistaken).
But in the percentiles (90.000, 99.000, ...) I see higher times, up to 11533.498 usec for instance.

I would like to clarify how to read this data and what those percentiles mean.
Does this mean there were frames that took 11533.498 usec to be answered?
Is that something to worry about in terms of latency?

Thank you,

> I would like to clarify how to read this data and what those percentiles mean.

The 99th percentile is defined as the value that 99 out of 100 samples fall below. Thus 99 of every 100 samples observe a latency less than this value, and 1 in every 100 observes a latency equal to or greater than it.
See the function that prints the percentiles for more details: https://github.com/Mellanox/sockperf/blob/sockperf_v2/src/client.cpp#L576
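To make the definition concrete, here is a minimal Python sketch. It uses the common nearest-rank rule, which is an assumption and not necessarily the exact rule sockperf implements in the linked function. The sample values and the `percentile` helper are made up for illustration. It shows how a few slow replies barely move the average while dominating the high percentiles.

```python
import math
import statistics


def percentile(sorted_samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_samples:
        raise ValueError("no samples")
    # Multiply before dividing to keep the rank exact for round inputs.
    rank = math.ceil(p * len(sorted_samples) / 100)
    return sorted_samples[max(rank, 1) - 1]


# Hypothetical round-trip times in usec: 98 fast replies and 2 slow ones.
rtts_usec = sorted([1800.0] * 98 + [3000.0, 11500.0])

mean_rtt = statistics.fmean(rtts_usec)
p50 = percentile(rtts_usec, 50)
p99 = percentile(rtts_usec, 99)
worst = rtts_usec[-1]

print(f"mean={mean_rtt:.1f} usec, median={p50:.1f} usec, "
      f"p99={p99:.1f} usec, max={worst:.1f} usec")
```

In this made-up sample the mean stays close to the typical reply time even though the slowest reply is several times larger, which is the same pattern as in the output above.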

> Does this mean there were frames that took 11533.498 usec to be answered?

Yes.
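To put those slow replies in perspective, the following sketch estimates how many of the 105413 observations sit at or above each reported percentile. The only input taken from the output is the observation count. The helper name `samples_at_or_above` is hypothetical, and the counts are approximate because they ignore how ties and rounding are handled.

```python
TOTAL_OBSERVATIONS = 105413  # from "Total 105413 observations" in the output


def samples_at_or_above(percentile, total=TOTAL_OBSERVATIONS):
    """Approximate number of observations at or above the given percentile."""
    return total * (100.0 - percentile) / 100.0


for p in (99.0, 99.9, 99.99, 99.999):
    count = samples_at_or_above(p)
    print(f"percentile {p:7.3f}: about {count:8.2f} round trips "
          f"out of {TOTAL_OBSERVATIONS} were at least this slow")
```

By this estimate roughly one round trip in the whole 200-second run was as slow as the 99.999 percentile value, while about a thousand were at or above the 99.000 percentile value.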

Thanks Igor,

Since avg-rtt is much lower (1892.600 usec, about 1.89 ms), is that 99th percentile value something to worry about in terms of latency?
If my network requirement is an average of 2 ms or less, am I safe taking only avg-rtt into account?

I am asking because I tested this even on VMs within the same host (minimal latency), and while the avg-rtt is good, I still get high values at the 99th percentile.

It would be good to reduce the abnormal values. That might be done by tuning the servers and their configuration, binding the application to the correct cores, etc.
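As an illustration of what binding a process to a core looks like at the OS level, here is a minimal Python sketch using `os.sched_setaffinity`, which is available on Linux only. It is a generic example rather than a sockperf feature, the `pin_to_core` helper is hypothetical, and which core counts as the "correct" one depends on the machine (for example NUMA layout and where the NIC interrupts are handled), which this sketch does not try to determine.

```python
import os


def pin_to_core(core):
    """Pin the calling process to a single CPU core (Linux only).

    Returns the previous affinity set so the caller can restore it.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("CPU affinity control is not available here")
    previous = os.sched_getaffinity(0)  # 0 means the calling process
    if core not in previous:
        raise ValueError(f"core {core} is not in the allowed set {sorted(previous)}")
    os.sched_setaffinity(0, {core})
    return previous


if __name__ == "__main__":
    # Assumption for the demo: simply pick the lowest core we may run on.
    chosen = min(os.sched_getaffinity(0))
    old = pin_to_core(chosen)
    print(f"pinned to core {chosen}; previously allowed on {sorted(old)}")
```

Pinning keeps the scheduler from migrating the process between cores, which is one common source of occasional latency spikes. Whether it helps in a given VM setup has to be measured.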

Thanks Igor