Performance Measurement
petergsnm opened this issue · 17 comments
I was measuring the performance of libuinet on a KVM VM (which uses virtio) with one core. I performed two tests -
- I first ran a simple TCP server which accepts connections and closes them. When I run this sample server, I get around 200K PPS and around 25K connections per second (CPS).
- Next, I modified my TCP server to use libuinet. When I run that program, I get around 40K PPS and 5K CPS.
I am not sure why the performance drops when I run with libuinet. I am planning to run callgrind to see where in libuinet the time is spent. But I expected libuinet to give me better performance.
I can't attach my simple TCP programs here, but if you drop me an email at peter.gsnm@gmail.com, I can share my sample programs.
The other thing is that, to get the 200K PPS (without libuinet), I made the sysctl changes listed below. I am not sure whether I need to modify some of your header files to get the equivalent of these sysctl changes. Can you please point me to where I need to make the changes?
fs.file-max = 5000000
net.core.netdev_max_backlog = 4000000
net.core.optmem_max = 10000000
net.core.rmem_default = 10000000
net.core.rmem_max = 3000000
net.core.somaxconn = 10000000
net.core.wmem_default = 10000000
net.core.wmem_max = 30000000
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_congestion_control = bic
net.ipv4.tcp_ecn = 0
net.ipv4.tcp_max_syn_backlog = 65000
net.ipv4.tcp_max_tw_buckets = 6000000
net.ipv4.tcp_mem = 30000000 30000000 30000000
net.ipv4.tcp_rmem = 30000000 30000000 30000000
net.ipv4.tcp_sack = 1
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_timestamps = 1
net.ipv4.tcp_wmem = 30000000 30000000 30000000
net.ipv4.tcp_early_retrans = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
net.ipv4.tcp_fin_timeout = 7
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_low_latency = 1
Please let me know what you think.
Thank you.
~Peter
Are you using the default make file settings, which compiles libuinet with no optimization at all, or have you modified them?
I have not yet modified them. I am going to try them next.
- I am going to try them with compiler optimization to see the result.
- I am going to run callgrind to see where the time is spent inside libuinet.
Can you also please suggest any other optimizations I can try to get better numbers? What about the equivalent sysctl changes?
Also, in the above program, my server runs with libuinet and netmap in a VM and my clients are on the base machine.
Thanks...
~Peter
Also, if I increase the client requests, I see accept failing. The numbers printed here are from my test program, which prints the number of connections handled in each one-second interval.
this 1 sec : connections 3379
this 1 sec : connections 3254
this 1 sec : connections 3140
this 1 sec : connections 3173
this 1 sec : connections 3232
this 1 sec : connections 3209
accept failed (53)
this 1 sec : connections 3028
accept failed (53)
accept failed (53)
this 1 sec : connections 3170
this 1 sec : connections 3074
accept failed (53)
accept failed (53)
this 1 sec : connections 3031
accept failed (53)
accept failed (53)
accept failed (53)
I think this is because pending connections are not being picked up fast enough. In this case, how can we increase the queue size?
I am no longer seeing the accept failed errors after I increased the value of MAXCON in sys/sys/socket.h.
But my CPS still does not go beyond 5K in a single-core VM.
Now I have compiled libuinet and the sample application with the -O3 flag and am running with nice (19). The maximum CPS I was able to achieve is 10K, which is still less than with the kernel-space TCP/IP application.
top output:
top - 05:34:28 up 38 min, 3 users, load average: 1.30, 1.16, 1.04
Threads: 104 total, 2 running, 102 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.3 us, 15.5 sy, 36.1 ni, 46.7 id, 0.0 wa, 0.0 hi, 1.4 si, 0.0 st
KiB Mem: 8177624 total, 1401420 used, 6776204 free, 17636 buffers
KiB Swap: 303100 total, 0 used, 303100 free, 141176 cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2205 root 39 19 1482m 769m 1896 S 36.5 9.6 3:12.49 nm_rx: netmap0
2206 root 39 19 1482m 769m 1896 S 10.0 9.6 0:49.71 server
2204 root 39 19 1482m 769m 1896 S 6.6 9.6 0:35.17 nm_tx: netmap0
My observation is that the nm_tx thread uses less CPU in the -O3 run, while the nm_rx thread does not.
I also printed various constants used in this program. I observed that even though the maxsockets number is good, the maxfiles number is very low.
uinet starting: cpus=1, nmbclusters=262144
callwheelsize=524288
callwheelsize=524288
link_elf_lookup_symbol: missing symbol hash table
link_elf_lookup_symbol: missing symbol hash table
UINET multiprocessor subsystem configured with 1 CPUs
Timecounters tick every 10.000 msec
maxusers=1
maxfiles=72
maxsockets=262144
nmbclusters=262144
tcp_recvspace=65536
tcp_finwait2_timeout=6000
tcp_fast_finwait2_recycle=0
tcp_recvspace=65536
configstr is eth1
netmap0: Ethernet address: 08:00:27:96:e4:88
Peter,
Thank you for all of the detailed information on what results you are getting and how you are getting them. I am really busy, but I am working my way towards reproducing what you are seeing and will get back to you.
In the meantime, one thing you could try if you are up to it is batch-processing accepts in accept_cb. You can look at accept_cb() in bin/passive.c for an example. If you ignore all of the references to peer sockets and connections there, I think the structure is pretty straightforward to transfer to your test program. Batching accepts should reduce the total event loop overhead under high connection rates.
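The batching idea above can be sketched in plain C. This is not libuinet's API and not the code in bin/passive.c: `try_accept_fn`, `drain_accepts`, `fake_queue`, and `fake_try_accept` are hypothetical names introduced only for illustration. The point is the loop shape: one callback invocation dequeues as many pending connections as are available, up to a cap, instead of one connection per event-loop wakeup.

```c
#include <stddef.h>

/* Hypothetical hook: tries to dequeue one pending connection.
 * Returns 1 if a connection was accepted and handled, 0 if the
 * accept queue is currently empty. In a real program this would
 * wrap the stack's non-blocking accept call. */
typedef int (*try_accept_fn)(void *ctx);

/* Drain up to max_batch pending connections in one callback
 * invocation and return how many were accepted. The cap keeps a
 * busy listener from starving other event sources. */
size_t drain_accepts(try_accept_fn try_accept, void *ctx, size_t max_batch)
{
    size_t accepted = 0;

    while (accepted < max_batch && try_accept(ctx))
        accepted++;
    return accepted;
}

/* Stand-in accept queue, used only to exercise the loop. */
struct fake_queue {
    size_t pending;
};

int fake_try_accept(void *ctx)
{
    struct fake_queue *q = ctx;

    if (q->pending == 0)
        return 0;
    q->pending--;
    return 1;
}
```

With 40 pending connections and a cap of 32, the first call to `drain_accepts` handles 32 and the next call handles the remaining 8. The cap is a tuning knob, not a value taken from libuinet.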
Hi Patrick,
Batch processing increased the number by around 500 to 1K. We also printed the batch sizes and saw that we were processing at most 30 to 40 accepts in one batch.
From the callgrind analysis we figured out that considerable time is spent when the server closes the connection. We thought we would let the server not close the connection immediately and see if the performance improves. But it looks like we can't keep more than 65K concurrent connections. I am not sure if this is limited by some constants/defines. Do you remember what we can change to increase the concurrent connection limit?
I have sent you the callgrind screenshot by email, which should help show where the time is being spent.
Please let us know.
Thanks, Peter
What command line are you using on the server side, and what are you using to drive traffic? The first thing I am thinking of given the apparent 65k limit is exhaustion of the 16-bit port space on the client side.
The only limit on the libuinet side should be the maximum number of sockets configured via the second parameter to uinet_init. This limit is really an upper bound of the size of the pool used for socket context - making it a huge number at init time will not result in any immediate additional allocation, it will just allow the pool to grow that large if required during operation. If the issue is that you are hitting the limit due to connections being in time-wait, increasing that parameter should relieve the issue.
libuinet has been tested with up to 1 million concurrent listen sockets plus 1 million concurrent active sockets, which requires a suitably large value for the second parameter to uinet_init(), and also a suitable multiplicity of available {server_IP, server_port, client_IP, client_port} tuples.
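The tuple arithmetic behind the apparent 65K ceiling can be worked through with a small helper. This is only a sketch: `max_tuples` is a hypothetical name, and the client port range of 1024 to 65535 is an assumption borrowed from the ip_local_port_range value quoted earlier in this thread, not a measured property of the client machine.

```c
#include <stdint.h>

/* Number of distinct {server_IP, server_port, client_IP, client_port}
 * tuples, which bounds the number of concurrent connections. */
uint64_t max_tuples(uint64_t server_ips, uint64_t server_ports,
                    uint64_t client_ips,
                    uint32_t client_port_lo, uint32_t client_port_hi)
{
    uint64_t client_ports;

    if (client_port_hi < client_port_lo)
        return 0;               /* empty port range */
    client_ports = (uint64_t)client_port_hi - client_port_lo + 1;
    return server_ips * server_ports * client_ips * client_ports;
}
```

Under those assumptions, one client address talking to one server address and port yields `max_tuples(1, 1, 1, 1024, 65535)`, which is 64512 tuples, in line with the roughly 65K limit observed. Reaching 1 million concurrent connections to a single server address and port would then need at least 16 client addresses (16 x 64512 = 1032192), or additional server ports or addresses.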
I was using one client machine, which I think is running out of ports. I will use multiple clients and let you know.
On the second front, max sockets is set to 262144. The other parameters are listed below.
Also, I have sent you the callgrind output by email.
uinet starting: cpus=1, nmbclusters=262144
callwheelsize=524288
callwheelsize=524288
link_elf_lookup_symbol: missing symbol hash table
link_elf_lookup_symbol: missing symbol hash table
UINET multiprocessor subsystem configured with 1 CPUs
Timecounters tick every 10.000 msec
maxusers=1
maxfiles=72
maxsockets=262144
nmbclusters=262144
tcp_recvspace=65536
tcp_finwait2_timeout=6000
tcp_fast_finwait2_recycle=0
tcp_recvspace=65536
configstr is eth1
netmap0: Ethernet address: 08:00:27:96:e4:88
OK. To answer an earlier question of yours regarding the small value of maxfiles, don't worry about that. The maxfiles parameter exists as part of the FreeBSD common kernel infrastructure that is in libuinet, but libuinet makes no use of it - there is no emulation or use of kernel file descriptors at all in libuinet.
Thanks. Please let me know what you find from the callgrind output. I am going to try to measure the CPS without closing the accepted connections (as "soclose" was taking a significant number of CPU cycles, as shown in the callgrind output). I am also going to replace arc4random with a simple static variable for random number generation to save the time spent in arc4random. With these two changes, let me see how much CPS I can get. I am just trying to figure out the places where we need to do some optimization.
I know you are busy with your presentation tomorrow, so please look at this when you have time. I will keep you updated on my progress.
Thanks...Peter
I tried to see what CPS I can achieve without closing the socket. It was around 18K connections per second. Compared with the open-and-close case, that is +7K sessions.
I would like to try disabling syncache. Could you please let me know if I can give that a try?
Thanks
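One way to read these numbers is as a per-connection time budget. The sketch below is back-of-the-envelope arithmetic only: the function names are hypothetical, the 11K figure for the open-and-close case is inferred from the "+7K" statement above, and the calculation treats the single core as fully occupied, which the top output earlier in the thread suggests may not hold.

```c
/* Average time spent per connection, in microseconds, on one core
 * that sustains the given connections-per-second rate. */
double usec_per_conn(double conns_per_sec)
{
    return 1e6 / conns_per_sec;
}

/* Extra per-connection cost implied by a drop from fast_cps to
 * slow_cps, in microseconds. */
double extra_usec_per_conn(double fast_cps, double slow_cps)
{
    return usec_per_conn(slow_cps) - usec_per_conn(fast_cps);
}
```

With 18000 CPS without close and an assumed 11000 CPS with close, `extra_usec_per_conn(18000.0, 11000.0)` comes to roughly 35 microseconds of additional work per connection attributable to the close path.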
I am getting closer to the point where I can spend a little time digging into this. It is interesting that the close reduces performance so significantly. Until I can reproduce this on my end and have something more concrete to comment on, here are a couple of things that I think frame the issue:
It is a known issue that FreeBSD performance is currently lagging in the area of short-lived connections - see http://www.freebsd.org/cgi/query-pr.cgi?pr=183659. This doesn't mean further tuning and application-side work won't improve the numbers you are seeing, but I think it does set expectations for how high the numbers might go.
libuinet itself is just entering the phase where performance will be analyzed and improved. One of the things that really needs to happen ahead of this work is updating the stack sources libuinet is using to something considerably more recent than the 9.1-RELEASE version it currently uses. Not only do we want to avoid measuring and 'fixing' issues that no longer exist due to subsequent improvements in the main line sources, but in cases where the libuinet work indicates there could be general improvements made to the stack itself, we want to avoid the work of then reproducing the issue with more current sources and developing equivalent patches for submission.
Thank you.
Please let me know once you finish the integration. I can do the testing for you and help you identify any rough spots. I have also integrated a small web server with libuinet to measure RPS and CPS, and I have a KVM VM handy for measuring performance.
Looking forward to hearing from you.
Is there a time frame for the migration away from 9.1-RELEASE?
Also, I wonder if the user land stack will lose any benefits from checksum offloading which the kernel stack running on a physical box can enjoy (I understand petergsnm's tests were done on KVM).
See issue #11 for information on libuinet's current inability to make use of checksum offloading due to deficiencies of netmap. And yes, if running in a VM or on hardware which doesn't preserve or provide checksum offloading, then the stack will need to compute the checksums itself.
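For a sense of the work that falls back on the CPU without offload, here is a minimal sketch of the standard Internet checksum described in RFC 1071. It is not the routine used inside libuinet or FreeBSD, which use optimized implementations; `inet_checksum` is a name chosen here for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Internet checksum (RFC 1071): one's-complement sum of 16-bit
 * big-endian words, folded to 16 bits and then complemented.
 * The result is the numeric value of the checksum field; store it
 * high byte first when writing it into a header. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint64_t sum = 0;

    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)               /* odd trailing byte, padded with zero */
        sum += (uint32_t)data[0] << 8;

    while (sum >> 16)           /* fold carries into the low 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For the example bytes in RFC 1071 (00 01 f2 03 f4 f5 f6 f7) the folded sum is 0xddf2, so the function returns 0x220d. Every byte of every packet passes through a loop like this when no offload is available, which is why the offload question matters for throughput.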
I can't compile it on Linux!
Should I set any environment variables?
The errors are as follows:
uinet_if_netmap_host.c:331:71: error: ‘struct ifreq’ declared inside parameter list [-Werror]
uinet_if_netmap_host.c:331:71: error: its scope is only this definition or declaration, which is probably not what you want [-Werror]
uinet_if_netmap_host.c: In function ‘if_netmap_ethtool_set_flag’:
uinet_if_netmap_host.c:335:5: error: dereferencing pointer to incomplete type
uinet_if_netmap_host.c: At top level:
uinet_if_netmap_host.c:364:75: error: ‘struct ifreq’ declared inside parameter list [-Werror]
uinet_if_netmap_host.c: In function ‘if_netmap_ethtool_set_discrete’:
uinet_if_netmap_host.c:368:5: error: dereferencing pointer to incomplete type
uinet_if_netmap_host.c: In function ‘if_netmap_set_offload’:
uinet_if_netmap_host.c:396:15: error: storage size of ‘ifr’ isn’t known
uinet_if_netmap_host.c:396:15: error: unused variable ‘ifr’ [-Werror=unused-variable]
uinet_if_netmap_host.c: In function ‘if_netmap_set_promisc’:
uinet_if_netmap_host.c:447:15: error: storage size of ‘ifr’ isn’t known
uinet_if_netmap_host.c:469:19: error: ‘IFF_PROMISC’ undeclared (first use in this function)
uinet_if_netmap_host.c:469:19: note: each undeclared identifier is reported only once for each function it appears in
uinet_if_netmap_host.c:447:15: error: unused variable ‘ifr’ [-Werror=unused-variable]
Please don't piggyback on existing unrelated issues.
Open a new issue for this and include necessary context for interpreting your problem, such as the specific Linux distribution and version you are using, whether you are using something other than the stock compiler for that distribution, the command you executed and the directory you executed it in.