zeromq/clrzmq

CPU usage for ZmqSocket.Receive: 9.8% libzmq (C) usage vs 80% clrzmq (C#)

Closed this issue · 18 comments

Hello,
I am running a performance test on our application, which uses clrzmq. It seems ZmqSocket.Send and Receive are very expensive compared to libzmq.dll.

Below is the breakdown for the ZmqSocket.Receive method:

SpinWait: 17.1%
Stopwatch.GetElapsedDateTimeTicks: 4.1%
Stopwatch.StartNew: 2.4%
Receive: 73.3%

CPU usage for the Receive call (73.3%):
ErrorProxy.get_ShouldTryAgain: 5.1%
SocketProxy.Receive: 64.4%

CPU usage for SocketProxy.Receive():
DisposableIntPtr.Dispose: 11.1%
ZmqMsgT.Init: 7.1%
ZmqMsgT.Close: 5.8%
SocketProxy.RetryIfInterrupted: 20.8%

CPU usage for SocketProxy.RetryIfInterrupted:
SocketProxy+<>c__DisplayClassa: 14.4%, of which 9.8% is LibZmq.ZmqMsgRecvProc.Invoke

I have a graph of these calls (a PNG file) if anyone is interested; there is no option to attach the file here.

I have forked clrzmq at https://github.com/rohitjoshi/clrzmq and modified the code to use the zmq_send and zmq_recv functions to send/receive raw byte buffers, which has improved performance drastically. See the screenshot below.

https://docs.google.com/open?id=0Bz_UeFnokXJVekI4OEhNN3pUR1E

I would prefer that we merge this so we don't have to maintain two repositories.

jgoz commented

@rohitjoshi

Sorry for the long delay on this. I did not receive any email notifications from GitHub when you created this issue.

It appears that you have removed your fork - are you still experiencing the CPU usage issues you outlined above? If so, I would encourage you to open a pull request with your fix so that we can review it and merge it in, if appropriate.

Once again, sorry for the delay.

Here is the forked repo: https://github.com/rohitjoshi/clrzmq-optimized

The page below explains the difference in CPU utilization:
https://github.com/rohitjoshi/clrzmq-optimized/wiki

The problem with using zmq messages is that they require too many P/Invoke calls, so I replaced them with functions that take a raw buffer.
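
For illustration, the raw-buffer path boils down to a single P/Invoke per call, whereas the zmq_msg_t path needs an init/receive/copy/close round trip each time. A minimal sketch, assuming the libzmq 3.x C API (where zmq_send/zmq_recv accept a caller-supplied buffer); the wrapper class and method names are made up:

using System;
using System.Runtime.InteropServices;

internal static class RawBufferInterop
{
    // libzmq 3.x: int zmq_recv(void *socket, void *buf, size_t len, int flags);
    [DllImport("libzmq", CallingConvention = CallingConvention.Cdecl)]
    internal static extern int zmq_recv(IntPtr socket, byte[] buf, UIntPtr len, int flags);

    // libzmq 3.x: int zmq_send(void *socket, const void *buf, size_t len, int flags);
    [DllImport("libzmq", CallingConvention = CallingConvention.Cdecl)]
    internal static extern int zmq_send(IntPtr socket, byte[] buf, UIntPtr len, int flags);

    // One managed-to-native transition per receive; no zmq_msg_t to initialize,
    // copy out of, and close on every call.
    internal static int ReceiveIntoBuffer(IntPtr socket, byte[] buffer, int flags)
    {
        return zmq_recv(socket, buffer, (UIntPtr)buffer.Length, flags);
    }
}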

jgoz commented

I really like the approach you have taken. I considered something similar in the past, but dropped it to simplify the implementation. Now that you've demonstrated the performance benefits, I would be happy to merge your changes into the master branch.

For messages longer than 8192 bytes, instead of splitting them into multiple parts I will fall back to a zmq_msg_t-based approach. This will result in more predictable behaviour from the client's perspective. It would be trivial for users to write ZmqSocket extension methods to automatically split long messages into multiple parts.

What are your thoughts on that?
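
For illustration, such a splitting extension method might look something like the sketch below. It assumes the Send(byte[] buffer, int size, SocketFlags flags) overload discussed later in this thread, a SocketFlags.SendMore flag for multi-part sends, and the ZeroMQ namespace; the method name is made up:

using System;
using ZeroMQ; // namespace assumed

public static class ZmqSocketExtensions
{
    // Splits a long message into chunks of at most chunkSize bytes and sends
    // them as one multi-part message (all but the last part flagged SendMore).
    public static void SendChunked(this ZmqSocket socket, byte[] message, int chunkSize)
    {
        for (int offset = 0; offset < message.Length; offset += chunkSize)
        {
            int size = Math.Min(chunkSize, message.Length - offset);
            var chunk = new byte[size];
            Buffer.BlockCopy(message, offset, chunk, 0, size);

            bool isLast = offset + size >= message.Length;
            socket.Send(chunk, size, isLast ? SocketFlags.None : SocketFlags.SendMore);
        }
    }
}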

jgoz commented

On second thought, my approach described above will not work. It is impossible to know ahead of time on the Receive side whether to receive a buffer or a message.

I think the best compromise is to expose SendBuffer/ReceiveBuffer methods that use the corresponding buffer-related functions directly. That way, client applications can still take advantage of the performance benefits on low-powered devices while the binding layer preserves compatibility with other implementations.

jgoz commented

@rohitjoshi Ok, I think I came up with a solution that will work for low powered devices while preserving the expected behaviour for Send & Receive methods. It also avoids new Send/Receive overloads.

If you use the Send(byte[] buffer, int size, SocketFlags flags) or Receive(byte[] buffer, SocketFlags flags) methods, it will use the corresponding zmq_buffer_ method internally if the supplied buffer size is less than or equal to ZmqSocket.MaxBufferSize (8192). If the supplied buffer is larger than this, it will default to the regular zmq_msg based methods.
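
For illustration, calling code stays the same either way; only the internal path changes with the buffer size. A fragment in the style of the snippets later in this thread, assuming an already-connected ZmqSocket named socket and a SocketFlags.None value:

var small = new byte[1024];            // <= MaxBufferSize: raw-buffer path
socket.Send(small, small.Length, SocketFlags.None);

var large = new byte[64 * 1024];       // > MaxBufferSize: zmq_msg-based fallback
socket.Send(large, large.Length, SocketFlags.None);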

Please test this on your system and let me know if you see the same performance benefits. If not, re-open this issue and we'll come up with a better solution.

I am glad you included the requested performance changes, as I don't want to fork/maintain a version just for my company. I will try this out and let you know, most probably next week.

I may have opened a duplicate, but using SocketFlags.DontWait with the out int size overload results in CPU mayhem. I understand there is a precondition around max buffer size, but why did the old bindings not have the CPU symptoms for the equivalent code?

// Ruh-oh
int size;
var buffer = new byte[0];
while ((buffer = _socket.Receive(buffer, SocketFlags.DontWait, out size)) == null && !_stop)
{
    Thread.Sleep(10);
}
return buffer;

// All good
byte[] buffer;
while ((buffer = _socket.Recv(SendRecvOpt.NOBLOCK)) == null && !_stop)
{
    Thread.Sleep(10);
}
return buffer;

jgoz commented

@danielcrenna Should the All good and Ruh-oh comments be swapped? Your comment implies the opposite of the code blocks.

Also, are you testing both examples with the same version of libzmq (2.x)? Both versions result in a direct zmq_recv[_msg] call (pretty much).

Also, what happens if you pass the 10 millisecond timeout to the receive method instead of calling Thread.Sleep?

E.g.

int size;
var buffer = new byte[0];
do
{
    buffer = _socket.Receive(buffer, TimeSpan.FromMilliseconds(10), out size);
} while (buffer == null && !_stop);

You're right about the comments; I swapped them. I have confirmed that the same libzmq version is being used for both.

Using your overload, CPU usage decreased to ~50%, which is still heavy utilization for code that is accepting no messages, but it's a starting point for further testing.

jgoz commented

Cool, thanks. Is this with the latest clrzmq build from the master branch?

Yes, it is. I cloned it into our repo.

Worth considering: the performance difference between DontWait and blocking is large enough that it might be worth just cooking up a "stop" message rather than trying to wrestle with it, since the blocking version consumes almost no resources.
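
For illustration, the stop-message idea could look something like the fragment below, in the style of the loops above. SocketFlags.None is assumed to be the blocking default, an empty frame is used as the stop sentinel, and Process is a hypothetical handler:

int size;
var buffer = new byte[0];
while (true)
{
    // Blocking receive: near-zero CPU while idle, unlike a DontWait polling loop.
    buffer = _socket.Receive(buffer, SocketFlags.None, out size);

    // A peer (or a local inproc control socket) sends an empty frame to request shutdown.
    if (size == 0)
        break;

    Process(buffer, size); // hypothetical message handler
}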

jgoz commented

K. I will investigate it. Off the top of my head, it might be due to extra zmq_msg_t initializations. clrzmq 2.x used a single message object per socket whereas the new version allocates a new message for each send/receive operation.

Can you try one last rewrite?

var buffer = new byte[ZmqSocket.MaxBufferSize];
int receivedBytes;

do
{
    receivedBytes = socket.Receive(buffer, TimeSpan.FromMilliseconds(10));
} while (receivedBytes < 0 && !_stop);

This uses zmq_recv under the hood, which doesn't involve allocating zmq_msg_t structures. The downside is that anything beyond ZmqSocket.MaxBufferSize (8192) bytes will get truncated, but it results in fewer P/Invoke calls and, thus, lower CPU usage.

I will try that, but I actually figured out that the problem, for me, is that SocketProxy is giving back an empty byte[] no matter what, and my repurposed loop code was checking for a null byte[] buffer. The end result is a tight loop that is the cause of the CPU churn, not the library or the bindings.

Basically, between the two bindings, the behavior has changed such that I used to expect a null buffer, and now I get an empty one, if that makes sense. Up to you if you want to call attention to the subtlety or not.

Snippets for posterity:

// New bindings always return a buffer, so use a length check instead or you'll get a tight loop
int size;
var buffer = new byte[0];
while ((buffer = _socket.Receive(buffer, SocketFlags.DontWait, out size)).Length == 0 && !_stop)
{
    Thread.Sleep(10);
}
return buffer;

// Old bindings send back a null byte[] if nothing is received
byte[] buffer;
while ((buffer = _socket.Recv(SendRecvOpt.NOBLOCK)) == null && !_stop)
{
    Thread.Sleep(10);
}
return buffer;

I am evaluating whether to adopt ZMQ (or XS), and by extension CLRZMQ for the .NET/C# world, which must also cross-compile to Arch Linux ARM.

Re: the "Ruh-oh" comments above, how does something like returning a byte[] together with an out size get overlooked? That seems fundamental enough for any C#/.NET library API. It's completely unnecessary, IMO, when callers can simply check buffer == null || buffer.Length == 0, or conversely buffer != null && buffer.Length > 0, and so on. And if buffer nullness is a concern, write yourself an extension method such as public static T[] ToSelfOrArray<T>(this T[] arr) { return arr ?? new T[0]; } to help guarantee the buffer has something after the call. Simple.

I also tend to prefer IEnumerable<T> to T[] anyway; it extends better into LINQ providers, whereas straight C++-style arrays tend to lead to stronger coupling downstream.
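
For reference, a compilable form of the helper sketched in that last comment (the class name is arbitrary):

public static class ArrayExtensions
{
    // Returns the array itself, or an empty array when it is null, so callers
    // can check buffer.Length after a receive without a separate null check.
    public static T[] ToSelfOrArray<T>(this T[] arr)
    {
        return arr ?? new T[0];
    }
}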