hreinhardt/amqp

Bug in `amqp-0.5`

Opened this issue · 13 comments

When I upgrade my code from 0.4.2 to 0.5 I get the following mysterious error when connecting to a message queue:

Data.Binary.Get.runGet at position 0: demandInput: not enough bytes

I'm limited for time at the moment, so I haven't had time to narrow down this test case from within the application that this occurred in, but my best guess from studying the diff between 0.4.2 and 0.5 is that this exception probably originated in your internal readFrame function.

I think it might be the call to peekFrameSize in readFrame. Could you check if we're passing 7 bytes into peekFrameSize?

Your intuition was right: the dat value returned by BL.hGet handle 7 ends up being Empty, so peekFrameSize receives an empty ByteString/

According to the documentation 0 should only be returned on EOF. Can we rule out that the socket has been closed? You may want to look at the RabbitMQ log file.

Yes, there is an error in the RabbitMQ log file each time this happens:

=INFO REPORT==== 3-Sep-2013::12:56:08 ===
starting TCP connection <0.14297.3> from <removed>

=ERROR REPORT==== 3-Sep-2013::12:56:18 ===
exception on TCP connection <0.14297.3> from <removed>
{handshake_timeout,handshake}

=INFO REPORT==== 3-Sep-2013::12:56:18 ===
closing TCP connection <0.14297.3> from <removed>

I also forgot to mention that there was in fact a long delay before the exception occurred (~10 seconds, as you can see from the log). When I use amqp-0.4.2 the connection succeeds instantly. So it seems that something about amqp-0.5 is triggering a timeout in the hand shake which in turn causes hGet to return an empty string.

That's very strange. You may want to add some print statements to openConnection' to figure out where it fails.

The most likely explanation is that we need to call hFlush after every write to the handle. But that makes me wonder why it works on my computer.

It may be necessary to use flush, at least some clients do but I also suspect the issue may be with lazy evaluation.

@Gabriel439 the error message means that RabbitMQ did not receive initial protocol handshake (which is 7 bytes, AMQP\x00\x00\x09\x01). Please try recording what's coming from the client to the server using Wireshark.

I will do that. Just give me until Thursday as I'm busy through tomorrow.

FWIW, I can reproduce the exception by supplying a non-existing user/password combination. Sticking

closed <- hIsEOF handle
when closed $ error "connection is closed"

at the start of readFrame yields the "connection is closed" exception, but still only after a timeout of roughly 10 seconds.

I added some calls to hFlush in a patch to the master branch. Can you try if that removes the 10 second delay?

Hm, just to be sure, I did a couple of runs with and without the call to hFlush and it's now more or less consistently around 3-4 seconds. The other attempt might have been an outlier.

Looking at the server log, the connection is obviously properly closed with the wrong, compared to @Gabriel439's handshake_timeout, so in hindsight, my claim to being able to reproduce his error is wrong. Apologies.

In any case, I think it'd be better if a closed connection wouldn't manifest in error at runGet.

I agree. The exception handling needs a lot of improvement.