gplv2/haproxy-postgresql

could not receive data from client: Connection reset by peer

Closed this issue · 12 comments

Hi
The haproxy configuration is working very well, but on my standby node I get this in the log every 5 seconds: pgc@template1 LOG: could not receive data from client: Connection reset by peer. It's probably because the termination packet is not sent to the standby node.

Do you know a workaround to get rid of this?

Thank you

gplv2 commented

Hi,

I know it's working well (in production here too) :)

But that one has been annoying me too, for months now. I've made a few attempts to find the cause, including sniffing network traffic and double-checking the check code against the pgsql protocol specification.

AFAIK, the termination packet IS sent to the nodes (send-binary 58). You can see this using wireshark/tcpdump. What I do find interesting about your case: in mine I have it on both nodes, not only on the slave. I see it on the master and the slave on several installations. This is intriguing.

Currently this is just an annoyance, but I hate having the logs filled up with this. What I suspect, since my last attempt to find the cause, is that the haproxy check doesn't "read" the whole network packet that comes back from the pg_is_in_recovery query, so there might be bytes left in a buffer somewhere.
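For context (my summary of the v3 wire protocol, not from the thread): a simple query gets several messages back from the server, so a check that only matches the first bytes can leave the rest unread in the socket buffer. A rough sketch:

```python
# PostgreSQL v3 wire protocol: message-type bytes the server sends in
# response to a simple Query ('Q') message, in order.
RESPONSE_SEQUENCE = [
    (b"T", "RowDescription"),   # column metadata for the result set
    (b"D", "DataRow"),          # the row carrying 't' or 'f' from pg_is_in_recovery()
    (b"C", "CommandComplete"),  # command tag, e.g. "SELECT 1"
    (b"Z", "ReadyForQuery"),    # server is idle again
]

# The tcp-check only matches bytes up to the DataRow payload; anything
# after it (CommandComplete, ReadyForQuery) can remain unread when
# haproxy closes the connection.
for type_byte, name in RESPONSE_SEQUENCE:
    print(f"{type_byte.decode()} (0x{type_byte[0]:02x}): {name}")
```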

Is it possible to record a pcap file with wireshark on your slave node and attach it to this issue so I can download it? Wireshark understands the pgsql protocol.

Could you also give me your version numbers (PostgreSQL server) and a bit of background information on your architecture in relation to haproxy?

Thanks for the feedback.

gplv2 commented

The way I use it

Image of HA with HA

Hi,
I'm on Debian 9, HAProxy 1.8.14, PostgreSQL 10.6, repmgr 4.2. I'm also using it the same way as you (without pgbouncer, but that's the next step).

pcap.zip

Thank you

I also noticed that:
haproxy_stats

I don't understand why it failed at step 17. Isn't it supposed to fail at step 31, with a "t" instead of an "f"?

gplv2 commented

That 'T' is normal; it's explained in the example config. It's the result of:

select pg_is_in_recovery()

"f" means node in master mode
"t" means node in standby mode (read-only)

The idea is to exclude any node that is running as a slave:

tcp-check expect binary 66                     # column data, "f"

So if a server is in recovery, it will return t(rue), and hence you exclude it.

If you want to do the reverse, e.g. load-balance between multiple slaves, you should change the test to:

tcp-check expect binary 74                     # column data, "t"           (  1 byte  )

And remove the ACL nbsrv(backend_pg) eq 1; that way you can balance between multiple slaves (on a second backend, of course).
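For anyone wondering where the hex values 66 and 74 come from (my note, not from the thread): they are simply the ASCII codes of the single character pg_is_in_recovery() returns in the DataRow.

```python
# The 'expect binary' hex values are the ASCII codes of the query result:
#   'f' -> not in recovery (master), 't' -> in recovery (standby).
master_byte = format(ord("f"), "02x")
standby_byte = format(ord("t"), "02x")

print("master  -> expect binary", master_byte)   # 66
print("standby -> expect binary", standby_byte)  # 74
```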

I understand that! Maybe I didn't explain correctly what I meant. Step 17 of the tcp-check expects a binary value of 54 (in text, an uppercase 'T'):

tcp-check expect binary 54 # row description packet

haproxy is expecting to receive a 'T' (binary 54) as the row description packet, but receives something else (or nothing, because it times out).

Is that explained more clearly? English is not my first language.

gplv2 commented

Oh, you mean this :

tcp-check expect binary 54 # row description packet (1 byte)

the row description byte.

This could very well be a change in the protocol. I have only tested this against PostgreSQL 9.5, and you're on PostgreSQL 10.6, I read. I wonder if you could test this against a 9.5.

I did misunderstand all that, indeed. This has nothing to do with the lowercase t/f after the query; it's before it.
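To put the misunderstanding in byte terms (a sanity check of mine, not from the thread): 0x54 is the uppercase 'T' that tags the RowDescription message at the start of the server's response, while 0x74 is the lowercase 't' carried inside the later DataRow message.

```python
# Two different bytes that are easy to confuse in this check:
row_description_tag = 0x54   # 'T': message-type byte of RowDescription
standby_column_data = 0x74   # 't': column value inside a DataRow message

assert chr(row_description_tag) == "T"
assert chr(standby_column_data) == "t"
print("RowDescription tag:", hex(row_description_tag))
print("DataRow 't' value: ", hex(standby_column_data))
```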

PS: I'm also a non-native English speaker.

After analysing the traffic on both nodes, I came to the conclusion that in my case the termination packet is not sent to the slave node.

So I moved this to right after the "select pg_is_in_recovery();" query:

# write: terminate session
tcp-check send-binary 58                       # Termination packet
tcp-check send-binary 00000004                 # packet length: 4 (no body)

Now it looks like this:
...

# write: run simple query
# "select pg_is_in_recovery();"
#
tcp-check send-binary 51                       # simple query               (  1 byte  )
tcp-check send-binary 00000020                 # packet length: 32          (  4 bytes )
tcp-check send-binary 73656c65637420           # "select "                  (  7 bytes )
tcp-check send-binary 70675f69735f696e5f7265636f7665727928293b
                                               # "pg_is_in_recovery();"     ( 20 bytes )
tcp-check send-binary 00                       # terminator                 (  1 byte  )
                                               ## TOTAL                     ( 32 bytes )

# write: terminate session
tcp-check send-binary 58                       # Termination packet
tcp-check send-binary 00000004                 # packet length: 4 (no body)

# expect: Row description packet
#
tcp-check expect binary 54                     # row description packet

...
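For anyone auditing the hex by hand, the framing above can be reproduced in a few lines of Python. This is just a sketch of the v3 frontend message format (type byte, then a 4-byte big-endian length that counts itself plus the body), not part of the config:

```python
import struct

def pg_message(type_byte: bytes, body: bytes) -> bytes:
    """Frame a PostgreSQL v3 frontend message: type byte, then a
    4-byte big-endian length covering the length field and the body."""
    return type_byte + struct.pack("!I", 4 + len(body)) + body

# Simple Query message for the health check (query text + NUL terminator).
query = pg_message(b"Q", b"select pg_is_in_recovery();" + b"\x00")

# Terminate message: type 'X', length 4, no body.
terminate = pg_message(b"X", b"")

# The framed bytes match the send-binary hex in the config above:
# 4 (length field) + 27 (query text) + 1 (NUL) = 32 = 0x20.
assert query.hex() == (
    "51"                                        # 'Q': simple query
    + "00000020"                                # packet length: 32
    + b"select pg_is_in_recovery();".hex()      # query text
    + "00"                                      # terminator
)
assert terminate.hex() == "5800000004"
print("query:    ", query.hex())
print("terminate:", terminate.hex())
```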
The haproxy stats page now shows the slave node failing at step 33 (which is the right step, instead of 17).
step33

I will test it this way for a couple of days in my dev environment. Let me know your impression of this modification.

I forgot to mention that the message pgc@template1 LOG: could not receive data from client: Connection reset by peer is no longer logged on my slave node with the modification above.

gplv2 commented

Man, this is awesome work. Thanks. I'm going to audit it here too. Great stuff. Sorry for the delay; I don't seem to get notified of your comments. If you want to mail me directly, do so at glenn_AT_bitless_DOT_be.

gplv2 commented

Thanks @ryno83, I pushed your changes to the master branch. This effectively also solves the problem I had: both the haproxy check and the connection-reset message in the logs are fixed by this. Appreciate your contributions.

I think we can consider closing this; let me know if you agree.

I agree! Thank you