sysown/proxysql

proxysql sidecar and litmus chaos problem

Opened this issue · 2 comments

  • A clear description of the issue

I have a pod with multiple containers. One of them is ProxySQL, and the other is, for example, the otel-agent sidecar.

When I run a Litmus chaos experiment that is supposed to corrupt the connection from the otel-agent to the otel-collector (an external service), the connection from ProxySQL to the database is also corrupted.

The Litmus experiment executes these commands:

sudo nsenter -t 1217772 -n tc qdisc replace dev eth0 root handle 1: prio
sudo nsenter -t 1217772 -n tc qdisc replace dev eth0 parent 1:3 netem loss 100
sudo nsenter -t 1217772 -n tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 10.0.30.140 flowid 1:3
sudo nsenter -t 1217772 -n tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dport 4318 0xffff flowid 1:3
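Worth noting: containers in the same pod share one network namespace, so a qdisc installed on eth0 applies to traffic from every container in the pod, not just the otel-agent. Also, the prio qdisc has a default priomap that can steer packets into band 1:3 based on their TOS/DSCP marking, independently of the u32 filters. A quick sketch to check where traffic is actually landing (reusing the PID from the commands above, which is specific to this report):

```shell
# Show per-qdisc statistics; a packet counter on the netem qdisc (band 1:3)
# growing beyond the filtered otel traffic volume would suggest other flows
# are being steered into the lossy band as well.
sudo nsenter -t 1217772 -n tc -s qdisc show dev eth0

# List the filters attached to the prio qdisc to confirm that only the two
# intended u32 matches (dst 10.0.30.140 and dport 4318) are present.
sudo nsenter -t 1217772 -n tc filter show dev eth0 parent 1:0
```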

The IP 10.0.30.140 is the otel-collector service.
I verified that the database is reachable from another sidecar, but ProxySQL generates these errors during the experiment:

2024-07-10 07:55:56 MySQL_Monitor.cpp:2747:monitor_ping(): [ERROR] Server xxx.mysql.database.azure.com:3306 missed 3 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.
2024-07-10 07:55:57 mysql_connection.cpp:1101:handler(): [ERROR] Connect timeout on mysql-ne-all-dev-web-eoc-mysql-master.mysql.database.azure.com:3306 : exceeded by 347us
2024-07-10 07:55:57 MySQL_HostGroups_Manager.cpp:2988:get_random_MySrvC(): [ERROR] Hostgroup 0 has no servers available! Checking servers shunned for more than 1 second
2024-07-10 07:55:59 MySQL_HostGroups_Manager.cpp:2988:get_random_MySrvC(): [ERROR] Hostgroup 0 has no servers available! Checking servers shunned for more than 1 second
2024-07-10 07:55:59 mysql_connection.cpp:1101:handler(): [ERROR] Connect timeout on mysql-ne-all-dev-web-eoc-mysql-master.mysql.database.azure.com:3306 : exceeded by 10us
2024-07-10 07:56:00 mysql_connection.cpp:1063:handler(): [ERROR] Failed to mysql_real_connect() on 0:xxx.mysql.database.azure.com:3306 , FD (Conn:35 , MyDS:35) , 2026: Unknown SSL error.
2024-07-10 07:56:00 MySQL_Session.cpp:2781:handler_again___status_CONNECTING_SERVER(): [ERROR] Max connect timeout reached while reaching hostgroup 0 after 10656ms . HG status: [{"Bytes_recv":"5827608","Bytes_sent":"22393908","ConnERR":"429","ConnFree":"0","ConnOK":"7","ConnUsed":"1","Latency_us":"211407","MaxConnUsed":"4","Queries":"18298","Queries_GTID_sync":"0","hostgroup":"0","srv_host":"mysql-ne-all-dev-web-eoc-mysql-master.mysql.database.azure.com","srv_port":"3306","status":"ONLINE"}]
2024-07-10 07:56:00 mysql_connection.cpp:1063:handler(): [ERROR] Failed to mysql_real_connect() on 0:xxx.mysql.database.azure.com:3306 , FD (Conn:38 , MyDS:38) , 2026: Unknown SSL error.
  • ProxySQL version
    2.4.4-44-g3b13c7c

  • OS version
    ubuntu:focal-20240216

Thank you!

Hi Jan.

Thank you for the report.
Can you please provide the full error log?
From the snippet it seems that connections are timing out.
What is your expected result?

My understanding is that netem loss 100 causes 100% packet loss.
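For comparison, a minimal netem sketch (the interface name eth0 is an assumption) that drops all egress traffic on the interface, rather than only a filtered band as in the Litmus commands:

```shell
# Attach netem as the root qdisc: drops 100% of egress packets on eth0,
# affecting every container sharing this network namespace.
sudo tc qdisc add dev eth0 root netem loss 100%

# Restore normal traffic by removing the qdisc.
sudo tc qdisc del dev eth0 root
```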

Thank you for your reply.

If I understand the netem commands correctly, the loss 100 is applied only to traffic matching the filters: destination IP 10.0.30.140 (the third command) and destination port 4318 (the fourth command).

I also tested the database connection from another container in the same pod, and it works. The interruption affects only the ProxySQL sidecar container.

I attached the full log.
error.log