ProxySQL sidecar and Litmus chaos problem
- A clear description of the issue
I have a pod with multiple containers. One of them is ProxySQL, and the other is, for example, the otel-agent sidecar.
When I run a Litmus chaos experiment that is supposed to corrupt the connection from the otel-agent to the otel-collector (an external service), the connection from ProxySQL to the database is also corrupted.
The Litmus experiment executes these commands:
sudo nsenter -t 1217772 -n tc qdisc replace dev eth0 root handle 1: prio
sudo nsenter -t 1217772 -n tc qdisc replace dev eth0 parent 1:3 netem loss 100
sudo nsenter -t 1217772 -n tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst 10.0.30.140 flowid 1:3
sudo nsenter -t 1217772 -n tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dport 4318 0xffff flowid 1:3
The IP 10.0.30.140 is the otel-collector service.
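For reference, the two u32 filters above can be modeled as simple predicates. A minimal Python sketch (the otel-collector IP and port are from the commands above; the MySQL destination IP is hypothetical) shows that ProxySQL's traffic to port 3306 matches neither filter, which is what makes the observed loss surprising:

```python
# Minimal model of the two u32 filters from the Litmus experiment.
# A packet is steered to class 1:3 (the netem 100% loss band) if
# either filter matches.

OTEL_COLLECTOR_IP = "10.0.30.140"   # "match ip dst" filter (third command)
OTEL_HTTP_PORT = 4318               # "match ip dport" filter (fourth command)

def goes_to_lossy_band(dst_ip: str, dst_port: int) -> bool:
    """Return True if a packet would be classified into flowid 1:3."""
    return dst_ip == OTEL_COLLECTOR_IP or dst_port == OTEL_HTTP_PORT

# otel-agent -> otel-collector: matched, so 100% loss is expected.
print(goes_to_lossy_band("10.0.30.140", 4318))   # True

# ProxySQL -> MySQL on port 3306 (hypothetical server IP):
# matches neither filter, yet the connection is still disrupted.
print(goes_to_lossy_band("10.0.40.7", 3306))     # False
```

One caveat worth checking: filters are not the only path into band 1:3. The prio qdisc classifies packets that match no filter by their IP TOS/DSCP value via its default priomap, and some TOS values map to band 3, so unmatched traffic can still receive the 100% loss.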
I verified that the connection to the database from the other sidecar is possible, but ProxySQL generates these errors during the experiment:
2024-07-10 07:55:56 MySQL_Monitor.cpp:2747:monitor_ping(): [ERROR] Server xxx.mysql.database.azure.com:3306 missed 3 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.
2024-07-10 07:55:57 mysql_connection.cpp:1101:handler(): [ERROR] Connect timeout on mysql-ne-all-dev-web-eoc-mysql-master.mysql.database.azure.com:3306 : exceeded by 347us
2024-07-10 07:55:57 MySQL_HostGroups_Manager.cpp:2988:get_random_MySrvC(): [ERROR] Hostgroup 0 has no servers available! Checking servers shunned for more than 1 second
2024-07-10 07:55:59 MySQL_HostGroups_Manager.cpp:2988:get_random_MySrvC(): [ERROR] Hostgroup 0 has no servers available! Checking servers shunned for more than 1 second
2024-07-10 07:55:59 mysql_connection.cpp:1101:handler(): [ERROR] Connect timeout on mysql-ne-all-dev-web-eoc-mysql-master.mysql.database.azure.com:3306 : exceeded by 10us
2024-07-10 07:56:00 mysql_connection.cpp:1063:handler(): [ERROR] Failed to mysql_real_connect() on 0:xxx.mysql.database.azure.com:3306 , FD (Conn:35 , MyDS:35) , 2026: Unknown SSL error.
2024-07-10 07:56:00 MySQL_Session.cpp:2781:handler_again___status_CONNECTING_SERVER(): [ERROR] Max connect timeout reached while reaching hostgroup 0 after 10656ms . HG status: [{"Bytes_recv":"5827608","Bytes_sent":"22393908","ConnERR":"429","ConnFree":"0","ConnOK":"7","ConnUsed":"1","Latency_us":"211407","MaxConnUsed":"4","Queries":"18298","Queries_GTID_sync":"0","hostgroup":"0","srv_host":"mysql-ne-all-dev-web-eoc-mysql-master.mysql.database.azure.com","srv_port":"3306","status":"ONLINE"}]
2024-07-10 07:56:00 mysql_connection.cpp:1063:handler(): [ERROR] Failed to mysql_real_connect() on 0:xxx.mysql.database.azure.com:3306 , FD (Conn:38 , MyDS:38) , 2026: Unknown SSL error.
- ProxySQL version
2.4.4-44-g3b13c7c
- OS version
ubuntu:focal-20240216
Thank you!
Hi Jan.
Thank you for the report.
Can you please provide the full error log?
From the snippet it seems that connections are timing out.
What is your expected result?
My understanding is that netem loss 100 causes 100% packet loss.
Thank you for your reply.
If I understand the tc commands correctly, the loss 100 is applied only to traffic matching ip dst 10.0.30.140 or dport 4318 (the third and fourth commands).
I also tested the connection to the database from another container in the same pod, and it works. The interruption is only on the ProxySQL sidecar container.
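One way to confirm where ProxySQL's packets actually land would be to inspect the per-class counters while the experiment runs. A sketch, assuming the same target PID (1217772) and interface as in the Litmus commands above:

```shell
# Inside the pod's network namespace (PID taken from the nsenter commands above):
# per-class packet/drop counters reveal whether MySQL traffic is being
# steered into class 1:3, where the netem 100% loss applies.
sudo nsenter -t 1217772 -n tc -s qdisc show dev eth0
sudo nsenter -t 1217772 -n tc -s class show dev eth0
sudo nsenter -t 1217772 -n tc filter show dev eth0
```

If the drop counters on class 1:3 climb while only ProxySQL traffic is flowing, the loss is hitting packets that match neither filter (e.g. via the prio qdisc's default TOS-based priomap).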
I attached the full log.
error.log