XDP TX queues getting stuck when the tx posted packet counter overflows beyond u32 max
We are observing an issue where some of the CPUs handling XDP TX queues sit at 100% CPU usage without processing any packets.
Our setup
Instance: GCE n2-standard-32
Configured queues: 4 rx, 4 tx (CPU cores 0-3 are used for the RX queues and the XDP program, and CPU cores 4-7 are used for handling XDP_TX work)
Driver Version: 1.3.4
Kernel/OS Version: Linux 6.1.0-17-cloud-amd64 SMP PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30) x86_64 GNU/Linux
We are attaching an eBPF/XDP program in native mode which modifies the packets and mostly returns the XDP_TX action.
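For reference, a minimal sketch of the kind of XDP program involved (purely illustrative, not our production program; it only swaps MAC addresses and returns XDP_TX):

```c
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_tx_sketch(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_end = (void *)(long)ctx->data_end;
	struct ethhdr *eth = data;
	unsigned char tmp[ETH_ALEN];

	if ((void *)(eth + 1) > data_end)	/* verifier-required bounds check */
		return XDP_PASS;

	/* Swap src/dst MACs so the modified frame is bounced back out of
	 * the same interface via XDP_TX. */
	__builtin_memcpy(tmp, eth->h_source, ETH_ALEN);
	__builtin_memcpy(eth->h_source, eth->h_dest, ETH_ALEN);
	__builtin_memcpy(eth->h_dest, tmp, ETH_ALEN);

	return XDP_TX;
}

char _license[] SEC("license") = "GPL";
```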
Observation
Continuous 100% CPU usage is observed on CPUs 6 and 7, which process XDP_TX packets, while there isn't much usage on CPUs 4 and 5, which also process XDP_TX packets. The ksoftirqd processes for CPUs 6 and 7 are consuming the 100% CPU.
On checking the CPU flame graph for these cores, we see that most of the time is spent in gve_xdp_poll and gve_clean_xdp_done.
On checking the ethtool counters, we see that the tx_posted_desc counter is much lower than the tx_completed_desc counter for queues 6 and 7:
# ethtool -S ens4 | grep '\[[4-7]\]' | grep "posted\|completed" | grep tx
tx_posted_desc[4]: 1622967499
tx_completed_desc[4]: 1622967499
tx_posted_desc[5]: 2328007405
tx_completed_desc[5]: 2328007405
tx_posted_desc[6]: 154
tx_completed_desc[6]: 4294967274
tx_posted_desc[7]: 170
tx_completed_desc[7]: 4294967292
The tx_completed_desc counters for queues 6 and 7 are very close to the uint32 max (2^32 - 1 = 4294967295), which indicates that tx_posted_desc has overflowed and wrapped around, which explains its low value.
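As a quick illustration of the wraparound arithmetic (the increment of 176 is an assumed value; the real posted count is only inferred):

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* A u32 counter incremented past UINT32_MAX wraps back to a small
	 * value, so a queue that has posted slightly more than 2^32
	 * descriptors can legitimately report something like 154. */
	uint32_t posted = 4294967274u;	/* close to UINT32_MAX, like queue 6 */
	posted += 176;			/* wraps modulo 2^32 */
	printf("%u\n", posted);		/* prints 154 */
	return 0;
}
```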
According to the gve_clean_xdp_done code logic, execution will never enter the for loop, because clean_end after the overflow is numerically lower than tx->done, so the function requests a repoll every time. This matches our observation that the counters (tx_posted/tx_completed) are not incremented even though the CPU flame graph shows time being spent in gve_clean_xdp_done.
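A hedged, simplified sketch of the pattern we believe is at fault (field names follow the upstream gve driver, everything else is illustrative, not the actual driver source):

```c
#include <stdint.h>

typedef uint32_t u32;

struct gve_tx_ring {		/* minimal stand-in for the real struct */
	u32 done;		/* free-running count of cleaned descriptors */
	u32 mask;		/* ring size - 1 */
};

/* tx->done and clean_end are free-running u32 counters. Once they wrap
 * past UINT32_MAX, clean_end can be numerically smaller than tx->done,
 * the loop body never runs, nothing is cleaned, and the NAPI poll keeps
 * asking to be rescheduled. */
static int clean_xdp_done_sketch(struct gve_tx_ring *tx, u32 to_do)
{
	u32 clean_end = tx->done + to_do;
	int pkts = 0;

	/* If clean_end wrapped, (tx->done < clean_end) is false on entry. */
	for (; tx->done < clean_end; tx->done++) {
		u32 idx = tx->done & tx->mask;

		(void)idx;	/* ... free the completed descriptor at idx ... */
		pkts++;
	}
	return pkts;
}
```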
The analogous logic for non-XDP TX (gve_clean_tx_done) handles this scenario by running its for loop from 0 up to to_do, which could be why the issue is not seen in non-XDP flows.
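For comparison, a hedged sketch of that wrap-safe pattern: iterate a fixed count, let the counter wrap naturally, and always derive the ring index from the mask (reusing the gve_tx_ring stand-in from the previous sketch; again illustrative, not the exact driver code):

```c
/* Wrap-safe variant: iterating exactly to_do times means a u32 wrap of
 * tx->done is harmless, since the ring index is always taken modulo the
 * ring size via the mask. */
static int clean_done_wrap_safe_sketch(struct gve_tx_ring *tx, u32 to_do)
{
	int pkts = 0;
	u32 j;

	for (j = 0; j < to_do; j++) {
		u32 idx = tx->done & tx->mask;

		(void)idx;	/* ... free the completed descriptor at idx ... */
		tx->done++;
		pkts++;
	}
	return pkts;
}
```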
Hello, thanks for the report, and sorry for the slow response. We were working on this internally, and this should be fixed in the next version (landing today or tomorrow).
Thanks again for this. I did manage to get the release out today, so this should be fixed in v1.4.2. Please let me know if you run into any issues.