foss-for-synopsys-dwc-arc-processors/linux

HS58 Linux crash in tcp_check_space when running iperf3

jzbydniewski opened this issue · 22 comments

Using Linux branch abrodkin-arc64-5.15.y, updated to Linux 5.15.127. Built with the arc-2023.09 Buildroot, using the 2023.09 toolchain with the latest patches.

Linux is running in SMP mode.
I can observe a crash in tcp_check_space when running iperf3 over a PCIe WiFi or Ethernet adapter, see below:

iperf3 -c 192.168.50.180 -t 0 -p 5201 &
iperf3 -c 192.168.50.180 -t 0 -p 5202 &

Let iperf3 run; the crash usually takes minutes to hours to appear.

Oops
Path: (null)
CPU: 1 PID: 54 Comm: kworker/1:1 Tainted: G           O      5.15.127 #2
Workqueue: mm_percpu_wq vmstat_update
Invalid Read @ 0x000001e4 by insn @ tcp_check_space+0x4/0xb4
ECR: 0x00050100 EFA: 0x000001e4 ERET: 0x8170f49c
STAT: 0x80080802 [IE K     ]   BTA: 0x8170f498
SP: 0x82197c84  FP: 0x00000000 BLK: tcp_rcv_established+0x3fa/0x6ac
r00: 0x00000000 r01: 0x00000102 r02: 0x00000103
r03: 0x0000b1ce r04: 0xabe44598 r05: 0x00008000
r06: 0xabe44598 r07: 0x03800000 r08: 0x99999955
r09: 0x00000060 r10: 0x00000000 r11: 0x708a07f3
r12: 0xac9ea6f0

Stack Trace:
  tcp_check_space+0x4/0xb4
  tcp_rcv_established+0x3fa/0x6ac
  tcp_v4_do_rcv+0x128/0x1b4
  tcp_v4_rcv+0x8b2/0x944
  ip_protocol_deliver_rcu+0x24/0x208
  ip_local_deliver_finish+0x42/0x70
  ip_sublist_rcv_finish+0x24/0x34
  ip_sublist_rcv+0x108/0x124
  ip_list_rcv+0xae/0xc8
  __netif_receive_skb_list_core+0x110/0x15c
  netif_receive_skb_list_internal+0x188/0x268
  __napi_poll+0x7a/0x108
  net_rx_action+0xd4/0x220
  __do_softirq+0x110/0x290
  irq_exit+0xae/0xc4
  handle_domain_irq+0x7a/0xb4
  handle_interrupt+0x60/0x64

From my observations:

  • disabling the HW prefetcher doesn't help - the issue is still there
  • the issue doesn't reproduce with single-core SMP (so also no reproduction on nSIM)
  • with ftrace enabled the issue is not reproducible, probably due to the huge performance drop
  • not reproducible with iperf3 running on localhost

It looks like the reproduction rate can be decreased (by how much is still to be determined; I'm not sure whether it is a fix or not) by the following patch:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 6849094e5e5e..35ec4900720a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5471,7 +5471,7 @@ void tcp_check_space(struct sock *sk)
        }
 }

-static inline void tcp_data_snd_check(struct sock *sk)
+__attribute__ ((optimize(0))) static  inline void tcp_data_snd_check(struct sock *sk)
 {
        tcp_push_pending_frames(sk);
        tcp_check_space(sk);
@@ -5802,7 +5802,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
  *     the rest is checked inline. Fast processing is turned on in
  *     tcp_data_queue when everything is OK.
  */
-void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
+__attribute__ ((optimize(0))) void tcp_rcv_established(struct sock *sk, struct sk_buff *skb)
 {
        const struct tcphdr *th = (const struct tcphdr *)skb->data;
        struct tcp_sock *tp = tcp_sk(sk);

Here are listings of tcp_rcv_established in the default and optimize(0) variants:
tcp_tcv_established-no-opt0.txt
tcp_tcv_established-opt0.txt

After a few more days of testing, still no reproduction with the optimize(0) patch.

Another stack trace, now with a PCIe Ethernet adapter based on the RTL8169:

F:  [  5] 4818.01-4819.01 sec  35.0 MBytes   293 Mbits/sec    0   4.01 MBytes       
F:  [  5] 4819.01-4820.00 sec  36.2 MBytes   307 Mbits/sec    0   4.01 MBytes       
R:  [  5] 4819.00-4820.02 sec  61.8 MBytes   509 Mbits/sec                  

Oops
Path: (null)
CPU: 1 PID: 20 Comm: ksoftirqd/1 Tainted: G           O      5.15.127 #2
Invalid Read @ 0x000001e4 by insn @ tcp_check_space+0x4/0xb4
ECR: 0x00050100 EFA: 0x000001e4 ERET: 0x8176a250
STAT: 0x80080802 [IE K     ]   BTA: 0x8176a24c
SP: 0x82075cc4  FP: 0x00000002 BLK: tcp_rcv_established+0x3fa/0x6ac
r00: 0x00000000 r01: 0x00000101 r02: 0x00000102
r03: 0x001a122e r04: 0xabe4459c r05: 0x00000000
r06: 0xabe44598 r07: 0x0b800000 r08: 0xf1a9f5b5
r09: 0x000008b9 r10: 0x00000267 r11: 0x00000000
r12: 0x00000140

Stack Trace:
  tcp_check_space+0x4/0xb4
  tcp_rcv_established+0x3fa/0x6ac
  tcp_v4_do_rcv+0x128/0x1b4
  tcp_v4_rcv+0x8b2/0x944
  ip_protocol_deliver_rcu+0x24/0x208
  ip_local_deliver_finish+0x42/0x70
  ip_sublist_rcv_finish+0x24/0x34
  ip_sublist_rcv+0x108/0x124
  ip_list_rcv+0xae/0xc8
  __netif_receive_skb_list_core+0x110/0x15c
  netif_receive_skb_list_internal+0x188/0x268
  napi_gro_complete.constprop.0.isra.0+0xca/0xfc
  dev_gro_receive+0x1d6/0x490
  napi_gro_receive+0x3c/0x118
  rtl8169_poll+0x1c2/0x4b0
  __napi_poll+0x26/0x108
  net_rx_action+0xd4/0x220
  __do_softirq+0x110/0x290
  run_ksoftirqd+0x54/0x70
  smpboot_thread_fn+0x142/0x1cc
  kthread+0xe8/0x120
  ret_from_fork+0x14/0x18

I had a look at the difference between the vmlinux with the optimize(0) patch and without it. It looks like it's not only these two functions that are affected; in fact a few more functions that were inlined with default optimization (or optimized out - it's difficult to tell which) become regular functions with optimize(0). This is the list of functions present only with the optimize(0) patch:

image

The issue also reproduces with the ATLD atomics implementation.

I've tried to reproduce the issue in nSIM (HS58x3 configuration + virtio for the Ethernet adapter) but without success. It runs with no errors.

Example cache operations that I can see being issued while iperf3 is running. The operations originate from our PCIe WiFi driver:

dma_cache_wback_inv addr 0x8728c740 sz 2048
dma_cache_wback_inv addr 0x8728b440 sz 2048
[..]
dma_cache_inv addr 0x86689918 sz 1500
dma_cache_wback addr 0x86662118 sz 1500
[..]
dma_cache_wback addr 0x8668f620 sz 28
[..]
dma_cache_inv addr 0x8665f918 sz 1500

It can be seen that not all operations are aligned to the cache line size (64 B).
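To make that concrete, here is the arithmetic for one of the dma_cache_inv lines above, checked against a 64 B line:

  dma_cache_inv addr 0x86689918 sz 1500
    start: 0x86689918 -> offset within its line = 0x918 & 0x3f = 0x18 (24 bytes into the line)
    end:   0x86689918 + 1500 = 0x86689ef4 -> offset = 0xf4 & 0x3f = 0x34
  so this range both starts and ends in the middle of a cache line.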

With another PCIe Ethernet adapter on which we also observe this issue, the cache operations look like below.
We're using the RTL8169 driver here.

[..]
dma_cache_wback addr 0x8223290a sz 1514
dma_cache_inv addr 0x8087c000 sz 60
dma_cache_wback addr 0x82b4790a sz 1514
dma_cache_inv addr 0x8087c000 sz 60
dma_cache_wback addr 0x82b4410a sz 1514
dma_cache_wback addr 0x8092290a sz 1514
[..]

There are only two cache operations used when iperf3 is running:

  • dma_cache_wback -> results in OP_FLUSH
    -> addresses/sizes are not aligned to the cache line, but this is only a writeback (flush) operation, which to my understanding shouldn't break anything.

  • dma_cache_inv -> results in OP_FLUSH_N_INV
    -> always called on page-aligned addresses.
    -> the size is not aligned to the cache line, but I can see in the driver that it operates on full-page-sized buffers, so I don't see any issue with that (a debug-check sketch for unaligned invalidates follows this list).
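A minimal sketch of the kind of check I have in mind for catching bad cases anyway; dma_cache_inv() is the ARC helper seen in the logs above and L1_CACHE_BYTES the existing line-size constant, while the wrapper itself is hypothetical, purely a debug aid, and assumes the (phys_addr_t, size) signature:

#include <linux/bug.h>
#include <linux/cache.h>
#include <linux/types.h>

/* Hypothetical debug aid: warn whenever a DMA invalidate is issued on a
 * range that is not cache-line aligned, since an invalidate that only
 * partially covers a line could throw away unrelated dirty data that
 * happens to share that line. */
static void dma_cache_inv_checked(phys_addr_t start, unsigned long sz)
{
        WARN_ONCE((start | sz) & (L1_CACHE_BYTES - 1),
                  "unaligned DMA invalidate: start %pa, size %lu\n",
                  &start, sz);
        dma_cache_inv(start, sz);
}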

SmaRT traces + TLB dump from Metaware Debugger:

smart-cpu0.txt
smart-cpu1.txt
smart-cpu2.txt
tlb.txt

Currently we're using a workaround described in #163 (comment)

I did some experiments to determine which part of the code has to have O0 optimization applied, and here are the results.

The calling order is: tcp_rcv_established → tcp_data_snd_check → tcp_check_space

tcp_data_snd_check \ tcp_rcv_established | O2 (default)                      | O0
O2 (default)                             | BUG (default)                     | BUG (same tcp_data_snd_check lst)
O0                                       | OK                                | OK (current workaround)
noinline                                 | OK (same tcp_data_snd_check lst)  | -

Conclusion 1:
The issue doesn't occur when the O0 optimization level is applied to the tcp_data_snd_check function.
This suggests the tcp_data_snd_check function may be the issue here.

Conclusion 2:
In one case the bug is reproducible and in the other it isn't, yet the tcp_data_snd_check listings in both cases are identical. Both have the default optimization level (O2) applied and neither is inlined: the first has an explicit noinline attribute, and the second isn't inlined because the calling function has O0 applied.
This implies the tcp_data_snd_check code isn't the problem.
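For reference, the noinline variant from the table is essentially the following - a sketch of the experimental patch, not something that exists in the tree; the function body is the same as in the optimize(0) diff above:

/* Experimental variant only: keep the default -O2, but force
 * tcp_data_snd_check out of line so its generated code matches the
 * O0-caller case from the table above. */
static noinline void tcp_data_snd_check(struct sock *sk)
{
        tcp_push_pending_frames(sk);
        tcp_check_space(sk);
}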

As you can see, the conclusions contradict each other. The only explanation I can come up with is that it isn't an issue in a specific place in the code, but rather some timing or other problem elsewhere.

Here are the listings of the mentioned functions for all cases:
vmlinux-current_workaround_opt0.txt
vmlinux-data_snd_noinline.txt
vmlinux-data_snd_opt0.txt
vmlinux-default_opt2.txt
vmlinux-rcv_established_opt0.txt

Looking at the stack traces, I can see that it's always one of the registers (usually r0; it might depend on the debug changes introduced) that contains an invalid value. Could it be that something goes wrong on context switches? Would it be possible to add a simple check of the context on save/restore (a basic checksum, or just a sum, calculated on save and verified on restore) to ensure nothing gets lost?
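Something like the following is what I have in mind, as a very rough sketch; ctx_sum() is a made-up helper, and a real check would have to be done in the low-level entry/exit assembly around the actual register save/restore, not in C:

#include <linux/kernel.h>
#include <asm/ptrace.h>

/* Plain sum over a saved register frame: compute it right after the frame
 * is saved, stash it next to the frame, recompute it just before restore
 * and warn on any mismatch. */
static unsigned long ctx_sum(const struct pt_regs *regs)
{
        const unsigned long *p = (const unsigned long *)regs;
        unsigned long sum = 0;
        unsigned int i;

        for (i = 0; i < sizeof(*regs) / sizeof(*p); i++)
                sum += p[i];
        return sum;
}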

I added the following change to tcp_check_space to catch the issue earlier:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fd9067062f2a..0152a30bfc35 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5463,6 +5463,12 @@ void tcp_check_space(struct sock *sk)
 {
        /* pairs with tcp_poll() */
        smp_mb();
+
+  asm volatile(
+    "brlt    r0,0,2f\n\
+    flag 1\n\
+    2:\n");
+

From what I see, only these two instructions were added to tcp_check_space; it didn't change any other part of the function.
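My reading of what the added check does (an annotated copy of the same two instructions; kernel pointers on this platform are 0x8xxxxxxx, i.e. negative when treated as signed 32-bit values):

asm volatile(
        "brlt   r0, 0, 2f\n"    /* sk in r0 looks like a kernel pointer (negative) -> skip */
        "flag   1\n"            /* otherwise halt this core right here, before the bad load */
        "2:\n");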

I was able to reproduce the issue with this change; here is what I can see in mdb:
core 1:
image

core 2:
image

core 3:
image

Here are SmaRT traces for each core:
smart-log-core-1.txt
smart-log-core-2.txt
smart-log-core-3.txt

It's interesting that the SmaRT trace for core 1 doesn't match its call stack. Also, it looks like it was in fact r14 that got corrupted, as right before the call to tcp_check_space r14 is copied to r0:
image

It might be that r14 gets corrupted in another function, __tcp_push_pending_frames.

Another reproduction; this time the SmaRT trace matches the call stack. Here is my analysis of the situation:

image

From what I see, the contents of registers r16..r24 match the contents of memory. The problem is with the r14 value, which doesn't match memory, while r15, loaded by the same instruction (see below; it's a double-word load), is not corrupted (!).
image

Could it be another core corrupting r14? Is that even possible?

I think this issue can be linked with #168.
I will check everything related to saving/restoring registers on the stack. Probably we also need to move/remove the pad2 field in struct user_regs_struct, or something else.

Could you check this patch?
regs.patch

@xxkent I tried with this patch but it's the same
image

Do I understand correctly that SmaRT shows the full trace of instructions executed by a particular CPU (even if there was a context switch at some point in time), and that the registers of one core should not be affected by other cores?

Yes, SmaRT shows the last executed instructions for a particular core (3 traces from 3 CPUs in your case). These traces can also include switched contexts.
Each core has its own execution context containing its register set, and one core can't directly affect another core's registers. It is possible indirectly, for example by saving the context (registers) to memory on one core and restoring it on another.

Contents of dcache for each core + L2 cache:
dcache-core1.txt
dcache-core2.txt
dcache-core3.txt
l2-cache.txt

image

According to the contents of the data cache for core 1, where the issue triggered, the ldd.ab r14, [sp, 8] instruction operates on a memory location that spans two data cache lines. For some reason, r15 got loaded with the correct value while r14 didn't.

Regarding why multi_cpu_stop is called: it looks like a result of clock source switching (see the timekeeping_notify function).

I'm now testing with the kernel option CONFIG_ARC_HAS_LL64=n. No reproduction within 3 days. Buildroot didn't require a change, as it doesn't pass "-mll64" to gcc, and from what I see gcc doesn't use these double load/store instructions by default.

What's interesting is that this change is similar to the optimize(0) workaround: from what I see, with optimize(0) the compiler doesn't use double load/store instructions either - at least in the affected functions.
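Just to illustrate what the option changes (my understanding: CONFIG_ARC_HAS_LL64=y builds the kernel with -mll64, which lets gcc combine suitable 64-bit accesses into ldd/std; the function below is only a made-up example of code where that can happen):

/* With -mll64 gcc may emit a single ldd/std pair for this 64-bit copy;
 * with CONFIG_ARC_HAS_LL64=n it has to use two 32-bit ld/st instead. */
void copy_u64(unsigned long long *dst, const unsigned long long *src)
{
        *dst = *src;
}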

This is what I see with -off=flush_dcache

image

It looks like the value that should have been loaded into r14 (at address 0x81c35cbc) sits in an L1 data cache line marked as dirty, and it differs from what the memory view shows at the same address. In contrast, the data loaded into r15 is the same in cache and memory. Any thoughts on this?
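For what it's worth, the straddle arithmetic for that load (64 B lines, and assuming the r14 half of the double-word load is the one at 0x81c35cbc, as stated above):

  ldd.ab r14, [sp, 8] with sp = 0x81c35cbc
    r14 <- 0x81c35cbc..0x81c35cbf -> line 0x81c35c80 (last word of that line, offset 0x3c)
    r15 <- 0x81c35cc0..0x81c35cc3 -> line 0x81c35cc0 (first word of the next line)
  so the two halves of one double-word load come from two different cache lines,
  and only the line holding r14's half is the dirty/stale one.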

We've been asked to check whether the bug reproduces with delay slot instructions disabled.

Result
With the -fno-delayed-branch compiler option the issue still reproduces; it took longer, but after 12h it eventually reproduced (with the PCIe Ethernet card).

Detailed info
I recompiled Linux with this change to disable delay slot instructions:

diff --git a/arch/arc/Makefile b/arch/arc/Makefile
index d713690205e7..567a55b265ce 100644
--- a/arch/arc/Makefile
+++ b/arch/arc/Makefile
@@ -128,7 +128,7 @@ KBUILD_CFLAGS_MODULE        += -mcmodel=large
 endif
 
 # Finally dump eveything into kernel build system
-KBUILD_CFLAGS  += $(cflags-y)
+KBUILD_CFLAGS  += $(cflags-y) -fno-delayed-branch
 KBUILD_AFLAGS  += $(KBUILD_CFLAGS)
 KBUILD_LDFLAGS += $(ldflags-y)

I checked the contents of the vmlinux listing files (looking for the .d instruction suffix) before and after the change.

With delay slot instructions enabled there are ~230,000 occurrences of various branch-with-delay-slot instructions, such as bgt.d, brne.d, bl.d, jl.d, b.d, j_s.d.

With delay slots disabled there were still some *.d branch instructions generated, but far fewer (~8,000). The real difference is that there are no occurrences of b.d or bl.d, while the others listed above are still there.

Here is a comparison of a part of the function tcp_write_xmit in both cases:
image

On the left there are no .d instructions around the series of ldd instructions.

Is this experiment valid? Does it tell us anything meaningful? Why are there still some delay-slot instructions generated even though -fno-delayed-branch is used?