pcengines/apu2-documentation

NIC PCIe link lost

Closed this issue · 8 comments

After upgrading from v4.0.7 (which my APU3 shipped with) to v4.11.0.5, I saw the on-board NIC disconnect. It seems to have come back fine after a reboot, at least for now. Note that the device is relatively new, so it's possible this is unrelated to the firmware. SNMP indicates the CPU temperature was in the 50-degree range when the issue appeared.

Debian buster, kernel 4.19.98-1.

Apr 20 21:44:24 sfo-router kernel: [127576.922378] igb 0000:02:00.0 enp2s0: PCIe link lost
Apr 20 21:44:29 sfo-router kernel: [127582.038035] ------------[ cut here ]------------
Apr 20 21:44:29 sfo-router kernel: [127582.038046] NETDEV WATCHDOG: enp2s0 (igb): transmit queue 0 timed out
Apr 20 21:44:29 sfo-router kernel: [127582.038093] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:466 dev_watchdog+0x20d/0x220
Apr 20 21:44:29 sfo-router kernel: [127582.038095] Modules linked in: binfmt_misc ip6_tables xt_nat xt_conntrack xt_mark nft_chain_nat_ipv4e
Apr 20 21:44:29 sfo-router kernel: [127582.038206]  sdhci aesni_intel mmc_core scsi_mod igb aes_x86_64 crypto_simd cryptd glue_helper i2c_ps
Apr 20 21:44:29 sfo-router kernel: [127582.038247] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G           OE     4.19.0-8-amd64 #1 Debian 4.19.1
Apr 20 21:44:29 sfo-router kernel: [127582.038250] Hardware name: PC Engines apu3/apu3, BIOS v4.11.0.5 03/29/2020
Apr 20 21:44:29 sfo-router kernel: [127582.038268] RIP: 0010:dev_watchdog+0x20d/0x220
Apr 20 21:44:29 sfo-router kernel: [127582.038273] Code: 00 49 63 4e e0 eb 92 4c 89 e7 c6 05 92 f2 ad 00 01 e8 37 b9 fc ff 89 d9 4c 89 e6 44
Apr 20 21:44:29 sfo-router kernel: [127582.038276] RSP: 0018:ffff8d7daab03e90 EFLAGS: 00010286
Apr 20 21:44:29 sfo-router kernel: [127582.038280] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
Apr 20 21:44:29 sfo-router kernel: [127582.038282] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff8d7daab166b0
Apr 20 21:44:29 sfo-router kernel: [127582.038285] RBP: ffff8d7da288045c R08: 0000000000000269 R09: 0000000000000004
Apr 20 21:44:29 sfo-router kernel: [127582.038287] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8d7da2880000
Apr 20 21:44:29 sfo-router kernel: [127582.038290] R13: 0000000000000002 R14: ffff8d7da2880480 R15: 0000000000000008
Apr 20 21:44:29 sfo-router kernel: [127582.038294] FS:  0000000000000000(0000) GS:ffff8d7daab00000(0000) knlGS:0000000000000000
Apr 20 21:44:29 sfo-router kernel: [127582.038297] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Apr 20 21:44:29 sfo-router kernel: [127582.038299] CR2: 00007efdb5646a10 CR3: 0000000126d74000 CR4: 00000000000406e0
Apr 20 21:44:29 sfo-router kernel: [127582.038302] Call Trace:
Apr 20 21:44:29 sfo-router kernel: [127582.038309]  <IRQ>
Apr 20 21:44:29 sfo-router kernel: [127582.038321]  ? pfifo_fast_enqueue+0x110/0x110
Apr 20 21:44:29 sfo-router kernel: [127582.038329]  call_timer_fn+0x2b/0x130
Apr 20 21:44:29 sfo-router kernel: [127582.038335]  run_timer_softirq+0x1c7/0x3e0
Apr 20 21:44:29 sfo-router kernel: [127582.038341]  ? __hrtimer_run_queues+0x130/0x280
Apr 20 21:44:29 sfo-router kernel: [127582.038347]  ? ktime_get+0x3a/0xa0
Apr 20 21:44:29 sfo-router kernel: [127582.038369]  __do_softirq+0xde/0x2d8
Apr 20 21:44:29 sfo-router kernel: [127582.038391]  irq_exit+0xba/0xc0
Apr 20 21:44:29 sfo-router kernel: [127582.038396]  smp_apic_timer_interrupt+0x74/0x140
Apr 20 21:44:29 sfo-router kernel: [127582.038402]  apic_timer_interrupt+0xf/0x20
Apr 20 21:44:29 sfo-router kernel: [127582.038405]  </IRQ>
Apr 20 21:44:29 sfo-router kernel: [127582.038412] RIP: 0010:native_safe_halt+0xe/0x10
Apr 20 21:44:29 sfo-router kernel: [127582.038416] Code: ff ff 7f c3 65 48 8b 04 25 40 5c 01 00 f0 80 48 02 20 48 8b 00 a8 08 75 c4 eb 80 90
Apr 20 21:44:29 sfo-router kernel: [127582.038418] RSP: 0018:ffffb6b5406afe38 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
Apr 20 21:44:29 sfo-router kernel: [127582.038422] RAX: 0000000080000000 RBX: ffff8d7da99d6c00 RCX: 0000000000000034
Apr 20 21:44:29 sfo-router kernel: [127582.038425] RDX: 4ec4ec4ec4ec4ec5 RSI: ffffffff89cba4e0 RDI: ffff8d7da99d6c64
Apr 20 21:44:29 sfo-router kernel: [127582.038427] RBP: ffff8d7da99d6c64 R08: 0000000000000002 R09: 0000000000021980
Apr 20 21:44:29 sfo-router kernel: [127582.038430] R10: 000073d893358344 R11: ffff8d7daab210a8 R12: 0000000000000001
Apr 20 21:44:29 sfo-router kernel: [127582.038432] R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
Apr 20 21:44:29 sfo-router kernel: [127582.038441]  acpi_safe_halt+0x1b/0x30
Apr 20 21:44:29 sfo-router kernel: [127582.038448]  acpi_idle_enter+0x103/0x2a0
Apr 20 21:44:29 sfo-router kernel: [127582.038459]  cpuidle_enter_state+0x71/0x320
Apr 20 21:44:29 sfo-router kernel: [127582.038467]  do_idle+0x228/0x270
Apr 20 21:44:29 sfo-router kernel: [127582.038489]  cpu_startup_entry+0x6f/0x80
Apr 20 21:44:29 sfo-router kernel: [127582.038510]  start_secondary+0x1a4/0x1f0
Apr 20 21:44:29 sfo-router kernel: [127582.038517]  secondary_startup_64+0xa4/0xb0
Apr 20 21:44:29 sfo-router kernel: [127582.038524] ---[ end trace ae502ea70790dd17 ]---
Apr 20 21:44:29 sfo-router kernel: [127582.038607] igb 0000:02:00.0 enp2s0: Reset adapter

@TheBlueMatt do you have the same problem with v4.11.0.4?

Downgraded. I'll let you know if it happens again, but so far it's been a roughly once-a-week occurrence with high variance.

Hmm, this is the first time I've seen something like that. It will be very difficult to investigate since it's hard to reproduce (once a week is a very low rate). Similar problems have been reported on other machines as far back as 2014: https://sourceforge.net/p/e1000/bugs/430/
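Since the failure is this rare, it may help to collect every occurrence from the kernel log so the intervals can be compared across firmware versions. Below is a minimal sketch in Python, assuming syslog-style lines in the same format as the log in the report above; the function name `find_link_lost` is made up for illustration, and the regular expression only covers the `igb` "PCIe link lost" message.

```python
import re

# Matches kernel log lines like the one in the report above:
#   "... kernel: [127576.922378] igb 0000:02:00.0 enp2s0: PCIe link lost"
# The exact log layout is an assumption based on that single sample.
LINK_LOST = re.compile(
    r"kernel: \[\s*(?P<uptime>\d+\.\d+)\] "
    r"igb (?P<pci>\S+) (?P<iface>\S+): PCIe link lost"
)


def find_link_lost(lines):
    """Return (uptime_seconds, pci_address, interface) for each link-lost line."""
    events = []
    for line in lines:
        match = LINK_LOST.search(line)
        if match:
            events.append(
                (float(match.group("uptime")), match.group("pci"), match.group("iface"))
            )
    return events


# Two lines copied from the report: only the first is a link-lost event.
sample = [
    "Apr 20 21:44:24 sfo-router kernel: [127576.922378] igb 0000:02:00.0 enp2s0: PCIe link lost",
    "Apr 20 21:44:29 sfo-router kernel: [127582.038607] igb 0000:02:00.0 enp2s0: Reset adapter",
]
events = find_link_lost(sample)
for uptime, pci, iface in events:
    print(f"{iface} ({pci}) lost PCIe link at uptime {uptime:.0f}s")
```

To scan a real log, pass an open file object instead of `sample`; where the kernel log lives depends on the system's logging setup.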

Closing as a likely hardware error - I haven't seen it on any other APUs and friends with similar firmwares haven't seen it either. PCIe is sensitive enough that hardware errors could totally explain it.

@TheBlueMatt I'd suggest trying the APU with a different PSU.

This is off the pcengines-supplied PSU... I've never had any trouble with the ones they sell before. In any case, let me try a new board and see if that's the issue first before we start speculating :)

Hi, have you solved the PCIe link lost problem? I have the same problem with pxe8764.