KastB/r8169

Problem "transmit queue 0 timed out" with this patched driver


Hello,

I tested this version and got much better power efficiency (more than 3 watts saved) on a Fujitsu D3400 mainboard with a Celeron G3900 processor (Skylake).

However, the system loses its network connection during netio stress tests, about 30 seconds every time:

[  188.786003] [drm] RC6 on
[  203.477795] r8169 0000:01:00.0 eth0: PCI error (cmd = 0xffff, status = 0xffff)
[  203.482593] r8169 0000:01:00.0 eth0: link up
[  203.496237] r8169 0000:01:00.0 eth0: link up
[  204.409424] r8169 0000:01:00.0 eth0: PCI error (cmd = 0xffff, status = 0xffff)
[  204.414806] r8169 0000:01:00.0 eth0: link up
[  204.428215] r8169 0000:01:00.0 eth0: link up
[  210.785226] [drm] RC6 on
[  232.784823] [drm] RC6 on
[  254.784358] [drm] RC6 on
[  276.783859] [drm] RC6 on
[  298.783386] [drm] RC6 on
[  312.790641] ------------[ cut here ]------------
[  312.790666] WARNING: CPU: 1 PID: 0 at /build/linux-lVEVrl/linux-4.7.8/net/sched/sch_generic.c:272 dev_watchdog+0x220/0x230
[  312.790672] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
[  312.790675] Modules linked in: cpuid msr cpufreq_stats cpufreq_userspace cpufreq_powersave cpufreq_conservative nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel hmac drbg ansi_cprng iTCO_wdt iTCO_vendor_support evdev aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd snd_hda_intel pcspkr serio_raw snd_hda_codec i915 snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore i2c_i801 hci_uart btbcm btqca btintel drm_kms_helper drm mei_me bluetooth mei shpchp i2c_algo_bit fujitsu_laptop wmi intel_lpss_acpi rfkill intel_lpss mfd_core video acpi_pad tpm_crb button tpm_tis tpm vboxnetadp(OE)
[  312.790765]  vboxnetflt(OE) vboxdrv(OE) fuse autofs4 ext4 crc16 jbd2 mbcache btrfs xor raid6_pq dm_mod raid1 md_mod sg sd_mod crc32c_intel psmouse ahci libahci r8169(OE) mii xhci_pci libata xhci_hcd usbcore scsi_mod usb_common fan thermal i2c_hid hid fjes
[  312.790802] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G     U     OE   4.7.0-0.bpo.1-amd64 #1 Debian 4.7.8-1~bpo8+1
[  312.790805] Hardware name: FUJITSU D3400-B1/D3400-B1, BIOS V5.0.0.11 R1.17.0 for D3400-B1x                    09/16/2016
[  312.790809]  0000000000000286 6e251c1d7a9fa6e3 ffffffff84d1c805 ffff88022f503e18
[  312.790814]  0000000000000000 ffffffff84a7c9c4 0000000000000000 ffff88022f503e70
[  312.790820]  ffff880222292000 0000000000000001 ffff8802220d5080 0000000000000001
[  312.790824] Call Trace:
[  312.790827]  <IRQ>  [<ffffffff84d1c805>] ? dump_stack+0x5c/0x77
[  312.790842]  [<ffffffff84a7c9c4>] ? __warn+0xc4/0xe0
[  312.790848]  [<ffffffff84a7ca3f>] ? warn_slowpath_fmt+0x5f/0x80
[  312.790856]  [<ffffffff84f08120>] ? dev_watchdog+0x220/0x230
[  312.790862]  [<ffffffff84f07f00>] ? dev_deactivate_queue.constprop.32+0x60/0x60
[  312.790869]  [<ffffffff84ae6910>] ? call_timer_fn+0x30/0x120
[  312.790875]  [<ffffffff84f07f00>] ? dev_deactivate_queue.constprop.32+0x60/0x60
[  312.790880]  [<ffffffff84ae7881>] ? run_timer_softirq+0x231/0x2e0
[  312.790887]  [<ffffffff84fe20b6>] ? __do_softirq+0x106/0x294
[  312.790891]  [<ffffffff84a82306>] ? irq_exit+0x86/0x90
[  312.790897]  [<ffffffff84fe1ebe>] ? smp_apic_timer_interrupt+0x3e/0x50
[  312.790901]  [<ffffffff84fe01e2>] ? apic_timer_interrupt+0x82/0x90
[  312.790903]  <EOI>  [<ffffffff84ea38e2>] ? cpuidle_enter_state+0x112/0x260
[  312.790915]  [<ffffffff84abe31e>] ? cpu_startup_entry+0x2be/0x360
[  312.790921]  [<ffffffff84a4e1c1>] ? start_secondary+0x151/0x190
[  312.790926] ---[ end trace abcf6597c4d62904 ]---
[  312.810459] r8169 0000:01:00.0 eth0: link up
[  320.782685] [drm] RC6 on

The system is Debian Jessie with the 4.7.x backports kernel. The same problem occurs with the 4.8.7 and 4.8.10 kernels. The original Realtek r8168 driver works for me, but it is less energy efficient and delivers lower throughput.
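
For anyone comparing their own logs against the one above, here is a minimal Python sketch that scans dmesg text for the two failure signatures visible in it: the `PCI error` lines and the `NETDEV WATCHDOG` transmit timeout. The function name `scan_dmesg` and the event labels are made up for illustration. The regular expressions assume the exact message layout shown in this report, which may differ on other kernel versions. A read of `0xffff` for both `cmd` and `status` usually means the device did not respond on the bus at all, so the sketch flags that case separately.

```python
import re

# Matches the "[  312.790672] message" layout used by the dmesg output above.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
PCI_ERR_RE = re.compile(
    r"PCI error \(cmd = (0x[0-9a-f]+), status = (0x[0-9a-f]+)\)"
)
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (\S+) \(([^)]+)\): transmit queue (\d+) timed out"
)


def scan_dmesg(text):
    """Return (timestamp, kind, detail) tuples for the failure lines found."""
    events = []
    for raw in text.splitlines():
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # not a timestamped kernel log line
        ts, msg = float(m.group(1)), m.group(2)

        pci = PCI_ERR_RE.search(msg)
        if pci:
            cmd = int(pci.group(1), 16)
            status = int(pci.group(2), 16)
            # All-ones reads usually mean the device did not answer on the bus.
            all_ones = cmd == 0xFFFF and status == 0xFFFF
            kind = "pci_error_all_ones" if all_ones else "pci_error"
            events.append((ts, kind, msg))
            continue

        wd = WATCHDOG_RE.search(msg)
        if wd:
            detail = f"{wd.group(1)} ({wd.group(2)}) queue {wd.group(3)}"
            events.append((ts, "tx_timeout", detail))
    return events


if __name__ == "__main__":
    # Three lines copied from the log in this report.
    sample = """\
[  203.477795] r8169 0000:01:00.0 eth0: PCI error (cmd = 0xffff, status = 0xffff)
[  203.482593] r8169 0000:01:00.0 eth0: link up
[  312.790672] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
"""
    for event in scan_dmesg(sample):
        print(event)
```

Feeding it the output of `dmesg` gives a short timeline of when the PCI errors and the timeout happened, which makes it easier to compare runs with different drivers or kernels.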

KastB commented

Sorry for the very late answer.
I couldn't reproduce this error on any hardware configuration I have available, so it's pretty hard for me to fix this bug.
Did you try enabling ASPM with the r8168 driver as well?
Which chip do you have? (dmesg should tell you.)
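
To check which ASPM policy the kernel is currently using, one option is to read `/sys/module/pcie_aspm/parameters/policy`, where the active entry is shown in square brackets. The following Python sketch extracts that entry. The helper names are hypothetical, and the sample string in the demo is only an illustration of the file's layout, not output from the system in this report.

```python
import re
from pathlib import Path

# Kernel-provided file listing the ASPM policies; the active one is bracketed.
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy name, or None if none is marked."""
    m = re.search(r"\[(\w+)\]", text)
    return m.group(1) if m else None


def read_aspm_policy(path=POLICY_PATH):
    """Read the policy file; return None if it is missing or unreadable."""
    try:
        return active_aspm_policy(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    # Illustrative input only; real contents depend on the kernel and its config.
    print(active_aspm_policy("default performance [powersave]"))
    print(read_aspm_policy())
```

This only reports the system-wide policy. Whether ASPM is actually enabled on the link to the network card is shown per device in the `LnkCtl` line of `lspci -vv`.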

Hi,
I now have a stable, working system with the r8168 driver and ASPM enabled. My motherboard is a Fujitsu D3400-B with an Intel H110 chipset.

lspci:

00:00.0 Host bridge: Intel Corporation Sky Lake Host Bridge/DRAM Registers (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Device 1902 (rev 06)
00:14.0 USB controller: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller (rev 31)
00:14.2 Signal processing controller: Intel Corporation Sunrise Point-H Thermal subsystem (rev 31)
00:16.0 Communication controller: Intel Corporation Sunrise Point-H CSME HECI #1 (rev 31)
00:17.0 SATA controller: Intel Corporation Device a102 (rev 31)
00:1c.0 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #7 (rev f1)
00:1f.0 ISA bridge: Intel Corporation Sunrise Point-H LPC Controller (rev 31)
00:1f.2 Memory controller: Intel Corporation Sunrise Point-H PMC (rev 31)
00:1f.3 Audio device: Intel Corporation Sunrise Point-H HD Audio (rev 31)
00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)
01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)
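
As a rough aid for answering the "which chip" question from a listing like the one above, this Python sketch pulls the Ethernet controller entries and their PCI revision out of default `lspci` output. The name `find_ethernet` is made up for illustration, and the regular expression assumes the `<slot> <class>: <device> (rev <hex>)` layout seen in this listing. The PCI revision only narrows down the variant; the driver's own message in dmesg remains the more precise source.

```python
import re

# Layout of a default `lspci` line: "<slot> <class>: <device> (rev <hex>)".
# The revision suffix is optional because not every device reports one.
LSPCI_RE = re.compile(
    r"^(?P<slot>[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) "
    r"(?P<cls>[^:]+): "
    r"(?P<device>.*?)"
    r"(?: \(rev (?P<rev>[0-9a-f]{2})\))?$"
)


def find_ethernet(lspci_text):
    """Return (slot, device, rev) for every 'Ethernet controller' line."""
    results = []
    for line in lspci_text.splitlines():
        m = LSPCI_RE.match(line.strip())
        if m and m.group("cls") == "Ethernet controller":
            results.append((m.group("slot"), m.group("device"), m.group("rev")))
    return results


if __name__ == "__main__":
    # Two lines copied from the lspci listing in this thread.
    sample = (
        "00:1f.4 SMBus: Intel Corporation Sunrise Point-H SMBus (rev 31)\n"
        "01:00.0 Ethernet controller: Realtek Semiconductor Co., Ltd. "
        "RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 0c)\n"
    )
    for slot, device, rev in find_ethernet(sample):
        print(slot, rev, device)
```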

KastB commented

Hi,
I tried to dig into this problem and compared the r8168 module to the r8169 module. The r8168 driver changes some additional registers when ASPM is enabled. Unfortunately, I couldn't find any documentation about those registers, and I have no idea why the original kernel patch that I adopted didn't set them. If I get my hands on hardware with a similar error, I can do some trial-and-error debugging, but for the time being I have run out of ideas. I'm sorry...

Hi,
No problem, my main goal was to give feedback on this patch. I have a good working solution with r8168. Thanks for your analysis.