openwrt/mt76

MT7981 5GHz occasionally cannot disconnect clients that have left and causes bad performance.

victor186 opened this issue · 29 comments

I'm testing an AX3000T at a restaurant ahead of a network upgrade, but I've noticed poor speeds on 5GHz at random. A radio restart fixes it, but while it occurs the network is effectively down due to low speed/high latency.

The AP is running on 80MHz/AX mode.
Openwrt 23.05.5.
[screenshot: Screenshot_20241019-203131_Chrome~2]

[screenshot: Screenshot_20241020-195607_Speedtest]

Is the second device also connected to the network?
I see really bad signal from it, communication with such device can highly decrease performance.

The devices on that list are on 2.4GHz.

Can you list your wifi clients(device models)?

I can't, because this device is running as an AP at a restaurant for both the administrative and the customers' Wi-Fi.

Looks like Qualcomm QCA9377 + windows 10 driver + 5GHz can cause this. No problems on 2.4 band.

Do you have driver 10.0.0.1272 for Windows installed?

I don't understand. A 5GHz Wi-Fi adapter with a QCA9377 is causing bad performance on the 5GHz network? I don't have QCA9377 on network and the router is MediaTek.

@victor186

I don't have QCA9377 on network

How can you be sure?

device is running as AP on a restaurant for administrative and client's Wi-Fi

The customers only use smartphones.
The only PC on Wi-Fi uses a Realtek Wi-Fi adapter.

If QCA9377 can affect 5GHz AP on mt76+mt7915(mt7981), then maybe some other clients can do the same.

I'm not an owner of QCA9377. I just helped a user to isolate the problem on openwrt 23.05.5 mt7981 device.

@nbd168 what do you think about this?

One thing you could try is copy the latest MT7981 firmware from https://github.com/openwrt/mt76/tree/master/firmware to your device. If that doesn't help, trying a recent snapshot might also be a good idea.

One thing you could try is copy the latest MT7981 firmware from https://github.com/openwrt/mt76/tree/master/firmware to your device.

Already done this, it didn't help.

If that doesn't help, trying a recent snapshot might also be a good idea.

That user didn't want to experiment with snapshot. Connecting QCA9377 to 2.4GHz AP solved issue with 5GHz AP for him.

I'd say there are too few details for us to be able to help you.

Openwrt 23.05.5. H3C Magic NX30 Pro.

Same issue here. Encountered it several times

Almost zero speed (1kb/s) through 5G wifi. Enough for DHCP but anything else will be broken, even ping.

I noticed that when this is happening, there are 2 dead clients (which may have left the wifi range at the same time) on the luci wifi page, with RX Rate / TX Rate 6.0 Mbit/s, 20 MHz. If I manually click the "Disconnect" button, the wifi works again immediately.

More info

Also, when I check the log, it keeps showing AP-STA-POLL-OK for the two offline clients. This started when they went out of wifi range and continued until I clicked the luci "Disconnect" button.

P.S. OFFLINE:MAC:1 OFFLINE:MAC:2 are clients that went away.

Wed Nov 20 19:33:33 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:1**
Wed Nov 20 19:35:31 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:2**
Wed Nov 20 19:38:44 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:1**
Wed Nov 20 19:40:51 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:2**
Wed Nov 20 19:44:03 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **OFFLINE:MAC:1**
...
Wed Nov 20 20:06:42 2024 daemon.notice hostapd: phy1-ap0: AP-STA-DISCONNECTED **OFFLINE:MAC:1**
Wed Nov 20 20:06:44 2024 daemon.notice hostapd: phy1-ap0: AP-STA-DISCONNECTED **OFFLINE:MAC:2**
Wed Nov 20 20:06:47 2024 daemon.info hostapd: phy1-ap0: STA **OFFLINE:MAC:1** IEEE 802.11: deauthenticated due to local deauth request
Wed Nov 20 20:06:49 2024 daemon.info hostapd: phy1-ap0: STA **OFFLINE:MAC:2** IEEE 802.11: deauthenticated due to local deauth request
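
For anyone wanting to check whether they are hitting the same pattern, here is a minimal sketch that counts AP-STA-POLL-OK events per station. It assumes the hostapd log format shown above, with the MAC as the last field of the line; the function name count_poll_ok is made up for illustration.

```shell
#!/bin/sh
# Count AP-STA-POLL-OK events per station MAC from hostapd log lines on stdin.
# Assumption: the MAC address is the last whitespace-separated field,
# as in the log excerpt above.
count_poll_ok() {
        awk '/AP-STA-POLL-OK/ { n[$NF]++ }
             END { for (m in n) print n[m], m }' | sort -rn
}

# Usage on the router (logread is OpenWrt's log reader):
#   logread | count_poll_ok
```

A station that has physically left but keeps accumulating POLL-OK entries is a candidate for the stuck state described here.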

When I restarted the 5g wifi a few minutes later, there was another suspicious log.

Wed Nov 20 20:13:06 2024 kern.warn kernel: [2135649.716364] Ignoring NSS change in VHT Operating Mode Notification from **OFFLINE:MAC:1** with invalid nss 2
Wed Nov 20 20:13:06 2024 kern.info kernel: [2143605.339316] device phy1-ap0 left promiscuous mode
Wed Nov 20 20:13:06 2024 kern.info kernel: [2143605.354371] br-lan: port 5(phy1-ap0) entered disabled state
Wed Nov 20 20:13:07 2024 daemon.notice wpa_supplicant[1538]: Set new config for phy phy1
Wed Nov 20 20:13:07 2024 daemon.notice hostapd: Set new config for phy phy1: /var/run/hostapd-phy1.conf
Wed Nov 20 20:13:07 2024 daemon.notice hostapd: Reload config for bss 'phy1-ap0' on phy 'phy1'
Wed Nov 20 20:13:07 2024 daemon.notice hostapd: phy1-ap0: AP-STA-DISCONNECTED **AN:ONLINE:CLIENT:MAC:1**
Wed Nov 20 20:13:08 2024 daemon.notice hostapd: Reloaded settings for phy phy1
Wed Nov 20 20:13:08 2024 daemon.notice netifd: Wireless device 'radio1' is now up
Wed Nov 20 20:13:08 2024 daemon.notice netifd: Network device 'phy1-ap0' link is up
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.148600] br-lan: port 5(phy1-ap0) entered blocking state
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.154384] br-lan: port 5(phy1-ap0) entered disabled state
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.160337] device phy1-ap0 entered promiscuous mode
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.165646] br-lan: port 5(phy1-ap0) entered blocking state
Wed Nov 20 20:13:08 2024 kern.info kernel: [2143607.171424] br-lan: port 5(phy1-ap0) entered forwarding state
Wed Nov 20 20:13:09 2024 daemon.info dnsmasq[1]: read /etc/hosts - 12 names
Wed Nov 20 20:13:09 2024 daemon.info dnsmasq[1]: read /tmp/hosts/dhcp.cfg01411c - 4 names
Wed Nov 20 20:13:09 2024 daemon.info dnsmasq-dhcp[1]: read /etc/ethers - 0 addresses
...

Wireless config

cat /etc/config/wireless

config wifi-device 'radio0'
        option type 'mac80211'
        option path 'platform/18000000.wifi'
        option channel '1'
        option band '2g'
        option htmode 'HT20'
        option country 'CN'
        option cell_density '0'

config wifi-iface 'default_radio0'
        option device 'radio0'
        option network 'lan'
        option mode 'ap'
        option ssid 'ssid1'
        option encryption 'psk2+ccmp'
        option key 'WIFIPASSWD'

config wifi-device 'radio1'
        option type 'mac80211'
        option path 'platform/18000000.wifi+1'
        option channel '149'
        option band '5g'
        option htmode 'HE80'
        option country 'CN'
        option cell_density '0'
        option txpower '27'

config wifi-iface 'default_radio1'
        option device 'radio1'
        option network 'lan'
        option mode 'ap'
        option ssid 'ssid2'
        option encryption 'sae-mixed'
        option key 'WIFIPASSWD'

Possibly related:
openwrt/openwrt#14415

I reproduced this bug.

If a client leaves the WiFi coverage, there is some probability (around 10%, at a guess) that the above bug will occur.

It is almost the same as openwrt/openwrt#14415, but it also causes bad wifi performance. (In my case it is extremely bad: < 1kb/s. Other clients can still connect, but the link is only good enough for DHCP to complete; anything else is broken, even ping.)

Log keeps showing AP-STA-POLL-OK after the client left. (p.s. I added option max_inactivity '60'. )

...
Thu Nov 21 09:25:38 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:26:46 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:27:56 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:29:04 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:30:24 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:31:33 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:32:39 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:33:44 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
Thu Nov 21 09:34:51 2024 daemon.notice hostapd: phy1-ap0: AP-STA-POLL-OK **WENT:AWAY:CLIENT:MAC**
...
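
For context, a sketch of where that option would live, assuming it was added to the 5GHz wifi-iface section of the config posted earlier in this thread. OpenWrt passes max_inactivity to hostapd as ap_max_inactivity: after that many seconds without traffic from a station, hostapd polls it, and AP-STA-POLL-OK is logged when the driver reports the poll frame as acknowledged. That would explain why a station that keeps being reported as "acknowledging" is never timed out.

```
config wifi-iface 'default_radio1'
        option device 'radio1'
        # other options unchanged
        option max_inactivity '60'
```

With a 60 second limit, polls a little more than a minute apart are what one would expect, which matches the timestamps above.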

iw shows the client still "associated".

iw dev phy1-ap0 station dump

Station **WENT:AWAY:CLIENT:MAC** (on phy1-ap0)
        inactive time:  46190 ms
        rx bytes:       7315589
        rx packets:     52352
        tx bytes:       66444699
        tx packets:     69473
        tx retries:     6987
        tx failed:      7033
        rx drop misc:   2
        signal:         -95 [-97, -99] dBm
        signal avg:     -91 [-93, -95] dBm
        tx bitrate:     6.0 MBit/s
        tx duration:    83677141 us
        rx bitrate:     6.0 MBit/s
        rx duration:    4720659 us
        last ack signal:-96 dBm
        avg ack signal: -95 dBm
        airtime weight: 256
        authorized:     yes
        authenticated:  yes
        associated:     yes
        preamble:       short
        WMM/WME:        yes
        MFP:            no
        TDLS peer:      no
        DTIM period:    2
        beacon interval:100
        short preamble: yes
        short slot time:yes
        connected time: 8708 seconds
        associated at [boottime]:       2183028.795s
        associated at:  1732143676976 ms
        current time:   1732152384528 ms

p.s. The device above is a smartphone with a Snapdragon FastConnect 6800 (however, I do believe other clients can do the same). It left the wifi range an hour ago and is kilometers away from the wifi.

If I manually click the "Disconnect" button in luci, the wifi works again immediately, (no restart).

I'm using the official, unmodified OpenWrt 23.05.5 image. openwrt/openwrt#14415 seems to be using an OpenWrt fork with a modified driver(?) (I misunderstood; they enabled /sys/module/mt7915e/parameters/wed_enable.).

I did not set the wed_enable.

cat /sys/module/mt7915e/parameters/wed_enable
N
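
A rough way to spot such stations from the command line, as a sketch: it reads `iw ... station dump` output in the format shown above and prints stations whose signal is below a threshold, together with their inactive time. The function name stale_stations and the -90 dBm default are illustrative assumptions, not values recommended by anyone in this thread.

```shell
#!/bin/sh
# Print "MAC signal_dBm inactive_ms" for every station in `iw station dump`
# output (read from stdin) whose "signal:" value is below the threshold.
# Assumption: field layout as in the dump above; -90 is only a default.
stale_stations() {
        awk -v thr="${1:--90}" '
                $1 == "Station"                   { mac = $2 }
                $1 == "inactive" && $2 == "time:" { idle = $3 }
                $1 == "signal:" && ($2 + 0 < thr + 0) { print mac, $2, idle }
        '
}

# Usage on the router:
#   iw dev phy1-ap0 station dump | stale_stations -90
```

Note that a weak signal alone does not prove the client is gone; the combination with "tx failed" climbing and a 6.0 MBit/s rate, as in the dump above, is the stronger hint.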

It makes sense, because the router, as public Wi-Fi, has clients entering and leaving the network all the time.
And I noticed via luci some clients with -9x dBm signal that never disconnect; like in your example, a client that is out of range never disappears.

You can try this patch from mtk

I don't know how to use this

Sorry, my router is my main device, so it is hard for me to experiment with it. But I can provide logs if needed.

@victor186 I feel this is a common bug across all MT7981 devices, but because it happens only occasionally, it is hard to reproduce and to notice.

Maybe we could change the title to make it easier for more users to find?

"MT7981 5GHz occasionally cannot disconnect clients that have left and causes bad performance."

Done

A dirty temp fix. Tested, works for me. Do not know if there is any side effect.

Run this script every minute via cron.

It will "disconnect" all clients with a very low signal strength (these should be the clients that have already left wifi coverage but are still stuck as "associated").

#!/bin/sh

# Signal threshold in dBm; clients weaker than this are disconnected.
thr=-90
# Space-separated AP interfaces; add others if any, e.g. "phy1-ap0 phy1-ap1 phy1-ap2".
wlanlist="phy1-ap0"

disconnect() {
        mac=$1
        wlan=$2
        rssi=$3
        logger -t disconnected-client-killer "disconnecting client at $wlan $mac with $rssi dBm (thr=$thr)"
        # "ban_time" prohibits the client from reassociating for the given number of milliseconds.
        ubus call "hostapd.$wlan" del_client "{\"addr\":\"$mac\", \"reason\":5, \"deauth\":true, \"ban_time\":1000}"
}

for wlan in $wlanlist; do
        # Station lines start with the MAC, followed by the signal in dBm.
        iwinfo "$wlan" assoclist | grep SNR | while read -r mac rssi rest; do
                # Skip lines whose signal field is not a negative number (e.g. "unknown").
                case $rssi in -[0-9]*) ;; *) continue ;; esac
                if [ "$rssi" -lt "$thr" ]; then
                        disconnect "$mac" "$wlan" "$rssi"
                fi
        done
done
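
A sketch of the cron side, assuming the script above was saved as /usr/bin/disconnected-client-killer.sh (a hypothetical path) and made executable with chmod +x. On OpenWrt the root crontab lives in /etc/crontabs/root:

```
# /etc/crontabs/root: run the workaround once per minute
* * * * * /usr/bin/disconnected-client-killer.sh
```

If cron is not already running, it has to be enabled and started with /etc/init.d/cron enable and /etc/init.d/cron start.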

You can try this patch from mtk

This patch definitely does some good. Before, I had an intermittent packet loss indication every minute or less in games; now that's completely fixed with this patch.

I tried this patch, and speed dropped by half with WED inactive.

I don't notice a speed difference with WED enabled.

The client below has left the house, but the MT6000 still sees/tracks it with a -92/-92 RSSI, ugh

[screenshot: 17335763009011246083969200368102]

Using a pretty recent OpenWrt SNAPSHOT, r28242, with:

mt798x-wmac 18000000.wifi: WM Firmware Version: ____000000, Build Time: 20240823160721
mt798x-wmac 18000000.wifi: WA Firmware Version: DEV_000000, Build Time: 20240823160840

Stressing roaming with DAWN and/or disconnecting by walking out of bounds seems to trigger that odd condition.

I might try the cron job workaround, since this is affecting my mesh network: batctl ends up showing nodes with crawling link speeds of 0.3.

Observed similar AP-STA-POLL-OK logs with my Flint 2 on 2.4G WiFi.

I have adapted your solution and started using it to workaround this for my case too.

gist:openwrt-mt76-disconnect-workaround

This version can be added under init / rc scripts since it spawns a subshell on boot that keeps checking for the condition every N seconds.

Another slight change is that there is no need to set a threshold; it instead treats a client as gone if its signal is lower than the noise floor.
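
To illustrate the noise-floor idea (this is not the gist's actual code, just a minimal sketch): it assumes `iwinfo <iface> assoclist` prints station lines of the form "MAC  signal dBm / noise dBm (SNR n)  ...", and the function name below_noise_floor is made up here.

```shell
#!/bin/sh
# Print "MAC signal noise" for stations whose signal is below the noise floor.
# Reads `iwinfo <iface> assoclist` output from stdin.
# Assumption: station lines look like
#   AA:BB:CC:DD:EE:FF  -97 dBm / -92 dBm (SNR -5)  46190 ms ago
# Lines where signal or noise is not given in dBm are skipped.
below_noise_floor() {
        awk '/SNR/ && $3 == "dBm" && $6 == "dBm" && ($2 + 0 < $5 + 0) {
                print $1, $2, $5
        }'
}

# Usage on the router:
#   iwinfo phy1-ap0 assoclist | below_noise_floor
```

The appeal of this variant is that it adapts to the environment instead of relying on a hand-picked dBm value.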

We understand this is just a temporary workaround while we wait for the real solution, and we also wonder whether the MTK reference about losing ACKs on AX chips is related.