list of detected wifi access points is empty
Opened this issue · 45 comments
We often have the following case: when the rpi starts, it correctly connects to the wifi access point (that was previously configured using wifi-connect captive portal), but after a few minutes or a few hours, the wifi-connect is starting (probably because the connection was temporarily lost), and the captive portal shows an empty list of detected wifis.
But we have a lot of wifi access points around, so there must be an issue here as the list should not be empty...
The rpi never connects back to our initial wifi access point, until we manually reboot the device.
Here is my start.sh script:
$ cat /usr/src/app/start.sh
#!/usr/bin/env bash
export DBUS_SYSTEM_BUS_ADDRESS=unix:path=/host/run/dbus/system_bus_socket
while true
do
iwgetid -r
if [ $? -ne 0 ]; then
printf 'Starting WiFi Connect\n'
./wifi-connect --portal-ssid helioslite-tracker --portal-passphrase XXX --activity-timeout 600
fi
sleep 10
done
Here are the logs of NetworkManager service:
https://gist.github.com/bbinet/d0ed2e5465824ef2594589b383cc5f93#file-networkmanager-log
Here are the logs of the container that runs wifi-connect:
https://gist.github.com/bbinet/d0ed2e5465824ef2594589b383cc5f93#file-wifi-connect-log
I'm happy to provide more details as I can connect to the rpi through the wifi-connect access point, so if think some other logs might be helpful to investigate this issue, please tell me.
Also do you know a temporary workaround that I can run to force network manager to reconnect to my initial wifi access point, so that I don't need to hard reboot the device?
Hi, I will check the logs now and let you know if I find anything.
The wifi-connect logs definitely reveal an issue: it should not start the access point when no SSIDs are available.
The empty SSIDs list is another issue that we need to look into. I am thinking about elevating the log level of NetworkManager and wpa_supplicant, and running some diagnostic code that we can try a few things with it. I will do some research tomorrow on this and let you know how that goes.
I had a non-related to WiFi Connect report about RPi 3 and NetworkManager not reporting any networks, so I am definitely interested in figuring out how to workaround this issue without reboots or driver reloads.
Ok, thanks
@bbinet I started implementing this here: https://github.com/resin-io-playground/networkmanager-empty-ssids
It elevates the log levels of NetworkManager and wpa_supplicant, and then starts to check whether the SSID list is empty each 30 seconds. If the list is empty it triggers a rescan and waits 10 seconds to check whether they are available.
I am going to do local tests with it tomorrow and hopefully I will be able to reproduce an empty list. I am thinking to try it using a router and also a hotspot on my phone.
If I am not able to reproduce, I may ask you to run the script on a device that has this issue. But let's see first what will happen with the local testing. I will keep you posted.
Sure, I'm also running your ssids.py
script on my rpi zero w since this morning, so I let you know the results too.
I was able to reproduce zero access points available locally. The RequestScan method brought back the access point list, which is good news. I will check the logs now and do more testing.
I've also been able to reproduce the issue.
Here are the logs of the ssids.py script:
https://gist.github.com/bbinet/e53e33236a9862bb2c70a4c6de4bbf36#file-ssids-log
Here are the logs of NetworkManager service:
https://gist.github.com/bbinet/e53e33236a9862bb2c70a4c6de4bbf36#file-nm-log
Here are the logs of the container that runs wifi-connect:
https://gist.github.com/bbinet/e53e33236a9862bb2c70a4c6de4bbf36#file-container-log
@bbinet that is really useful! In your case the RequestScan call did not bring back the list with SSIDs, which is more like the actual problem, than what I was able to reproduce here.
I still have not got to chance to check the NM logs you attached for the specifics, but I will do that first thing tomorrow and will follow up with you. Thanks again!
@bbinet I did different can on the instrumentation script: if the RequestScan does not work, it now tells NM to unmanage and then manage the device again. If that does not work, it then restarts the NM service. I also added timestamped logging.
Can you please run this again, but this time do not start WiFi Connect together with the script (assuming that the connection profile is defined)? Also you may reboot the device after the application update, so that we get only fresher logs.
If you reproduce something, please attach the logs of wpa_supplicant as well. Thanks!
@majorz but if I don't run WiFi Connect together with the script, I won't be able to log in to get the logs...
Why not run WiFi Connect together with the script?
Actually now on second thought, you may leave it running, it should not be a problem.
@bbinet one thing I forgot to mention is that I changed the implementation to use libnm as a higher level alternative to using dbus directly, so the Dockerfile has changed.
@bbinet I found an issue with this version and applied a fix. Can you please run the latest version instead?
I also made the project to use docker-compose for easier local testing, but the old structure is preserved in the ssids directory.
Traceback (most recent call last):
File "/ssids.py", line 8, in <module>
gi.require_version("NM", "1.0")
File "/usr/lib/python3/dist-packages/gi/__init__.py", line 118, in require_version
raise ValueError('Namespace %s not available' % namespace)
ValueError: Namespace NM not available
(I've installed python3-gi, but it may not be sufficient?)
It will need libnm-dev as well for this to work. I am using a Debian Buster base image, so that it is more recent and compatible with the one running on the host OS.
@majorz the new ssids.py script is now running (still on debian jessie image though).
I hope I'll be able to reproduce the issue this afternoon, as I'm leaving tonight for 5 days off.
Sorry, I meant Debian Stretch (not jessie)
Thanks! I think it should be fine. I am running it locally as well, but not sure whether I will be able to reproduce. I will run it on a few other devices as well later on. Good luck!
I made some adjustments on the diagnostic script, as there is a caching bug with python-gir, which led to incorrect reporting of access point count.
I'm back at work today, so I've updated the ssids.py script, but I have some encoding issues:
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
File "/ssids.py", line 191, in <module>
main()
File "/ssids.py", line 128, in main
set_nm_log_level("trace")
File "/ssids.py", line 29, in set_nm_log_level
debug("Setting NetworkManager log level to", level)
File "/ssids.py", line 24, in debug
LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Setting NetworkManager log level to trace'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
File "/ssids.py", line 191, in <module>
main()
File "/ssids.py", line 129, in main
set_wpa_log_level("msgdump")
File "/ssids.py", line 34, in set_wpa_log_level
debug("Setting wpa_supplicant log level to", level)
File "/ssids.py", line 24, in debug
LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Setting wpa_supplicant log level to msgdump'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
File "/ssids.py", line 191, in <module>
main()
File "/ssids.py", line 131, in main
print_device_info()
File "/ssids.py", line 86, in print_device_info
debug("Device:", device.get_iface())
File "/ssids.py", line 24, in debug
LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Device: wlan0'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
File "/ssids.py", line 191, in <module>
main()
File "/ssids.py", line 131, in main
print_device_info()
File "/ssids.py", line 87, in print_device_info
debug("Driver:", device.get_driver())
File "/ssids.py", line 24, in debug
LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Driver: brcmfmac'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
File "/ssids.py", line 191, in <module>
main()
File "/ssids.py", line 131, in main
print_device_info()
File "/ssids.py", line 88, in print_device_info
debug("Driver version:", device.get_driver_version())
File "/ssids.py", line 24, in debug
LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Driver version: 7.45.41.46'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
File "/ssids.py", line 191, in <module>
main()
File "/ssids.py", line 131, in main
print_device_info()
File "/ssids.py", line 89, in print_device_info
debug("Firmware version:", device.get_firmware_version())
File "/ssids.py", line 24, in debug
LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Firmware version: 01-f8a78378'
Arguments: ()
--- Logging error ---
Traceback (most recent call last):
File "/usr/lib/python3.5/logging/__init__.py", line 983, in emit
stream.write(msg)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2014' in position 24: ordinal not in range(128)
Call stack:
File "/ssids.py", line 191, in <module>
main()
File "/ssids.py", line 134, in main
wait(10)
File "/ssids.py", line 119, in wait
debug("Sleeping", seconds, "seconds...")
File "/ssids.py", line 24, in debug
LOGGER.debug(" ".join([str(arg) for arg in args]))
Message: 'Sleeping 10 seconds...'
Arguments: ()
It seems to be fixed if I set the env variable: PYTHONIOENCODING=utf-8
In my testing here with RPi 3 I also reproduced multiple times a scenario where 0 access points were reported by NetworkManager. However NM detected this and triggered a RequestScan which brought back the list with access points.
I will test today with a RPi 1, since I do not have a zero w, but it has the same wifi chip, so let's see what happens.
I got confused, the RPi 1 does not have a built-in wireless chip. RPi 3 B has the same wireless chip as Zero W.
@majorz my rpi zero w is currently running ssids.py and I'll drop you a message as soon as I've reproduced the issue.
Fingers crossed!
I am thinking about the following modifications in WiFi Connect:
-
Always trigger RequestScan before retreiving the list of access points. Since with the current implementation there is not a way to subscribe for the event where the scan has completed, I will do some experimentation and see how to do it best with sleeping. Later on when we use libnm this will change. I did a few days ago the first steps towards this: https://github.com/resin-io-playground/libnm-rs. libnm is the NetworkManager official library. nmcli is using it and it handles all D-Bus communication itself + adds useful utility functions on top. Another nice thing about it is that it provides GObject introspection, so the Rust bindings will be autogenerated for the most part. The Rust autogeneration is not perfect, so I will have to do manual stuff as well unfortunately, but hopefully that will not be much effort.
-
If a list with access points is not available then restart NetworkManager from inside WiFi Connect (if our findings confirm that this will fix the issue).
@majorz this sounds good.
Also, I was able to reproduce the issue this night, but sadly I had used journalctl
--vacuum-size
and --vacuum-time
option to try to purge old logs yesterday, and it seems journalctl also purged the logs from the night, so you might not find the relevant logs in the files attached below.
I will restart completely my rpi, and run ssids.py script again...
Here are the logs of the ssids.py script:
https://gist.github.com/bbinet/aa1c3260f5756b31f20c37276d61ba27#file-ssids-log
Here are the logs of NetworkManager service:
https://gist.github.com/bbinet/aa1c3260f5756b31f20c37276d61ba27#file-nm-log
Here are the logs of wpa_supplicant service:
https://gist.github.com/bbinet/aa1c3260f5756b31f20c37276d61ba27#file-wpa_supplicant-log
Here are the logs of the container that runs wifi-connect:
https://gist.github.com/bbinet/aa1c3260f5756b31f20c37276d61ba27#file-container-log
Ooops, I left a cleanup()
call in the middle and forgot to remove it. The logs revealed that. I will fix it now.
@bbinet Done! Please run the new version as the previous one did not attempt to unmanaged the device or restart NM. I also added PYTHONIOENCODING in the Dockerfile. And also something in the compose since wifi-connect's dnsmasq needs it (although the last one should not have affected you).
@majorz Thanks, I've just restarted my rpi with the new ssids.py script. Let's wait for the issue to show up again.
Sorry I was not able to reproduce the issue for 10 days...
@bbinet alright, I will include the changes we talked above with invoking RequestScan on start and also restarting NetworkManager once if access points are not available.
@majorz Any news on this issue? I'm still getting empty list of wifi access points from time to time: if I manually restart NetworkManager from the hostos then it works again.
No much progress on this issue yet. I am planning to address it till the end of the month.
I'm still facing this issue from time to time: this is quite painful as we cannot connect to the device until the next reboot (though we have configured a force reboot every night to workaround this issue).
@majorz Any news on this issue and the wifi-connect project in general, as the development seems to have stalled these days?
There is a new ui
branch that should be integrated as soon as I can. And also we will be basing the project on this new Rust binding library which I worked on: https://github.com/resin-io-modules/libnm-rs.
I am going to work on this once we release this bigger update, since I would like to depend on the new library for the scanning notification mechanisms.
Also we are going to integrate WiFi Connect in Resin OS itself. We are figuring the exact details currently.
Ok, thanks for the update, and for your efforts in this project.
I'm still seeing this issue, any idea for a fix?
@andrewjaykeller I left you a comment in the other thread here (about the same issue): #232 (comment)