[BUG] Uptime stops updating and APs stop scanning once bettercap fails
matrix224 opened this issue · 1 comments
I want to preface this by saying I literally just started working with this a day ago. I by no means fully understand it, but felt it may have been worthwhile to report here in case it's unknown, or at the very least, someone who has more knowledge on this can see it and do what may be appropriate here.
Describe the bug
After the pwnagotchi has been running for a little while (time can vary greatly: sometimes 5 minutes, other times 15 - 30 minutes, usually the latter), the uptime on the top right stops changing.
It also appeared that once this happens, the unit stopped playing with the APs around me. I understand eventually it stops once there's no new ones around, but it seemed to be correlated so that once the uptime stopped, so did the actual AP scanning.
From my testing, I believe this is related to something going wrong with bettercap, resulting in certain processing tasks to stop functioning.
To Reproduce
Steps to reproduce the behavior (I apologize I have no real specific steps):
- Turn the pwnagotchi on and let it do its thing in auto mode
- Eventually (typically 15+ minutes) after sitting in the same spot, it stops updating the uptime and also seems to stop scanning APs once this happens
Expected behavior
I would expect the uptime to keep updating and the unit to keep scanning until it has truly exhausted itself off the current available APs.
Screenshots
No screenshots, but see additional context below for more info
Environment:
- Aluminum-ice 1.8.5
- Pi Zero 2
Additional context
Once I looked into this a little more, I started to play around with it to see if I could find what's going on (if this was intended, if I configured it wrong, etc).
I determined that the UI is getting updated by agent.py
, so I started by added logging in agent.py
in the _update_uptime
method to just print every time it fires. I noticed that once the uptime stopped updating on the unit, that logging was no longer getting printed as well. Right after the logging stopped, there were a bunch of log messages relating to bettercap not being able to connect:
[2024-05-01 18:11:40,841] [WARNING] ping error - retrying connection in 2.892809119803989 sec
[2024-05-01 18:11:43,747] [INFO] creating new websocket...
[2024-05-01 18:11:43,754] [WARNING] connection to the bettercap endpoint failed...
[2024-05-01 18:11:43,755] [WARNING] retrying connection in 0.9430889215199979 sec
[2024-05-01 18:11:44,702] [INFO] creating new websocket...
[2024-05-01 18:11:44,709] [WARNING] connection to the bettercap endpoint failed...
[2024-05-01 18:11:44,710] [WARNING] retrying connection in 1.8519713382987848 sec
[2024-05-01 18:11:46,565] [INFO] creating new websocket...
[2024-05-01 18:11:46,574] [WARNING] connection to the bettercap endpoint failed...
[2024-05-01 18:11:46,575] [WARNING] retrying connection in 3.6335060342079784 sec
[2024-05-01 18:11:50,214] [INFO] creating new websocket...
[2024-05-01 18:11:50,222] [WARNING] connection to the bettercap endpoint failed...
[2024-05-01 18:11:50,223] [WARNING] retrying connection in 3.958611552908133 sec
[2024-05-01 18:11:52,079] [INFO] CHANNEL 3
[2024-05-01 18:11:52,094] [WARNING] can't run my request... connection to the bettercap endpoint failed...
This went on for a little bit and once those messages stopped, all I saw were the AI log messages about new epochs happening and it adjusting the settings, but nothing at all was actually happening between those messages being printed. It was just cycling epochs without doing anything else.
The _update_uptime
method gets called alongside a few other methods in a separate thread in a while loop every 1 second (_fetch_stats
method), so my hunch was that maybe whatever happens with bettercap throws an exception and ends up killing the thread.
I then added a try-catch around the method calls in _fetch_stats
and let it go again. About 30 minutes in, I noticed the following in my log:
[2024-05-01 19:34:36,310] [WARNING] ping error - retrying connection in 4.140842904161325 sec
[2024-05-01 19:34:37,266] [ERROR] FETCH_STATS ERR: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /api/session (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x470d9210>: Failed to establish a new connection: [Errno 111] Connection refused'))
[2024-05-01 19:34:56,524] [INFO] creating new websocket...
[2024-05-01 19:34:56,532] [WARNING] connection to the bettercap endpoint failed...
[2024-05-01 19:34:56,533] [WARNING] retrying connection in 2.3906160655753106 sec
[2024-05-01 19:34:56,573] [ERROR] FETCH_STATS ERR: HTTPConnectionPool(host='localhost', port=8081): Max retries exceeded with url: /api/session (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x41a3c9f0>: Failed to establish a new connection: [Errno 111] Connection refused'))
The "FETCH_STATS ERR" lines are mine of course.
This went on for a little bit (15 - 20 seconds or so) where it printed these errors, but then seemed to stop as it recovered and was able to connect to bettercap again. Once that happened, my logging from _update_uptime
resumed, and the uptime began updating once again.
I believe the actual error in this scenario originates from the call to session
in the _fetch_stats
method.
Update:
While testing more with my changes mentioned above, I've noticed that it always recovers as expected for uptime, however AP scanning seems to still stop after bettercap dies for a little bit. I'm unsure exactly what is going on there as it requires more digging, but figured it would be worth mentioning since it seems like it's in the same vein and was related to my original issue. I had originally thought the change I made above fixed this as well, but that is not the case
After a few days of playing around with this more, I've come to a few more conclusions which I'll summarize:
- My initial code change I mentioned above still works as expected for keeping the uptime working
- The code change does not fix the issue of the device going blind after bettercap dies
- I had initially thought that it continued scanning and then rebooted after it exhausted all APs, but that was not the case; it had just gone blind, and I was too new to this to understand that
- Bettercap seems to crash randomly and somewhat often, which I noted by checking
/var/log/syslog
- The most common occurrence seems to be instances of
fatal error: concurrent map iteration and map write
, which from what I've gathered is kind of an infamous issue that's been around forever. There are other instances of panics (e.g.panic: runtime error: index out of range [12] with length 12 [recovered]
) as well, which also seems par for the course from what I've read - I am not familiar with Go, so I haven't attempted to actually look into these within bettercap
- The most common occurrence seems to be instances of
- The service gets restarted after 30s once it crashes thanks to
/etc/systemd/system/bettercap.service
With all of this, I took another crack at seeing if I could get the device to recover from going blind after bettercap dies, and so far it has seemed promising. Before I made my latest changes, my pwnagotchi would only last around an hour or so (2 hours tops if I was lucky) before resetting due to being blind for 50 epochs. Currently though, my pwnagotchi has been running over 14 hours and is still hitting APs it comes across, though after sitting in the same spot for this long, it truly is starting to run out of APs to hit now. This is after bettercap has died and restarted numerous times as well, with only two or three instances of the AI being considered blind during this time; and even then, it was only for 1 epoch before it came back and was scanning again.
I forked the project and put my changes in there to make it easier to see what I actually did: here
The only file I changed was agent.py
To be clear, I still wouldn't consider myself fully attuned to this project and all the caveats it has. My hope is that if there's any actual substance / usefulness to this change, it can be used as-is or as a basis for making similar changes to get things fixed properly. Alternatively, maybe I'm just completely wrong and this works out of pure luck or something, and someone can tell me so.
To summarize those changes:
- I put a try/catch around the call to
session
in_fetch_stats
to prevent that thread from getting killed when bettercap dies - If
session
fails, it assumes bettercap had an issue and sets a boolean there, and doesn't proceed to call the other update methods (e.g._update_uptime
) - Once bettercap recovers, it goes into the else and checks if bettercap had previously failed and calls the update methods
- If bettercap did fail before, then it will call
setup_events
andstart_monitor_mode
again - My line of thinking here was that these methods set various parameters through the
run
command, which as far as I know, is basically telling the bettercap service to set those values in it. If the service has crashed and restarted though, those parameters probably need to be set again, right? (or at a minimum, I would expect the values that get set insetup_events
and some of the values from_reset_wifi_settings
[especiallywifi.interface
], which gets called throughstart_monitor_mode
, to be required for it to work properly)
- If bettercap did fail before, then it will call
Outside of this change, I only made one other change to my system, which I just did for preference; I don't think it has any bearing on the actual code change.
I updated /etc/systemd/system/bettercap.service
so the restart time was 10s instead of 30s