[BUG] Observation space mismatch throws exception that deletes saved model brain.nn
ZachDolph opened this issue · 5 comments
Description
An observation space mismatch during initialization throws an exception that then deletes the saved model (brain.nn). It's not entirely clear where the observation space issue stems from, but having just lost a few weeks' worth of round-the-clock training, I have to say that this feels like a bug rather than a feature, and deleting the model doesn't seem like the best response to an arbitrary exception. I'd think this should be handled via the "Fix_Services" plugin by trying a restart first or something, because this seems to have happened for no apparent reason right after a reboot triggered by the Fix_Services plugin. At the very least, we should make a copy of the previous brain.nn model save instead of permanently rm'ing it into the ether.
To Reproduce
It's not clear what steps are needed to repro, but in my case (see logs further below):
- After Fix_Services plugin reboots due to pwnagotchi error, watch pwnlog for AI loading phase.
- If any exceptions occur during loading phase, you'll see the log stating the brain.nn model save has been deleted.
- If no exceptions occur, you can try to force it to happen (can't confirm this repro every time) by killing the three services bettercap, pwngrid-peer, pwnagotchi, and then starting them one by one individually.
Expected behavior
Either we need a more robust way of handling exceptions during the init phase, one that doesn't delete important saved model files and start over, or we should make a copy of the brain.nn model file before removing it, in case an unrelated error caused the exception (in my case it was the observation space mismatch).
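The copy-before-delete idea above could look something like the following minimal sketch. It is not pwnagotchi's actual code: the helper name `discard_model_with_backup` and the `.bak` suffix are assumptions chosen purely for illustration.

```python
import os
import shutil


def discard_model_with_backup(model_path, suffix=".bak"):
    """Copy model_path to model_path + suffix, then delete the original.

    Hypothetical helper: the name and the ".bak" suffix are illustrative
    assumptions, not part of pwnagotchi.
    Returns the backup path, or None if there was nothing to discard.
    """
    if not os.path.exists(model_path):
        return None
    backup_path = model_path + suffix
    # copy2 also preserves timestamps, which helps when judging a backup's age.
    shutil.copy2(model_path, backup_path)
    os.remove(model_path)
    return backup_path
```

Note that this overwrites any earlier backup at the same path, so a caller would need to decide whether that is acceptable when the model being discarded is the one that failed to load.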
Logs
21:09:33 [INFO] found monitor interface: wlan0mon
21:09:33 [INFO] supported channels: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 36, 40, 44, 48, 52, 56, 60, 64, 100, 104, 108, 112, 116, 120, 124, 128, 132, 136, 140, 149, 153]
21:09:33 [INFO] handshakes will be collected inside /root/handshakes
21:09:33 [INFO] [bettercap] creating new websocket...
21:09:33 [INFO] [epoch 0] duration=00:00:03 slept_for=00:00:00 blind=0 sad=0 bored=0 inactive=1 active=0 peers=0 tot_bond=0.00 avg_bond=0.00 hops=0 missed=0 deauths=0 assocs=0 handshakes=0 cpu=52% mem=20% temperature=36C reward=-0.2
21:09:36 [INFO] [Fix_Services ip link show wlan0mon]: b'4: wlan0mon: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000\n link/ieee802.11/radiotap brd ff:ff:ff:ff:ff:ff\n'
21:09:36 [INFO] wlan0mon is up.
21:09:36 [INFO] [Fix_Services] Logs look good!
21:09:38 [WARNING] !!! captured new handshake on channel 6, -82 dBm: 74:40:be:75:b8:54 () -> SonarBG [a8:b0:88:48:50:88 (eero inc.)] !!!
21:09:52 [INFO] [AI] creating model ...
21:09:52 [INFO] [AI] loading /root/brain.nn ...
21:09:52 [ERROR] [AI] error while starting AI (Observation spaces do not match: Box(0.0, 1.0, (1, 428), float32) != Box(0.0, 1.0, (1, 503), float32))
(most recent call last):
File "/usr/local/lib/python3.9/dist-packages/pwnagotchi/ai/__init__.py", line 51, in load
a2c.load(config['path'], env)
File "/usr/local/lib/python3.9/dist-packages/stable_baselines3/common/base_class.py", line 716, in load
check_for_correct_spaces(env, data["observation_space"], data["action_space"])
File "/usr/local/lib/python3.9/dist-packages/stable_baselines3/common/utils.py", line 229, in check_for_correct_spaces
raise ValueError(f"Observation spaces do not match: {observation_space} != {env.observation_space}")
Observation spaces do not match: Box(0.0, 1.0, (1, 428), float32) != Box(0.0, 1.0, (1, 503), float32)
21:09:52 [INFO] [AI] Deleting brain and restarting.
21:09:56 [INFO] [Fix_Services] plugin loaded.
21:09:56 [INFO] Logtail plugin loaded.
Environment:
- Pwnagotchi version: 2.5.1
- OS version: Current image 2.5.4
- Type of hardware: Raspberry Pi 3B+
There should be a brain.nn.bak in /root, you can try to restore that manually if it happens and let me know if that works.
import os
import shutil

# Restore the backed-up brain if one exists; otherwise delete it and start fresh.
if os.path.exists("/root/brain.nn.bak"):
    os.system("rm /root/brain.nn")
    os.system("rm /root/brain.json")
    shutil.copy2("/root/brain.nn.bak", "/root/brain.nn")
    shutil.copy2("/root/brain.json.bak", "/root/brain.json")
    os.system("service pwnagotchi restart")
else:
    os.system("rm /root/brain.nn /root/brain.json && service pwnagotchi restart")
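The same restore logic can be written without shelling out to `rm`, which also avoids errors when a file is already missing. This is a sketch only: the function name `restore_brain` and the `root` parameter are hypothetical, and restarting the service is left to the caller.

```python
import os
import shutil


def restore_brain(root="/root"):
    """Restore brain.nn and brain.json from their .bak copies, if present.

    Hypothetical helper for illustration only. Returns True when a
    brain.nn.bak backup was found and restored, False when no backup
    exists and the current brain files were removed instead.
    """
    backup_found = os.path.exists(os.path.join(root, "brain.nn.bak"))
    for name in ("brain.nn", "brain.json"):
        target = os.path.join(root, name)
        backup = target + ".bak"
        if backup_found and os.path.exists(backup):
            # copy2 overwrites the target, so no separate delete is needed.
            shutil.copy2(backup, target)
        elif os.path.exists(target):
            os.remove(target)
    return backup_found
```

Taking the directory as a parameter makes it possible to try the function against a scratch directory before pointing it at /root.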
@ZachDolph have you tried replacing the brain.nn with the brain.nn.bak file? I can implement these changes, hoping it helps.
Ah my bad, I actually hadn't noticed that the backup existed, and that it had fixed itself using the backup. I can confirm this by the number of epochs shown in brain.json. I appreciate your help! I'll close this one out. Thanks for your work on the project by the way.
Okay, so no need to "fix" anything?