Firewall crashes when starting AppVMs
talex5 opened this issue · 7 comments
This looks the same as the problem in #96 (comment), but this time reproduced using the final 0.7 release binary (hash 4f4456b5fe7c8ae1ba2f6934cf89749cf6aae9a90cce899cf744c89d311467a3) and with a 64MB memory allocation. I haven't seen this happen before #96 was merged, and it has now happened twice in three days.
xl dmesg
shows:
(XEN) mwait-idle: max C-state count of 8 reached
(XEN) Pagetable walk from 0000000000242f58:
(XEN) L4[0x000] = 0000000026809067 0000000000000409
(XEN) L3[0x000] = 000000002680a067 000000000000040a
(XEN) L2[0x001] = 000000002680c067 000000000000040c
(XEN) L1[0x042] = 0010000027042025 0000000000000242
(XEN) domain_crash_sync called from entry.S: fault at ffff82d08020fc3e entry.o#create_bounce_frame+0x67/0x142
(XEN) Domain 8 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.8.5-16.fc25 x86_64 debug=n Not tainted ]----
(XEN) CPU: 0
(XEN) RIP: e033:[<0000000000172d18>]
(XEN) RFLAGS: 0000000000010202 EM: 1 CONTEXT: pv guest (d8v0)
(XEN) rax: 0000000000000000 rbx: 000000000039f2a9 rcx: 0000000000000010
(XEN) rdx: 0000000000000000 rsi: 000000000016aca8 rdi: 00000000001fb341
(XEN) rbp: 00000000001fb317 rsp: 0000000000242f68 r8: 00000000ffffffff
(XEN) r9: 00000000ffffffff r10: 0000000000000000 r11: 0000000000000440
(XEN) r12: 0000000000243088 r13: 000000000039f69f r14: 00000000002432c0
(XEN) r15: 0000000000381780 cr0: 0000000080050033 cr4: 00000000003426e0
(XEN) cr3: 0000000026808000 cr2: 0000000000242f90
(XEN) fsb: 0000000000000000 gsb: 000000000039f6b0 gss: 0000000000000000
(XEN) ds: 0000 es: 0000 fs: 0000 gs: 0000 ss: e02b cs: e033
(XEN) Guest stack trace from rsp=0000000000242f68:
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN) 0000000000000000 0000000000000000 0000000000000000 0000000000000bf8
(XEN) 0000000000243040 ffffffffffffffff 000000000039f2a9 0000000000175fb4
(XEN) 02000079726f6d65 0000000300000bf8 0000000000243070 000000000039f2a9
(XEN) 00000000001fb34c 00000000001f5938 000000000039f69f 000000000017627f
(XEN) 0000000000381780 00000000002432c0 000000000039f69f 00000000001f5938
(XEN) 00000000001fb34c 000000000039f2a9 0000000000000440 0000000000000000
(XEN) 00000000ffffffff 00000000ffffffff 0000000000000000 0000000000000010
(XEN) 000000000016aca8 000000000039f69f 000000000039f2a9 ffffffffffffffff
(XEN) 000000000016aca8 000000010000e030 0000000000010012 0000000000243138
(XEN) 000000000000e02b 6e6f697369766944 6f72657a5f79625f 0700000000000000
(XEN) 0000000000000bf8 0000000000243168 fffffffffffffff3 0000000000000bfc
(XEN) 6e756f665f746f4e 0600000000000064 000000007a797877 0000000000243198
(XEN) fffffffffffffff1 0000000000000bfc 61665f686374614d 0200006572756c69
(XEN) 0000000000000bf8 00000000002431c8 ffffffffffffffef 0000000000000bfc
(XEN) 766f5f6b63617453 0100776f6c667265 0000000000000bf8 00000000002431f8
The last thing shown in the guest log was:
2020-05-19 23:41:17 -00:00: INF [client_net] add client vif {domid=10;device_id=0} with IP 10.138.3.46
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/mapped-ip/10.138.3.46/visible-ip" = "10.138.3.46"
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/mapped-ip/10.138.3.46/visible-gateway" = "10.137.0.16"
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.138.3.46/0000" = "action=accept"
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.138.3.46/policy" = "drop"
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.138.3.46" = ""
2020-05-19 23:41:20 -00:00: INF [client_net] Client 10 (IP: 10.138.3.46) ready
2020-05-19 23:41:20 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
Thanks for reporting this, I'd love to learn more about it.
Is there a way we can reproduce this bug? Can you log incoming network packages before the crash?
I added some logging around setting up new clients after the last crash, but it hasn't happened again since then :-/
OK, it crashed again. Here's the debugging I added:
https://github.com/mirage/qubes-mirage-firewall/compare/master...talex5:debug?expand=1
This was the last thing in the log:
2020-05-26 03:12:40 -00:00: INF [client_net] Client 9 (IP: 10.137.0.12) ready
2020-05-26 03:12:40 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
2020-05-26 03:12:40 -00:00: INF [client_net] Running qubesdb_updater thread...
2020-05-26 03:12:40 -00:00: INF [client_net] Getting rules...
2020-05-26 03:12:40 -00:00: INF [client_net] New firewall rules for 10.137.0.12
0 any accept
In particular, it didn't get as far as Router.add_client
.
I added some more debugging:
let remove_connections t ports ip =
Log.info (fun f -> f "remove_connections: enter");
let freed_ports = Nat.remove_connections t.table ip in
Log.info (fun f -> f "tcp");
Now after it crashed the log ended with:
2020-05-29 04:45:31 -00:00: INF [client_net] Client 7 (IP: 10.137.0.8) ready
2020-05-29 04:45:31 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
2020-05-29 04:45:31 -00:00: INF [client_net] Running qubesdb_updater thread...
2020-05-29 04:45:31 -00:00: INF [client_net] calling got_new_commit
2020-05-29 04:45:31 -00:00: INF [client_net] setting rules
2020-05-29 04:45:31 -00:00: INF [client_net] Getting rules...
2020-05-29 04:45:31 -00:00: INF [client_net] New firewall rules for 10.137.0.8
0 any accept
2020-05-29 04:45:31 -00:00: INF [client_net] remove_connections
2020-05-29 04:45:31 -00:00: INF [my-nat] remove_connections: enter
So I guess the problem is in Nat.remove_connections
. Perhaps this uses a lot of stack when there are many connections?
Hi @talex5 , thanks for the detailed log output, very helpful! We looked into remove_connections
and were able to reduce the stack size.
Can you try pinning git+http://github.com/mirage/mirage-nat.git#no-stack-overflow
and see if it fixes your problem?
OK, testing now with RUN opam pin -n mirage-nat 'https://github.com/mirage/mirage-nat.git#no-stack-overflow'
added to the Dockerfile
.
Seems to be working fine now - thanks!