mirage/qubes-mirage-firewall

Firewall crashes when starting AppVMs

talex5 opened this issue · 7 comments

This looks the same as the problem in #96 (comment), but this time reproduced using the final 0.7 release binary (hash 4f4456b5fe7c8ae1ba2f6934cf89749cf6aae9a90cce899cf744c89d311467a3) and with a 64MB memory allocation. I haven't seen this happen before #96 was merged, and it has now happened twice in three days.

xl dmesg shows:

(XEN) mwait-idle: max C-state count of 8 reached
(XEN) Pagetable walk from 0000000000242f58:
(XEN)  L4[0x000] = 0000000026809067 0000000000000409
(XEN)  L3[0x000] = 000000002680a067 000000000000040a
(XEN)  L2[0x001] = 000000002680c067 000000000000040c 
(XEN)  L1[0x042] = 0010000027042025 0000000000000242
(XEN) domain_crash_sync called from entry.S: fault at ffff82d08020fc3e entry.o#create_bounce_frame+0x67/0x142
(XEN) Domain 8 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.8.5-16.fc25  x86_64  debug=n   Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<0000000000172d18>]
(XEN) RFLAGS: 0000000000010202   EM: 1   CONTEXT: pv guest (d8v0)
(XEN) rax: 0000000000000000   rbx: 000000000039f2a9   rcx: 0000000000000010
(XEN) rdx: 0000000000000000   rsi: 000000000016aca8   rdi: 00000000001fb341
(XEN) rbp: 00000000001fb317   rsp: 0000000000242f68   r8:  00000000ffffffff
(XEN) r9:  00000000ffffffff   r10: 0000000000000000   r11: 0000000000000440
(XEN) r12: 0000000000243088   r13: 000000000039f69f   r14: 00000000002432c0
(XEN) r15: 0000000000381780   cr0: 0000000080050033   cr4: 00000000003426e0
(XEN) cr3: 0000000026808000   cr2: 0000000000242f90
(XEN) fsb: 0000000000000000   gsb: 000000000039f6b0   gss: 0000000000000000
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=0000000000242f68:
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 0000000000000000 0000000000000bf8
(XEN)    0000000000243040 ffffffffffffffff 000000000039f2a9 0000000000175fb4
(XEN)    02000079726f6d65 0000000300000bf8 0000000000243070 000000000039f2a9
(XEN)    00000000001fb34c 00000000001f5938 000000000039f69f 000000000017627f
(XEN)    0000000000381780 00000000002432c0 000000000039f69f 00000000001f5938
(XEN)    00000000001fb34c 000000000039f2a9 0000000000000440 0000000000000000
(XEN)    00000000ffffffff 00000000ffffffff 0000000000000000 0000000000000010
(XEN)    000000000016aca8 000000000039f69f 000000000039f2a9 ffffffffffffffff
(XEN)    000000000016aca8 000000010000e030 0000000000010012 0000000000243138
(XEN)    000000000000e02b 6e6f697369766944 6f72657a5f79625f 0700000000000000
(XEN)    0000000000000bf8 0000000000243168 fffffffffffffff3 0000000000000bfc
(XEN)    6e756f665f746f4e 0600000000000064 000000007a797877 0000000000243198
(XEN)    fffffffffffffff1 0000000000000bfc 61665f686374614d 0200006572756c69
(XEN)    0000000000000bf8 00000000002431c8 ffffffffffffffef 0000000000000bfc
(XEN)    766f5f6b63617453 0100776f6c667265 0000000000000bf8 00000000002431f8

The last thing shown in the guest log was:

2020-05-19 23:41:17 -00:00: INF [client_net] add client vif {domid=10;device_id=0} with IP 10.138.3.46
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/mapped-ip/10.138.3.46/visible-ip" = "10.138.3.46"
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/mapped-ip/10.138.3.46/visible-gateway" = "10.137.0.16"
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.138.3.46/0000" = "action=accept"
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.138.3.46/policy" = "drop"
2020-05-19 23:41:17 -00:00: INF [qubes.db] got update: "/qubes-firewall/10.138.3.46" = ""
2020-05-19 23:41:20 -00:00: INF [client_net] Client 10 (IP: 10.138.3.46) ready
2020-05-19 23:41:20 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
linse commented

Thanks for reporting this, I'd love to learn more about it.
Is there a way we can reproduce this bug? Can you log incoming network packets before the crash?

I added some logging around setting up new clients after the last crash, but it hasn't happened again since then :-/

OK, it crashed again. Here's the debugging I added:

https://github.com/mirage/qubes-mirage-firewall/compare/master...talex5:debug?expand=1

This was the last thing in the log:

2020-05-26 03:12:40 -00:00: INF [client_net] Client 9 (IP: 10.137.0.12) ready
2020-05-26 03:12:40 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
2020-05-26 03:12:40 -00:00: INF [client_net] Running qubesdb_updater thread...
2020-05-26 03:12:40 -00:00: INF [client_net] Getting rules...
2020-05-26 03:12:40 -00:00: INF [client_net] New firewall rules for 10.137.0.12
0   any   accept

In particular, it didn't get as far as Router.add_client.

I added some more debugging:

let remove_connections t ports ip =
  Log.info (fun f -> f "remove_connections: enter");
  let freed_ports = Nat.remove_connections t.table ip in
  Log.info (fun f -> f "tcp");
  (* ... rest of the function unchanged ... *)

Now after it crashed the log ended with:

2020-05-29 04:45:31 -00:00: INF [client_net] Client 7 (IP: 10.137.0.8) ready
2020-05-29 04:45:31 -00:00: INF [ethernet] Connected Ethernet interface fe:ff:ff:ff:ff:ff
2020-05-29 04:45:31 -00:00: INF [client_net] Running qubesdb_updater thread...
2020-05-29 04:45:31 -00:00: INF [client_net] calling got_new_commit
2020-05-29 04:45:31 -00:00: INF [client_net] setting rules
2020-05-29 04:45:31 -00:00: INF [client_net] Getting rules...
2020-05-29 04:45:31 -00:00: INF [client_net] New firewall rules for 10.137.0.8
0   any   accept
2020-05-29 04:45:31 -00:00: INF [client_net] remove_connections
2020-05-29 04:45:31 -00:00: INF [my-nat] remove_connections: enter

So I guess the problem is in Nat.remove_connections. Perhaps this uses a lot of stack when there are many connections?
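For what it's worth, a deep non-tail-recursive traversal is a classic way to exhaust the stack in a unikernel, where the guest stack is much smaller than under a regular OS. A minimal sketch of the failure mode and the usual fix (this is illustrative only, not the actual mirage-nat code):

```ocaml
(* Illustration only: a traversal like this is NOT tail-recursive.
   The "1 + ..." happens after the recursive call returns, so each
   entry adds a stack frame; a big enough table overflows the stack. *)
let rec count_entries = function
  | [] -> 0
  | _entry :: rest -> 1 + count_entries rest

(* The usual fix: thread an accumulator through a tail call, so stack
   usage stays constant no matter how many entries there are. *)
let count_entries_tailrec entries =
  let rec go acc = function
    | [] -> acc
    | _entry :: rest -> go (acc + 1) rest
  in
  go 0 entries
```

If remove_connections walks the whole connection table this way, a client with many live connections would hit exactly this kind of crash.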

linse commented

Hi @talex5, thanks for the detailed log output, very helpful! We looked into remove_connections and were able to reduce its stack usage.
Can you try pinning git+http://github.com/mirage/mirage-nat.git#no-stack-overflow and see if it fixes your problem?

OK, testing now with RUN opam pin -n mirage-nat 'https://github.com/mirage/mirage-nat.git#no-stack-overflow' added to the Dockerfile.

Seems to be working fine now - thanks!