firecracker-microvm/firecracker

Can't resume vm that runs docker in it

BasToTheMax opened this issue · 3 comments

Hello.

I am having issues while trying to resume my vm.

What I am doing:

  • Create vmA
  • Make a Full snapshot of vmA
  • Destroy vmA
  • Start vmB by loading the snapshot (with diff snapshots enabled)
  • Create a Diff snapshot
  • Destroy vmB
  • Merge the Full and the Diff snapshot. (command below)
  • Start vmC by loading the newly created (merged) snapshot.
  • Firecracker crashes with the error below:
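
The snapshot-creation steps above can be sketched as Firecracker API calls. This is a minimal sketch, not the exact commands used here: the socket path `/tmp/firecracker.socket`, the file name `./snap/snap2`, and the helper names `fc_api`, `snapshot_create_body` and `take_snapshot` are assumptions made for illustration.

```shell
# Sketch only: socket path and helper names are assumptions, not from the report.
API_SOCKET="${API_SOCKET:-/tmp/firecracker.socket}"

# Send a JSON request to the Firecracker API over its Unix socket.
fc_api() {
    local method="$1" path="$2" body="$3"
    curl --silent --show-error --unix-socket "$API_SOCKET" \
        -X "$method" "http://localhost${path}" \
        -H 'Content-Type: application/json' \
        -d "$body"
}

# Build the body for PUT /snapshot/create; type is "Full" or "Diff".
snapshot_create_body() {
    local type="$1" snapshot_path="$2" mem_path="$3"
    printf '{"snapshot_type": "%s", "snapshot_path": "%s", "mem_file_path": "%s"}' \
        "$type" "$snapshot_path" "$mem_path"
}

# Pause the VM, then write a snapshot of the given type.
take_snapshot() {
    fc_api PATCH /vm '{"state": "Paused"}' &&
    fc_api PUT /snapshot/create "$(snapshot_create_body "$@")"
}
```

With these helpers, the Full snapshot of vmA would be `take_snapshot Full ./snap/snap1 ./snap/mem1` and the Diff snapshot of vmB would be `take_snapshot Diff ./snap/snap2 ./snap/mem2`.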

Please note: I am running Docker in my VM.

Snapshot command:

./snapshot-editor edit-memory rebase \
     --memory-path ./snap/mem1 \
     --diff-path ./snap/mem2
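
A pre-flight check around the rebase could look like the sketch below. It rests on two assumptions that are not confirmed in this thread: that the diff memory file has the same apparent size as the full one, and that the rebase writes into the base file, so a backup is worth keeping. The helper names `file_size`, `same_size` and `safe_rebase` are hypothetical, and `cp --sparse=always` is a GNU coreutils option.

```shell
# Sketch only: helper names are hypothetical; paths are passed in by the caller.

# Print a file's apparent size in bytes.
file_size() {
    wc -c < "$1" | tr -d '[:space:]'
}

# Succeed only if both files exist and have the same apparent size.
same_size() {
    [ -f "$1" ] && [ -f "$2" ] && [ "$(file_size "$1")" = "$(file_size "$2")" ]
}

# Back up the base memory file, then rebase only if the sizes match.
safe_rebase() {
    local base="$1" diff="$2"
    same_size "$base" "$diff" || { echo "size mismatch: $base vs $diff" >&2; return 1; }
    cp --sparse=always "$base" "$base.bak" &&
    ./snapshot-editor edit-memory rebase --memory-path "$base" --diff-path "$diff"
}
```

For the files in this report the call would be `safe_rebase ./snap/mem1 ./snap/mem2`.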

Firecracker logs:

2023-12-25T17:24:28.369578016 [anonymous-instance:fc_api:INFO:src/api_server/src/parsed_request.rs:163] The request was executed successfully. Status code: 204 No Content.
2023-12-25T17:24:28.387220678 [anonymous-instance:fc_api:INFO:src/api_server/src/parsed_request.rs:70] The API server received a Put request on "/snapshot/load" with body "{\n            \"snapshot_path\": \"./snap/snap1\",\n            \"mem_file_path\": \"./snap/mem1\",\n            \"enable_diff_snapshots\": true,\n            \"resume_vm\": true\n    }".
2023-12-25T17:24:28.387596935 [anonymous-instance:main:WARN:src/vmm/src/logger/mod.rs:33] [DevPreview] Virtual machine snapshots is in development preview.
2023-12-25T17:24:28.387873552 [anonymous-instance:main:INFO:src/vmm/src/persist.rs:314] Host CPU vendor ID: [71, 101, 110, 117, 105, 110, 101, 73, 110, 116, 101, 108]
2023-12-25T17:24:28.387891803 [anonymous-instance:main:INFO:src/vmm/src/persist.rs:315] Snapshot CPU vendor ID: [71, 101, 110, 117, 105, 110, 101, 73, 110, 116, 101, 108]
2023-12-25T17:24:28.413620267 [anonymous-instance:main:ERROR:src/vmm/src/devices/virtio/queue.rs:296] virtio queue number of available descriptors 4097 is greater than queue max size 256
2023-12-25T17:24:28.413716156 [anonymous-instance:main:INFO:src/vmm/src/lib.rs:818] Vmm is stopping.
2023-12-25T17:24:28.481140691 [anonymous-instance:fc_api:ERROR:src/api_server/src/parsed_request.rs:190] Received Error. Status code: 400 Bad Request. Message: Load snapshot error: Failed to restore from snapshot: Failed to build microVM from snapshot: Failed to restore MMIO device: Cannot restore devices: VirtioBlock(Persist(InvalidInput))
2023-12-25T17:24:28.481173674 [anonymous-instance:fc_api:WARN:src/api_server/src/lib.rs:139] PUT /snapshot/load: mem_file_path field is deprecated.
2023-12-25T17:24:28.481367990 [anonymous-instance:main:ERROR:src/firecracker/src/main.rs:94] RunWithApiError error: Failed to build MicroVM: Loading snapshot failed..
2023-12-25T17:24:28.481410903 [anonymous-instance:main:ERROR:src/firecracker/src/main.rs:97] Firecracker exiting with error. exit_code=1
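
The log warns that `mem_file_path` is deprecated for `/snapshot/load`. A load request using the `mem_backend` object instead might look like the sketch below. It addresses only the deprecation warning, not the restore error. The socket path and the helper names `snapshot_load_body` and `load_snapshot` are assumptions for illustration.

```shell
# Sketch only: socket path and helper names are assumptions, not from the report.

# Build a /snapshot/load body that uses mem_backend instead of mem_file_path.
snapshot_load_body() {
    local snapshot_path="$1" mem_path="$2"
    printf '{"snapshot_path": "%s", "mem_backend": {"backend_type": "File", "backend_path": "%s"}, "enable_diff_snapshots": true, "resume_vm": true}' \
        "$snapshot_path" "$mem_path"
}

# Send the load request over the API socket.
load_snapshot() {
    curl --silent --show-error \
        --unix-socket "${API_SOCKET:-/tmp/firecracker.socket}" \
        -X PUT 'http://localhost/snapshot/load' \
        -H 'Content-Type: application/json' \
        -d "$(snapshot_load_body "$@")"
}
```

The equivalent of the request in the log above would be `load_snapshot ./snap/snap1 ./snap/mem1`.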

Host kernel: Linux bttm 6.2.0-39-generic #40~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Nov 16 10:53:04 UTC 2 x86_64 x86_64 x86_64 GNU/Linux (uname -a)
Guest kernel: vmlinux-5.10.bin

Download script used:

# Host architecture, e.g. x86_64 or aarch64
ARCH="$(uname -m)"

# Resolve the tag of the latest Firecracker release
release_url="https://github.com/firecracker-microvm/firecracker/releases"
latest="$(basename "$(curl -fsSLI -o /dev/null -w '%{url_effective}' "${release_url}/latest")")"

# Download and unpack the release archive
curl -L "${release_url}/download/${latest}/firecracker-${latest}-${ARCH}.tgz" \
| tar -xz

mv "release-${latest}-${ARCH}/firecracker-${latest}-${ARCH}" firecracker
mv "release-${latest}-${ARCH}/snapshot-editor-${latest}-${ARCH}" snapshot-editor

rm -r "release-${latest}-${ARCH}"

# Fetch the guest kernel and rename it
wget "https://s3.amazonaws.com/spec.ccfc.min/img/quickstart_guide/${ARCH}/kernels/vmlinux-5.10.bin"
mv vmlinux-5.10.bin kernel

chmod +x ./firecracker
chmod +x ./snapshot-editor

To give more context:

  • The guest is running Debian 12
  • The rootfs is built using Docker
  • The guest is running Docker (so Docker inside the microVM)
  • Docker in the guest runs a container (a Minecraft server, to be exact)

If you need more details, feel free to ask 😉.

I hope someone can help me fix this issue. I will probably also ask on the Slack server.

Originally posted by @BasToTheMax in #2888 (comment)

I'm currently on vacation and won't be able to do tests.

kalyazin commented

Hi @BasToTheMax ! Thanks for reporting the issue.

From our initial analysis, it looks like the block device fails to restore, because the device layout in memory is not correct.

Could you provide a reproducible test that demonstrates the issue, including the following if possible:

  • (a link to) the rootfs that is used
  • which API calls (or JSON config) are used to configure and boot the VM
  • which actions are performed inside the VM before taking snapshots

Alternatively, we have a test that exercises differential snapshots. You could modify it so that it is closer to your setup and see whether it starts failing (see the testing readme).

Additionally, is running Docker inside the VM an essential part of the reproduction steps? Does the same sequence succeed without Docker in the guest?

pb8o commented

Hi @BasToTheMax, were you able to solve your issue? If not, can you provide the series of commands requested in @kalyazin's comment?