CRaC/docs

lambda checkpoint not working from docker macOS

I am following the Lambda example at https://github.com/CRaC/example-lambda to create a Lambda checkpoint, running the steps exactly as described inside a Docker container (Ubuntu 20.04) on macOS (M1 ARM chip). However, it doesn't work. My steps are:

  1. Via a docker sock I run docker run --privileged --platform=linux/amd64 --rm -it -v /var/run/docker.sock:/var/run/docker.sock -v $(pwd):/$(pwd) -w $(pwd) teracy/ubuntu:20.04-dind-20.10.13 bash
  2. Once I am in the container I run ./crac-steps.sh s00_init
  3. I download the CRaC JDK in the container as follows

CRAC_VERSION=17-crac+6
curl -LO https://github.com/CRaC/openjdk-builds/releases/download/$CRAC_VERSION/openjdk-"$CRAC_VERSION"_linux-x64.tar.gz
tar axf openjdk-"$CRAC_VERSION"_linux-x64.tar.gz

  4. Then I run ./crac-steps.sh dojlink openjdk-17-crac+6_linux-x64, which extracts the jdk folder fine
  5. ./crac-steps.sh s01_build (works fine)
  6. Start the container via ./crac-steps.sh s02_start_checkpoint (works fine)
  7. But when I run the checkpoint via ./crac-steps.sh s03_checkpoint I get

dump.log

the command root@2ee49f701218:/tmp/sub/jdk/lib# ./criu check --all produced

Warn (criu/kerndat.c:1349): Can't get pidfd
Warn (criu/kerndat.c:1466): CRIU was built without libnftables support
Error (criu/util.c:705): read: Success
Warn (criu/cr-check.c:813): Dirty tracking is OFF. Memory snapshot will not work.
Warn (criu/cr-check.c:1242): Do not have API to map vDSO - will use mremap() to restore vDSO
Error (criu/cr-check.c:1208): UFFD is not supported
Error (criu/cr-check.c:1208): UFFD is not supported
Warn (criu/cr-check.c:1231): clone3() with set_tid not supported
Error (criu/cr-check.c:1273): Time namespaces are not supported
Warn (criu/cr-check.c:1300): Pidfd store requires pidfd_open syscall which is not supported
Warn (criu/cr-check.c:1334): Nftables based locking requires libnftables and set concatenations support
Error (criu/cr-check.c:996): failed to mount autofs: No such device
Warn (criu/cr-check.c:1160): compat_cr is not supported. Requires kernel >= v4.12
Looks good but some kernel features are missing
which, depending on your process tree, may cause
dump or restore failure.

It looks like it attempts to create the checkpoint, but right at the end I get (see the log):

(00.247778) Parasite syscall_ip at 0x555555554000
(00.248048) Error (compel/arch/x86/src/lib/infect.c:518): Can't get CS register for 135: Input/output error
(00.248179) Error (compel/arch/x86/src/lib/infect.c:551): Can't dump task 135 with LDT descriptors
(00.248375) Error (criu/cr-dump.c:1566): Can't infect (pid: 135) with parasite
(00.249940) Unlock network
(00.250489) Unfreezing tasks into 1
(00.250597) Unseizing 135 into 1
(00.251079) Error (criu/cr-dump.c:2063): Dumping FAILED.

In my lambda container the exception log says

INFO: /function/lib/netty-nio-client-2.10.72.jar is recorded as always available on restore
CR: Checkpoint ...
CRIU failed with exit code 1 - check /cr/dump4.log
Command: /tmp/sub/jdk/lib/criu dump -t 135 -D /cr --shell-job -v4 -o dump4.log
JVM: invalid info for restore provided: queued code -1
END RequestId: df693adf-960e-4912-a1de-244258825b98
REPORT RequestId: df693adf-960e-4912-a1de-244258825b98 Duration: 696.74 ms Billed Duration: 697 ms Memory Size: 3008 MB Max Memory Used: 3008 MB
org.crac.CheckpointException
at org.crac.Core$Compat.checkpointRestore(Core.java:141)
at org.crac.Core.checkpointRestore(Core.java:219)
at example.Handler.lambda$handleRequest$0(Handler.java:36)
at java.base/java.lang.Thread.run(Thread.java:833)
Suppressed: java.lang.RuntimeException: Native checkpoint failed.
at java.base/jdk.crac.Core.translateJVMExceptions(Core.java:114)
at java.base/jdk.crac.Core.checkpointRestore1(Core.java:192)
at java.base/jdk.crac.Core.checkpointRestore(Core.java:299)
at java.base/jdk.crac.Core.checkpointRestore(Core.java:278)
at java.base/javax.crac.Core.checkpointRestore(Core.java:73)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at org.crac.Core$Compat.checkpointRestore(Core.java:138)
... 3 more

Any ideas what's wrong? Near the start of the log I can see File /run/criu.kdat does not exist, which kind of suggests the CRIU libs aren't on the classpath, but they are, as I can see the extracted jdk folder. Is the Docker socket the issue, or the fact that I am on macOS rather than Linux?

I think that this happens as you're running the DIND container unprivileged, either add --privileged or individual capabilities: --cap-add=SYS_PTRACE --cap-add=CHECKPOINT_RESTORE. Not sure if you need --security-opt seccomp=unconfined. Without that, the inner docker can't get sufficient privileges either.

I tried all of these and none of them worked; I get the same issue. I wonder what Warn (criu/kerndat.c:1349): Can't get pidfd means and whether it's related to the JDK's CRIU somehow. I've tried building CRIU in another Docker container (not using the JDK with CRaC), using https://github.com/checkpoint-restore/criu/blob/criu-dev/scripts/build/Dockerfile.openj9-ubuntu as an example, and when I run ./criu check there I don't get this pidfd warning.

The code checking pidfd is not anything that would be changed in CRaC CRIU. Have you tested things in the 'topmost' container, or in a container started in the docker-in-docker? Or do you use DIND container just to have docker client around, volume-mounting the socket for the (only) docker?

I use the DIND Docker Linux container to be able to test the CRaC checkpoint functionality, because I run on a Mac M1 machine and the CRaC CRIU JDK doesn't support macOS on ARM chips; from what I understand, it can only run on Linux?

Yes, only Linux is supported ATM. As of this moment there's no CRIU for OSX (running on M1 doesn't matter much, though there might be aarch64-specific bugs ofc). But I was not sure whether you have multiple layers of Docker, or only a VM running Linux (started by OSX Docker) that runs a single Docker instance, and that's all you use.

So is there any solution for macOS currently? Is there any timeline/scope for a release of CRaC CRIU for OSX? Thank you

I am not aware of anyone trying to port CRIU to OSX, it would be a tremendous effort.
Your idea of running in Docker (on a Linux VM) makes sense; CRaC should be able to run that way. Regrettably I can't give you any estimate of when I might have more time to test it out myself.

I think the cause of that pidfd error might be that I am running on a Mac M1, which is an ARM architecture, and it looks like CRaC isn't supported on ARM even from Docker, as suggested here: https://docs.azul.com/core/crac/crac-guidelines#running-crac-on-windows-or-macos? For instance, I get this issue when I start the Docker container with --platform=linux/amd64, but when I start it via --platform linux/arm64/v8 or --platform linux/arm64/v7 I get the error rosetta error: failed to open elf at /lib64/ld-linux-x86-64.so.2, which kind of suggests it's an ARM system problem? Could you please confirm that this might be the issue?

You are simply treading into uncharted territory; yes, docs say that it would work on x86_64 only because we don't test (and commit to) other platforms that much, in this phase. There has been some work to support ARM in CRIU, see e.g. CRaC/criu@d62b80e and the CRaC code in JDK is not arch-specific - that's why I am thinking it could work.
TBH I don't have any clue how rosetta runs a virtualized OS (docker engine) - is this executed as x86_64 or as aarch64? The container must match the underlying OS.
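A minimal sketch for checking which architecture the JVM inside a container was built for, assuming it is run with the same JDK that will do the checkpoint. The class name ArchCheck and the normalize helper are made up for illustration. As far as I know, os.arch reflects the JVM build rather than the physical host, so an x86_64 JDK running under emulation would still report amd64.

```java
import java.util.Locale;

final class ArchCheck {
    /** Maps common os.arch spellings onto the two names used in this thread. */
    static String normalize(String osArch) {
        String a = osArch.toLowerCase(Locale.ROOT);
        if (a.equals("amd64") || a.equals("x86_64")) {
            return "x86_64";
        }
        if (a.equals("aarch64") || a.equals("arm64")) {
            return "aarch64";
        }
        return a; // anything else is passed through unchanged
    }

    public static void main(String[] args) {
        // Both values come from standard JVM system properties.
        System.out.println("os.name = " + System.getProperty("os.name"));
        System.out.println("os.arch = " + normalize(System.getProperty("os.arch")));
    }
}
```

Running it with the JDK in question (for example java ArchCheck.java) and comparing the result with the --platform value used for the container shows whether the two agree.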

@tzvetkovg have you tried aarch64 build of CRaC? Like https://www.azul.com/downloads/?version=java-17-lts&os=linux&architecture=arm-64-bit&package=jdk-crac#zulu

I don't remember seeing that error:

(00.247778) Parasite syscall_ip at 0x555555554000
(00.248048) Error (compel/arch/x86/src/lib/infect.c:518): Can't get CS register for 135: Input/output error
(00.248179) Error (compel/arch/x86/src/lib/infect.c:551): Can't dump task 135 with LDT descriptors
(00.248375) Error (criu/cr-dump.c:1566): Can't infect (pid: 135) with parasite
(00.249940) Unlock network
(00.250489) Unfreezing tasks into 1
(00.250597) Unseizing 135 into 1
(00.251079) Error (criu/cr-dump.c:2063): Dumping FAILED.

I can imagine the problem is caused by the underlying VM for the container. AFAIR, cross-CPU containers work less reliably than when the CPU in the container and the host match.

BTW, I think we have outdated info on https://docs.azul.com/core/crac/crac-guidelines#running-crac-on-windows-or-macos. Thanks, we'll fix it.

@AntonKozlov I've tried using the aarch64 build as you suggested. This has solved the CRaC pidfd issue. However, now I am getting a different error when checkpointing:

java.lang.UnsatisfiedLinkError: /tmp/.aws-lambda-runtime-interface-client: /tmp/.aws-lambda-runtime-interface-client: cannot open shared object file: No such file or directory (Possible cause: can't load AMD 64 .so on a AARCH64 platform)Failed to load the native runtime interface client library aws-lambda-runtime-interface-client.glibc.so. Exception: /tmp/.aws-lambda-runtime-interface-client: /tmp/.aws-lambda-runtime-interface-client: cannot open shared object file: No such file or directory (Possible cause: can't load AMD 64 .so on a AARCH64 platform)

even though the crac dependency

<dependency>
  <groupId>io.github.crac.com.amazonaws</groupId>
  <artifactId>aws-lambda-java-runtime-interface-client</artifactId>
  <version>1.0.0</version>
</dependency>

is available in the pom, so the "No such file" message looks misleading and it's more of an amd64 vs aarch64 issue? Does that mean the CRaC AWS client aws-lambda-runtime-interface-client is only available for AMD64 but not ARM? How do I configure the CRaC AWS Lambda client when building the Docker image using the aarch64 distribution? The suggested aarch64 JDK seems to be different from the JDK used in the tutorial. Any suggestions? Thanks

@AntonKozlov in addition to the above, I've tried building the POC exactly as described at https://github.com/CRaC/example-lambda with the CRaC aarch64 JDK you suggested. All steps are fine (including ./criu check) until I attempt to build the checkpoint image from within the Docker Ubuntu container. When I run
docker build -t crac-lambda-checkpoint -f Dockerfile.checkpoint . with the aarch64 JDK, it fails on

RUN /prepare-jdk.cmd.sh ;
cd /function/lib; /jdk/bin/jar -x -f aws-lambda-java-runtime-interface-client*.jar
aws-lambda-runtime-interface-client.musl.so
aws-lambda-runtime-interface-client.glibc.so

it fails with

=> ERROR [stage-1 7/9] RUN /prepare-jdk.cmd.sh ; cd /function/lib; /jdk/bin/jar -x -f aws-lambda-java-runtime-interface-client*.jar aws-lambda-runtime-inte 0.6s


[stage-1 7/9] RUN /prepare-jdk.cmd.sh ; cd /function/lib; /jdk/bin/jar -x -f aws-lambda-java-runtime-interface-client*.jar aws-lambda-runtime-interface-client.musl.so aws-lambda-runtime-interface-client.glibc.so:
/jdk/lib/server/libj : decoded 23987840 bytes
0.423 Exception in thread "main" java.lang.InternalError: Error loading java.security file
0.424 at java.base/java.security.Security.initialize(Security.java:106)
0.424 at java.base/java.security.Security$1.run(Security.java:84)
0.424 at java.base/java.security.Security$1.run(Security.java:82)
0.424 at java.base/java.security.AccessController.doPrivileged(AccessController.java:318)
0.424 at java.base/java.security.Security.(Security.java:82)
0.424 at java.base/sun.security.util.SecurityProperties.getOverridableProperty(SecurityProperties.java:57)
0.424 at java.base/sun.security.util.SecurityProperties.privilegedGetOverridable(SecurityProperties.java:48)
0.424 at java.base/sun.security.util.SecurityProperties.includedInExceptions(SecurityProperties.java:72)
0.424 at java.base/sun.security.util.SecurityProperties.(SecurityProperties.java:36)
0.424 at java.base/sun.security.util.FilePermCompat.(FilePermCompat.java:45)
0.424 at java.base/java.security.AccessControlContext.(AccessControlContext.java:269)
0.424 at java.base/java.security.AccessController.createWrapper(AccessController.java:647)
0.424 at java.base/java.security.AccessController.doPrivileged(AccessController.java:460)
0.424 at java.base/java.util.ResourceBundle$ResourceBundleProviderHelper.loadResourceBundle(ResourceBundle.java:3614)
0.424 at java.base/java.util.ResourceBundle.loadBundle(ResourceBundle.java:1837)
0.424 at java.base/java.util.ResourceBundle.findBundle(ResourceBundle.java:1768)
0.424 at java.base/java.util.ResourceBundle.findBundle(ResourceBundle.java:1722)
0.424 at java.base/java.util.ResourceBundle.findBundle(ResourceBundle.java:1722)
0.425 at java.base/java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1656)
0.425 at java.base/java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1575)
0.425 at java.base/java.util.ResourceBundle.getBundleImpl(ResourceBundle.java:1549)
0.425 at java.base/java.util.ResourceBundle.getBundle(ResourceBundle.java:858)
0.425 at jdk.jartool/sun.tools.jar.Main.(Main.java:195)

it looks like this aarch64 jdk isn't configured in the same way as the one in the lambda example?

java.lang.UnsatisfiedLinkError: /tmp/.aws-lambda-runtime-interface-client: /tmp/.aws-lambda-runtime-interface-client: cannot open shared object file: No such file or directory (Possible cause: can't load AMD 64 .so on a AARCH64 platform)

Oh, I see. aws-lambda-java-runtime-interface-client:1.0.0 does not support aarch64, all the native libs inside that jar are x86-64. The following error about java.security can be related.

We can update io.github.crac.com.amazonaws.aws-lambda-java-runtime-interface-client, although it would take some time; it is unlikely to happen before mid-January.

Another option is to rework the lambda example. With AWS API Gateway, apparently we can avoid dependencies on AWS libs, and the lambda code will become just the simplest example like https://github.com/CRaC/example-jetty, packaged in container images for AWS. We'll probably go this way, but it won't be very fast either. @tzvetkovg if you're interested, contributions are welcome! :)
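To verify which architecture the native libraries bundled in a jar target, one option is to read the ELF header of each .so entry. The sketch below is an illustration, not part of the example project: the class name NativeLibArch is made up, and it only recognizes the two e_machine values relevant here (0x3E for x86-64 and 0xB7 for AArch64, per the ELF specification).

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Enumeration;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

final class NativeLibArch {
    /** Decodes the e_machine field (bytes 18-19) of an ELF header. */
    static String machineOf(byte[] header) {
        if (header.length < 20
                || header[0] != 0x7F || header[1] != 'E'
                || header[2] != 'L' || header[3] != 'F') {
            return "not ELF";
        }
        // EI_DATA (byte 5): 1 = little endian, 2 = big endian.
        boolean littleEndian = header[5] == 1;
        int lo = header[littleEndian ? 18 : 19] & 0xFF;
        int hi = header[littleEndian ? 19 : 18] & 0xFF;
        int machine = (hi << 8) | lo;
        switch (machine) {
            case 0x3E: return "x86-64";
            case 0xB7: return "aarch64";
            default:   return "other (e_machine=0x" + Integer.toHexString(machine) + ")";
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java NativeLibArch.java <path-to-jar>");
            return;
        }
        try (JarFile jar = new JarFile(args[0])) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (!entry.getName().endsWith(".so")) {
                    continue; // only shared objects are of interest
                }
                try (InputStream in = jar.getInputStream(entry)) {
                    byte[] header = in.readNBytes(20); // requires Java 11+
                    System.out.println(entry.getName() + " -> " + machineOf(header));
                }
            }
        }
    }
}
```

Pointing it at the aws-lambda-java-runtime-interface-client jar in /function/lib should list the .musl.so and .glibc.so entries together with the architecture each was compiled for.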

@AntonKozlov aah I see, that's great, thank you for your efforts. I am looking forward to this change.

Hi @AntonKozlov, thanks for merging the PR with the new libs. I've also seen the new Lambda PR CRaC/example-lambda#3, which isn't merged yet. I tried to pull it locally but noticed that the new

<dependency>
  <groupId>io.github.crac.com.amazonaws</groupId>
  <artifactId>aws-lambda-java-runtime-interface-client</artifactId>
  <version>2.4.1.CRAC.0</version>
</dependency>

isn't pushed to maven central https://mvnrepository.com/artifact/io.github.crac.com.amazonaws/aws-lambda-java-runtime-interface-client

Hi @tzvetkovg, thanks for noticing, I published the version to the central.

@AntonKozlov thanks, I've managed to build the checkpoint POC https://github.com/crac/example-lambda following your changes! One thing I am struggling with now is checkpointing a real Spring Boot Lambda application. When I attempt the checkpoint I get this exception:

org.crac.CheckpointException
2024-01-17 09:03:27 at org.crac.Core$Compat.checkpointRestore(Core.java:144)
2024-01-17 09:03:27 at org.crac.Core.checkpointRestore(Core.java:237)
2024-01-17 09:03:27 at uk.co.ii.loader.screener.MyRecordService.lambda$startCrac$0(MyRecordService.java:27)
2024-01-17 09:03:27 at reactor.core.publisher.MonoRunnable.call(MonoRunnable.java:73)
2024-01-17 09:03:27 at reactor.core.publisher.MonoRunnable.call(MonoRunnable.java:32)
2024-01-17 09:03:27 at reactor.core.publisher.FluxSubscribeOnCallable$CallableSubscribeOnSubscription.run(FluxSubscribeOnCallable.java:227)
2024-01-17 09:03:27 at io.micrometer.context.ContextSnapshot.lambda$wrap$0(ContextSnapshot.java:78)
2024-01-17 09:03:27 at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
2024-01-17 09:03:27 at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
2024-01-17 09:03:27 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
2024-01-17 09:03:27 at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
2024-01-17 09:03:27 at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
2024-01-17 09:03:27 at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
2024-01-17 09:03:27 at java.base/java.lang.Thread.run(Thread.java:1583)
2024-01-17 09:03:27 Suppressed: jdk.internal.crac.mirror.impl.CheckpointOpenSocketException: Socket[addr=ssm.eu-west-1.amazonaws.com/67.220.227.15,port=443,localport=44452]
2024-01-17 09:03:27 at java.base/jdk.internal.crac.JDKSocketResourceBase.lambda$beforeCheckpoint$0(JDKSocketResourceBase.java:68)
2024-01-17 09:03:27 at java.base/jdk.internal.crac.mirror.Core.checkpointRestore1(Core.java:169)
2024-01-17 09:03:27 at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:286)
2024-01-17 09:03:27 at java.base/jdk.internal.crac.mirror.Core.checkpointRestore(Core.java:265)
2024-01-17 09:03:27 at jdk.crac/jdk.crac.Core.checkpointRestore(Core.java:72)
2024-01-17 09:03:27 at java.base/jdk.internal.reflect.DirectMethodHandleAccessor.invoke(DirectMethodHandleAccessor.java:103)
2024-01-17 09:03:27 at java.base/java.lang.reflect.Method.invoke(Method.java:580)
2024-01-17 09:03:27 at org.crac.Core$Compat.checkpointRestore(Core.java:141)
2024-01-17 09:03:27 ... 13 more
2024-01-17 09:03:27 Caused by: java.lang.Exception: This file descriptor was created by main at epoch:1705482193072 here
2024-01-17 09:03:27 at java.base/jdk.internal.crac.JDKFdResource.(JDKFdResource.java:60)
2024-01-17 09:03:27 at java.base/jdk.internal.crac.JDKSocketResourceBase.(JDKSocketResourceBase.java:44)
2024-01-17 09:03:27 at java.base/jdk.internal.crac.JDKSocketResource.(JDKSocketResource.java:38)

which is clearly caused by the open AWS SDK SSM socket that my project uses to read some SSM properties: CheckpointOpenSocketException: Socket[addr=ssm.eu-west-1.amazonaws.com/67.220.227.15,port=443,localport=44452]

What's the best way of dealing with this and similar errors? How do I close this socket before the checkpoint? Do I need to re-open it later on restore?
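A common pattern for this kind of error is to close the connection in a beforeCheckpoint callback and recreate it in afterRestore. The sketch below illustrates the idea under stated assumptions: CheckpointAware is a hypothetical stand-in for org.crac.Resource so that the code compiles without the org.crac dependency, and ManagedConnection is a made-up wrapper, not part of any library. In a real project the class would implement org.crac.Resource (whose two callbacks take a Context argument) and be registered with org.crac.Core.getGlobalContext().register(...).

```java
import java.util.function.Supplier;

final class ReconnectOnRestore {
    /**
     * Hypothetical stand-in for org.crac.Resource. The real interface has the
     * same two callbacks, each taking a Context<? extends Resource> argument.
     */
    interface CheckpointAware {
        void beforeCheckpoint() throws Exception;
        void afterRestore() throws Exception;
    }

    /** Made-up holder for any client that keeps a socket open. */
    static final class ManagedConnection<C extends AutoCloseable> implements CheckpointAware {
        private final Supplier<C> factory;
        private C current;

        ManagedConnection(Supplier<C> factory) {
            this.factory = factory;
            this.current = factory.get();
        }

        /** Returns the live connection, reconnecting lazily if it was closed. */
        synchronized C get() {
            if (current == null) {
                current = factory.get();
            }
            return current;
        }

        synchronized boolean isOpen() {
            return current != null;
        }

        @Override
        public synchronized void beforeCheckpoint() throws Exception {
            if (current != null) {
                current.close(); // no open socket may remain at checkpoint time
                current = null;
            }
        }

        @Override
        public synchronized void afterRestore() {
            current = factory.get(); // fresh connection in the restored process
        }
    }

    public static void main(String[] args) throws Exception {
        ManagedConnection<AutoCloseable> conn = new ManagedConnection<AutoCloseable>(
                () -> () -> System.out.println("connection closed"));
        conn.beforeCheckpoint();
        System.out.println("open at checkpoint time: " + conn.isOpen());
        conn.afterRestore();
        System.out.println("open after restore: " + conn.isOpen());
    }
}
```

Application code would call get() instead of holding the client directly, so the closed instance is never used after a restore. Whether a given AWS SDK client releases all of its sockets on close() is an assumption to verify against the checkpoint log.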

I thought that if I just add -Dspring.context.checkpoint=onRefresh when I start the simulator, so something like

/aws-lambda-rie /jdk/bin/java \
-Dspring.context.checkpoint=onRefresh \
.....

as described here https://docs.spring.io/spring-framework/reference/integration/checkpoint-restore.html, Spring would solve it automatically by closing that socket, but I guess this is only supported from Spring Boot 3.2?

Also, is it possible to override some of the environment variables when invoking the checkpointed image from the cr folder? For instance, consider this:

 exec /aws-lambda-rie java \
  -Dspring.profiles.active=devl \
  .....
  -XX:CRaCRestoreFrom=/cr

Initially, I created the checkpoint image with -Dspring.profiles.active=local. Passing a different Spring profile (devl) on restore doesn't seem to change anything when I tested.

I would expect -Dspring.context.checkpoint=onRefresh to work. Could you please share the example?
UPD: Oh, yes, the option is supposed to work only since spring-boot 3.2.

Oh sorry, I've now tested -Dspring.context.checkpoint=onRefresh with Spring Boot 3.2 and it seems to work (see the logs)

o.s.c.support.DefaultLifecycleProcessor : Triggering JVM checkpoint/restore
2024-01-18 14:58:58 2024-01-18 14:58:58.267 INFO [aid=my-crac-lambda,tid=,sid=,cty=] 129 --- [my-crac-lambda] [ main] [ ] jdk.crac : Starting checkpoint

The problem is that it gets stuck on this line. I think this blocks the checkpoint on the main Lambda thread, so it never really completes? When I attempt to invoke the function again with a new input I get

2024-01-18 15:10:34 START RequestId: ef8fb489-ba5b-4309-aa45-f64d90c5f169 Version: $LATEST
2024-01-18 15:10:34 18 Jan 2024 15:10:34,922 [ERROR] (rapid) Failed to reserve: AlreadyReserved

This may have to do with the AWS Lambda Runtime Interface Emulator (RIE) itself not being multithreaded? I think in your Lambda example you had the checkpoint invoked manually on a separate thread for a reason: https://github.com/CRaC/example-lambda/blob/master/src/main/java/example/Handler.java

case "checkpoint":
    (new Thread(() -> {
        try {
            Thread.sleep(1_000);
            org.crac.Core.checkpointRestore();
        } catch (CheckpointException | RestoreException | InterruptedException e) {
            e.printStackTrace();
        }
    })).start();

Is there any workaround to achieve this when using -Dspring.context.checkpoint=onRefresh?