aws-greengrass/aws-greengrass-nucleus

(aws-greengrass-nucleus): Intermittent hang on startup in Docker on Mac ARM (M1 Max)

Closed this issue · 28 comments

Describe the bug
When starting the Greengrass service in a Docker environment, Nucleus does not start 100% of the time. It will hang on the log output Launching Nucleus...

When investigating the logs pasted below, it appears just to be stuck. There are no critical errors in the logs that I can see. I have also turned on LOG_LEVEL=DEBUG

The issue is intermittent; there are times when it will start up correctly. There are other times when it hangs. At first, I thought it might have been an internet connectivity issue. However, after testing on many different networks, I can't seem to reproduce that as the underlying issue.

I have one simple node.js component deployed, but when Nucleus hangs, it hasn't even tried to start that component yet.

To Reproduce

Dockerfile - Built and uploaded to Elastic Container Registry

FROM amazon/aws-iot-greengrass:latest as Greengrass

ENV NODE_VERSION=16.15.1
RUN curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.0/install.sh | bash
ENV NVM_DIR=/home/ggc_user/.nvm
RUN chmod +x "${NVM_DIR}/nvm.sh"
RUN . "${NVM_DIR}/nvm.sh" && nvm install ${NODE_VERSION}
RUN . "${NVM_DIR}/nvm.sh" && nvm use v${NODE_VERSION}
RUN . "${NVM_DIR}/nvm.sh" && nvm alias default v${NODE_VERSION}
ENV PATH="${NVM_DIR}/versions/node/v${NODE_VERSION}/bin/:${PATH}"

USER root

docker-compose.yml

version: '3.8'
services:
  greengrass:
    restart: always
    container_name: greengrass-${AWS_ACCOUNT}-${AWS_REGION}
    platform: linux/amd64
    image: ${AWS_ACCOUNT}.dkr.ecr.us-west-2.amazonaws.com/iot-greengrass-docker-image-${AWS_ACCOUNT}-${AWS_REGION}:latest
    volumes:
      - ./docker/greengrass/v2/:/greengrass/v2/:rw
    ports:
      - "8080:8080"
      - "8443:8443"
    env_file:
      - .env

.env - Redacted vars

AWS_ACCOUNT=123
AWS_REGION=us-west-2

THING_NAME=TestThing
THING_SHADOW_NAME=default
IOT_ENDPOINT={accountEndpointId}-ats.iot.us-west-2.amazonaws.com
THING_GROUP_NAME=ThingGroup
THING_POLICY_NAME=ThingPolicy
TES_ROLE_NAME=RoleName
TES_ROLE_ALIAS_NAME=RoleAlias

Greengrass core component deployment document

{
  "deploymentId": "1234",
  "schemaDate": "2021-05-17",
  "deploymentName": "ThingGroup",
  "configurationArn": "arn:aws:greengrass:us-west-2:514264552060:configuration:thinggroup/ThingGroup:73",
  "creationTimestamp": 1668199452502,
  "components": {
    "aws.greengrass.TokenExchangeService": {
      "version": "2.0.3"
    },
    "aws.greengrass.ShadowManager": {
      "version": "2.2.2"
    },
    "MyComponent": {
      "version": "1.0.0"
    },
    "aws.greengrass.Nucleus": {
      "version": "2.8.1",
      "configurationUpdate": {
        "merge": {
          "jvmOptions": "-Xmx256m",
          "logging": {
            "format": "TEXT",
            "level": "DEBUG"
          },
          "telemetry": {
            "enabled": "true"
          }
        }
      }
    }
  },
  "failureHandlingPolicy": "ROLLBACK",
  "requiredCapabilities": [],
  "componentUpdatePolicy": {
    "timeout": 60,
    "action": "NOTIFY_COMPONENTS"
  },
  "configurationValidationPolicy": {
    "timeout": 60
  }
}

Expected behavior

Greengrass Nucleus should start 100% of the time.

Actual behavior

Greengrass Nucleus hangs on Launching Nucleus...

Environment

  • OS: linux/amd64
  • JDK version:
Openjdk 11.0.13 2021-10-19 LTS
OpenJDK Runtime Environment Corretto-11.0.13.8.2 (build 11.0.13+8-LTS)
OpenJDK 64-Bit Server VM Corretto-11.0.13.8.2 (build 11.0.13+8-LTS, mixed mode)
  • Nucleus version: 2.8.1
  • Docker Desktop: 4.13.1
  • NodeJs version: 16.15.1
  • Docker Host:
2022 Mac Studio
Monterey 12.5
M1 Max
32 GB Ram

Additional context
The impact of the bug is critical. The service will not start 100% of the time.

Here are the logs from /greengrass/v2/logs/greengrass.log

https://gist.github.com/seawatts/e4a556712320110a774665fa911fa315

Updated

MyComponent Recipe

{
  "RecipeFormatVersion": "2020-01-25",
  "ComponentName": "MyComponent",
  "ComponentVersion": "1.0.0",
  "ComponentType": "aws.greengrass.generic",
  "ComponentDescription": "Component for MyComponent",
  "ComponentPublisher": "MyCompany",
  "Manifests": [
    {
      "Lifecycle": {
        "Recover": {
          "RequiresPrivilege": true,
          "Script": "NODE_ENV=production /home/ggc_user/.nvm/versions/node/v16.15.1/bin/node {artifacts:decompressedPath}/hash1234/index.js"
        },
        "Run": {
          "RequiresPrivilege": true,
          "Script": "NODE_ENV=production /home/ggc_user/.nvm/versions/node/v16.15.1/bin/node {artifacts:decompressedPath}/hash1234/index.js"
        }
      },
      "Artifacts": [
        {
          "Uri": "s3://cdk-hnb659fds-assets-accountId-us-west-2/hash1234.zip",
          "Digest": "+1234",
          "Algorithm": "SHA-256",
          "Unarchive": "ZIP",
          "Permission": {
            "Read": "OWNER",
            "Execute": "NONE"
          }
        }
      ]
    }
  ],
  "Lifecycle": {}
}

Hello,
The logs in the gist are not available to us, can you please make it public?

@MikeDombo hmm, it says it's public. Do you have another recommendation?

image

I'm now able to access it.

These logs don't show any issue, it looks like Greengrass is running.

Other than not seeing the output "Launched Nucleus", are you experiencing any problems? Can you deploy to the device and have your components execute?

Log shows: 2022-11-11T18:15:43.563Z [DEBUG] (MyComponent-lifecycle) com.aws.greengrass.lifecyclemanager.GenericExternalService: service-state-transition-start. {serviceName=MyComponent, currentState=RUNNING}

Which tells me that Greengrass has started your component called MyComponent

Thanks for taking a look. Yes, I did see that line. My component does not seem to be starting at that time. However, as I mentioned above, sometimes it does, just not 100% of the time.

When this happens, I also do not see an MQTT message sent to the '$aws/events/presence/#' topic

The log also shows Successfully connected to AWS IoT Core and we are able to successfully subscribe and check for deployments. I'm not seeing anything here which is wrong. This means that our MQTT connection is definitely working properly.

The missing "Nucleus launched" log looks to me that there's just some output buffering in Docker which isn't flushing the "launched" log.

Yeah, that line looks correct as well. That makes sense about the output buffer. I updated the main description with MyComponent Recipe

Recipe looks ok. Your component is being started. You should look at the component's log file if you're having a problem with it. (/greengrass/v2/logs/MyComponent.log)

When looking at the component logs, the file is empty when it does not start. When it does start correctly, the logs show up correctly there. (/greengrass/v2/logs/MyComponent.log)

Thanks for being so responsive, by the way. 😄 Very impressed.

If you're in the docker container, can you use ps to look at running processes, see if the node process is running.

If you haven't restarted the container, it looks like it should be under pid 99.

sh-4.2# ps
  PID TTY          TIME CMD
    1 ?        00:00:00 loader
   20 ?        03:42:09 java
  106 ?        00:00:00 ps
sh-4.2# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.2 160524 11364 ?        Ssl  18:17   0:00 /usr/bin/qemu-x86_64 /bin/sh /bin/sh -x /greengrass/v2/alts/current/distro/bin/loader
root        20  100  3.1 3002124 127536 ?      Sl   18:17 222:51 /usr/bin/qemu-x86_64 /usr/bin/java java -Dlog.store=FILE -Xmx256m -Droot=/greengrass/v2 -jar /greengrass/v2/alts/current/distro/lib/Greengrass.jar --setup-system-service false
root        39  0.0  0.3 161140 13600 pts/0    Ssl  18:20   0:00 /usr/bin/qemu-x86_64 /bin/sh /bin/sh
root       109  0.0  0.2 200852 10312 ?        Rl+  18:17   0:00 ps aux

Have any of the logs updated since the gist?

Looks like Greengrass is using 100% of the CPU, so it seems to be churning hard on something.

Please also post the contents of /greengrass/v2/config and /greengrass/v2/deployments.

Hmm, maybe the component is silently exiting? Not sure how that would happen. The nodejs component opens a Websocket server for another local application to connect.

ls -la /greengrass/v2/config 
total 2616
drwx------ 2 root root    4096 Nov 11 18:15 .
drwxr-xr-x 1 root root    4096 Nov 11 18:15 ..
-rw-r--r-- 1 root root    3489 Oct 28 21:37 bootstrap.tlog
-rw-r--r-- 1 root root     158 Oct 28 21:36 bootstrap.tlog~
-rw-r--r-- 1 root root 1180201 Nov 11 18:15 config.tlog
-rw-r--r-- 1 root root 1453839 Nov  2 03:59 config.tlog~
-rw-r--r-- 1 root root    6290 Nov 11 18:15 effectiveConfig.yaml
-rw-r--r-- 1 root root    6259 Nov 11 18:15 effectiveConfig.yaml~
sh-4.2# ls -la  /greengrass/v2/deployments 
total 16
drwx------ 3 root root 4096 Nov 11 09:22 .
drwxr-xr-x 1 root root 4096 Nov 11 18:15 ..
drwxr-xr-x 2 root root 4096 Nov 11 09:22 957c24b1-e5de-422d-8692-17fdd281b035
lrwxrwxrwx 1 root root   63 Nov 11 09:22 previous-success -> /greengrass/v2/deployments/957c24b1-e5de-422d-8692-17fdd281b035

Should I post the contents of some of those files?

Can you share the config.tlog and config.tlog~.

Please also have a look at the kernel log using dmesg our journalctl. This feels like something is odd going on with the docker container.

Have you tried running Greengrass without Docker?

I just copied the whole config.tlog file, it looks like it has previous deployments in it as well, wasn't sure where to cut it off so I just pasted the whole thing.

config.tlog
https://gist.github.com/seawatts/c1e1bbaeb51100ac99673b13666132d7

config.tlog~
https://gist.github.com/seawatts/6f9c040509917b56089c61eee81c88cf

sh-4.2# sudo dmesg
dmesg: read kernel buffer failed: Operation not permitted
sh-4.2# sudo journalctl
No journal files were found.
-- No entries --

Run the dmesg and journalctl commands on the host, not inside of docker. I'm looking for errors or any log from docker itself in the Linux Kernel.

dmesg
https://gist.github.com/seawatts/1ed23bd3a51952c489c1ca726ca58f67

journalctl

$ journalctl
command not found: journalctl

Host computer is a Mac Studio running:

Monterey 12.5
M1 Max
32 GB Ram

I see.

Have you ever run this on a Linux host with (or without) Docker? Docker on Mac, particularly M1 Mac certainly could have problems. We have not validated support for Docker on Macs. You can find our supported platforms here: https://docs.aws.amazon.com/greengrass/v2/developerguide/setting-up.html#installation-requirements

I have not run it outside Docker or on a Linux host. Currently, we are constricted to running a Mac host in our production environment due to other software we need to be running.

The interesting thing to me is that it works some of the times. This is not a calculated number; however, it feels like it works about 70% - 80% of the time.

All I do to get it working is restart the Docker container. This is fine for development purposes; however, it's not scalable in our production environment.

In the deployment you're setting "jvmOptions": "-Xmx256m". If you do not limit the memory usage, does this behavior go away?

From the output of ps, java is using 100% of the CPU which means that Greengrass is doing a lot of work, or Java is doing a lot of work. Restricting the heap size can potentially cause this because Java would need to collect garbage much more, and then this would slow everything down.

Have any of the logs under /greengrass/v2/logs been updated since we started?

My suggestions over the weekend and next week would be

  1. Deploy an absolute minimum of components - only Nucleus and "MyComponent"
  2. Remove the memory limit by setting jvmOptions to ""
  3. Test on an x86 Mac (is possible) or a supported device (Linux and Windows)
  4. Reach out to your AWS account manager or AWS support so that we can have a meeting to work on the problem together

Awesome, thanks for the help so far. I'll go through those outlined steps next and update this thread.

None of the logs have been updated since we started. Would you like me to share anything else specifically?

Regarding going through AWS Support and contacting you, is there something specific I should mention to have us scheduled together?

You can point them to my comment suggesting a meeting and they'll work with you and our oncall engineer to set that up.

Great, thanks! 😄

Any updates?

Closing out this issue. Reopen as necessary.

I tried out the options above. And nothing was able to work with M1 Mac. This issue should probably stay open, as it is a bug in Greengrass.

I gave up on using Greengrass for now.

Given that Greengrass does not support Macintosh at all, this is not a bug which we will address (assuming that it is a bug).

Please use one of our supported systems in future: https://docs.aws.amazon.com/greengrass/v2/developerguide/setting-up.html#greengrass-v2-supported-platforms