aws-greengrass/aws-greengrass-nucleus

(nucleus): deployment speed much slower than in gg version 1

Closed this issue · 6 comments

Describe the bug
We are developing an IoT device which ships a captive portal.
Our users are presented with a wizard to configure (a) internet connectivity and (b) provision Greengrass via a trusted user flow.
Then, they use the deployed greengrass components to interact with the device.

Our issue is that the greengrass nucleus deployment takes very long (about 2-3 minutes to finish).
For a user who configures our device via the captive portal, the Greengrass deployment time is quite noticeable.
Our deployment took about 10-20 seconds in greengrass version 1.

To Reproduce
I do not know what information to provide here. Please tell me and I will deliver it ASAP.

Expected behavior
The nucleus deployment should not take 5-10x as long as in the previous Greengrass version.

Actual behavior
2-3 minutes deployment times instead of 10-20 seconds.

Environment

  • OS: custom Yocto build
  • JDK version: see latest yocto recipe
  • Nucleus version: 2.7.0

Additional context

It is hard to argue this is an improvement over greengrass v1. We want to understand whether we are doing something fundamentally wrong or whether it is expected that deployments in v2 are far slower than in v1.

Please provide DEBUG level logs showing the deployment executing. Set the log level in the Greengrass Nucleus configuration: https://docs.aws.amazon.com/greengrass/v2/developerguide/greengrass-nucleus-component.html#greengrass-nucleus-component-configuration. "logging": {"level": "DEBUG"}.

Deployment time depends primarily on the size of your own components' artifacts and on the available bandwidth. For example, downloading a component which has a 5 GB Docker image is going to take a while.

Deployments which do not require changing the versions of components will not re-download any artifacts. These deployments should complete in approximately 30 seconds.

You can also update deploymentPollingFrequencySeconds as described in the documentation I linked above. Set it to "1" to decrease the deployment check interval to 1 second. The default is 15 seconds, which means that it can take up to 15 seconds before the device will start executing the deployment after receiving it.
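
The two settings mentioned above can be expressed as a single configuration merge for the Nucleus component. The following is a minimal sketch, not a definitive recipe: it assumes the settings are applied through a deployment's `configurationUpdate` merge payload (a JSON-encoded string), and it assumes `deploymentPollingFrequencySeconds` accepts an integer. The component version is taken from the environment described in this issue.

```python
import json

# Nucleus configuration keys named in this thread and in the linked documentation.
nucleus_overrides = {
    "logging": {"level": "DEBUG"},
    # Assumption: the value is given as an integer number of seconds (default 15).
    "deploymentPollingFrequencySeconds": 1,
}

# Assumption: the overrides are delivered through a deployment, where the
# merge payload of a component's configurationUpdate is a JSON-encoded string.
component_entry = {
    "aws.greengrass.Nucleus": {
        "componentVersion": "2.7.0",
        "configurationUpdate": {"merge": json.dumps(nucleus_overrides)},
    }
}

print(json.dumps(component_entry, indent=2))
```

Remember to revert the log level once debugging is done, since DEBUG logging is verbose.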

Hi @MikeDombo!

Thanks for the information! Some of it was very helpful to get the debugging going.
I started an initial deployment (i.e. without any artifacts being on the device) with nucleus in DEBUG mode.
I attached the resulting greengrass.log. I further attached the effectiveConfig.yaml.
Could you please skim over it and tell me what is taking so long?

Deployment Start Time: 14:04:26
Deployment End Time: 14:06:43

greengrass.log.txt
effectiveConfig.yaml.txt

In this deployment, the Nucleus is restarted because the bootstrap script changed. This adds roughly 40 seconds to the deployment time.

Subsequent deployments will not require this restart, so they'll go faster.
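
To put the numbers in perspective, here is a quick back-of-the-envelope calculation using the start and end timestamps reported for this deployment. The 40-second restart cost is only the rough estimate given above, so the remainder is an approximation rather than a measured value.

```python
from datetime import datetime

# Timestamps reported for the initial deployment in this thread.
start = datetime.strptime("14:04:26", "%H:%M:%S")
end = datetime.strptime("14:06:43", "%H:%M:%S")

total_seconds = int((end - start).total_seconds())

# Rough estimate of the time spent restarting the Nucleus (not measured).
restart_estimate_seconds = 40

remaining_seconds = total_seconds - restart_estimate_seconds
print(f"total: {total_seconds}s, without restart: ~{remaining_seconds}s")
```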

The bootstrap script changed because the jvmOptions configuration option was not set when Greengrass was installed. You can avoid this initial restart by setting the jvmOptions Nucleus configuration option during the initial installation.

Or, as I mentioned, this only affects the initial deployment. All future deployments will be faster.
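
One way to set jvmOptions at install time is through a partial configuration file passed to the installer. The sketch below only renders such a fragment as text. The helper name `render_init_config` and the `-Xmx64m` value are hypothetical placeholders; the assumption is that the value should match whatever your first deployment would otherwise apply, so that the bootstrap script does not change.

```python
from textwrap import dedent

# Hypothetical value: substitute the JVM options your first deployment would apply.
JVM_OPTIONS = "-Xmx64m"


def render_init_config(jvm_options: str) -> str:
    """Render a partial Nucleus configuration fragment as YAML text."""
    return dedent(f"""\
        services:
          aws.greengrass.Nucleus:
            configuration:
              jvmOptions: "{jvm_options}"
        """)


print(render_init_config(JVM_OPTIONS))
```

The rendered text would be merged into the configuration file that the installer reads (for example via its `--init-config` argument), alongside the other settings your provisioning flow already needs.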

Thanks for mentioning the jvmOptions issue @MikeDombo! We will try to set that option correctly.
Do you know how to further speed up the initial deployment?

Since our users configure the device via a captive portal, we want to minimize the initial deployment time.
If possible, we would like to prepare a default deployment containing the AWS-provided components, so that we only deploy our custom components when a user first configures the device.

The idea is that we have a set of common components which every device requires (e.g. AWS secure tunneling) and a user-specific set of components. The latter can be deployed after the user has connected the device to the internet via the captive portal.

No, there isn't anything else I can suggest to speed things up. Your logs all seem appropriate and nothing is unexpected.

I tested the fix and it takes 50 seconds right now. That's acceptable. Perfect! Thanks again!