buildkite/elastic-ci-stack-for-aws

Using NVMe storage appears to fail in v6.0.0

geoffharcourt opened this issue · 6 comments

Describe the bug
We just upgraded from 5.16 to 6.0 and our jobs fail shortly after they begin.

Steps To Reproduce
Steps to reproduce the behavior:

  1. Set up the stack with m5dn.large instances and enable EnableInstanceStorage
  2. Run a job
  3. Jobs fail with a message like this:


Setting up elastic stack environment (v6.0.0)   0s
Checking docker
CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES
Checking disk space
df: /mnt/ephemeral/docker: No such file or directory
Cleaning up docker resources older than 4h
Total reclaimed space: 0B
Checking disk space again
df: /mnt/ephemeral/docker: No such file or directory
Disk health checks failed

Expected behavior
In v5 we were able to use NVMe storage on these instance types with the EnableInstanceStorage setting.

Actual behaviour
Jobs fail at the disk health check phase

Stack parameters (please complete the following information):

  • AWS Region: us-east-2
  • Version v6.0.0
  • instance types: m5dn.large
  • EnableInstanceStorage: true
  • RootVolumeEncrypted: true


I can confirm this is a bug - I built my own AMI with the following and the boxes came up.

echo 'mkdir -p /mnt/ephemeral/docker' | sudo tee -a /usr/local/bin/bk-mount-instance-storage.sh
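For anyone applying the same workaround by hand, here is a minimal sketch of the idea as a standalone function. It is an illustration under assumptions, not the contents of bk-mount-instance-storage.sh: the function name ensure_docker_dir is hypothetical, and the default path is simply the one from the df error above.

```shell
# Hypothetical helper (not part of the elastic stack scripts).
# Creates the Docker data directory if it is missing, then confirms
# that df can report on it, which is the call that failed in the job log.
ensure_docker_dir() {
  local dir="${1:-/mnt/ephemeral/docker}"  # assumed default, taken from the error message
  mkdir -p "$dir" || return 1              # no-op when the directory already exists
  df -P "$dir" >/dev/null                  # non-zero exit if the path still cannot be checked
}
```

Calling ensure_docker_dir with no argument targets /mnt/ephemeral/docker; passing a different path makes it easy to try out on a machine without instance storage.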

Thanks for reporting this @geoffharcourt and @joemiller. And huge props to @mpestritto for finding a fix 💖

We will get this out with the next version.

I am still seeing failures at boot on new nodes with instance storage enabled. Is it just me, or are others still seeing this too?

@joemiller Could you send us the parameters you are setting on the stack, along with some CloudWatch logs, especially from the log group /buildkite/elastic-stack/<instance-id>? You can send these to support@buildkite.com if there are sensitivity concerns.

@triarius sure, I can't do it at the moment but should be able to get that info in a day or two 👍

@triarius thanks for the hint about the cloudwatch path. I found this:

/usr/local/bin/bk-mount-instance-storage.sh: line 32: mdadm: command not found
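An error like this could be caught earlier with a pre-flight check for the tools a script depends on. The following is only a sketch of that idea; require_commands is a hypothetical helper and does not come from bk-mount-instance-storage.sh.

```shell
# Hypothetical helper (not part of the elastic stack scripts).
# Returns 0 if every named command is on PATH; otherwise prints each
# missing command to stderr and returns 1.
require_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing required command: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use before combining several NVMe disks into one array:
# require_commands mdadm || exit 1
```

Failing with an explicit message at the top of a script makes the missing package obvious, rather than leaving it to surface as "command not found" partway through.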

the instance type is c6id.32xlarge and has 4 nvme disks: