Using NVMe storage appears to fail in v6.0.0
geoffharcourt opened this issue · 6 comments
Describe the bug
We just upgraded from 5.16 to 6.0 and our jobs fail shortly after they begin.
Steps To Reproduce
Steps to reproduce the behavior:
- Set up the stack with `m5dn.large` instances and enable `EnableInstanceStorage`
- Run a job
- Jobs fail with a message like this:
```
Setting up elastic stack environment (v6.0.0)
Checking docker
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
Checking disk space
df: /mnt/ephemeral/docker: No such file or directory
Cleaning up docker resources older than 4h
Total reclaimed space: 0B
Checking disk space again
df: /mnt/ephemeral/docker: No such file or directory
Disk health checks failed
```
Expected behavior
In v5 we were able to use NVMe storage on these instance types with the `EnableInstanceStorage` setting.
Actual behavior
Jobs fail at the disk health check phase.
Stack parameters (please complete the following information):
- AWS Region: us-east-2
- Version v6.0.0
- instance types: m5dn.large
- EnableInstanceStorage: true
- RootVolumeEncrypted: true
Additional context
I can confirm this is a bug - I built my own AMI with the following and the boxes came up.
```
echo 'mkdir -p /mnt/ephemeral/docker' | sudo tee -a /usr/local/bin/bk-mount-instance-storage.sh
```
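The one-liner above appends a `mkdir` to the stack's mount script. A minimal sketch of what that fix amounts to is below; the `/mnt/ephemeral/docker` path comes from the error message in this issue, while `EPHEMERAL_ROOT` is a hypothetical override so the sketch can run anywhere without root (on a real instance it would run as root with the real path, before the Docker daemon and the disk health check start).

```shell
#!/usr/bin/env bash
# Sketch of the workaround: ensure the Docker data directory exists before
# the stack's disk health check runs `df` against it.
set -eu

# Hypothetical override; on a real instance this would be /mnt/ephemeral.
EPHEMERAL_ROOT="${EPHEMERAL_ROOT:-/tmp/ephemeral-demo}"
DOCKER_DIR="$EPHEMERAL_ROOT/docker"

# The v6.0.0 mount script leaves this directory missing; create it.
mkdir -p "$DOCKER_DIR"

# The failing health check is essentially `df <dir>`; it now succeeds.
df "$DOCKER_DIR" > /dev/null && echo "disk check ok: $DOCKER_DIR"
```

Baking this into a custom AMI (as above) or running it from instance user data both work; the key point is that it must happen after instance storage is mounted and before the health check runs.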
Thanks for reporting this @geoffharcourt and @joemiller. And huge props to @mpestritto for finding a fix!
We will get this out with the next version.
I am still seeing failures at boot on new nodes with instance storage enabled. Is it just me, or are others still seeing this too?
@joemiller Could you send us the parameters you are setting on the stack, and some CloudWatch logs, especially from the log group `/buildkite/elastic-stack/<instance-id>`? You can send these to support@buildkite.com if there are sensitivity concerns.