alan-turing-institute/data-safe-haven

Path clash with Azure local storage

Closed this issue · 27 comments

✅ Checklist

  • I have searched open and closed issues for duplicates.
  • This is a problem observed when deploying a Data Safe Haven.
  • I can reproduce this with the latest version.
  • I have read through the documentation.
  • This isn't an open-ended question (open a discussion if it is).

💻 System information

  • Operating System: Debian bookworm
  • Data Safe Haven version: develop

📦 Packages

List of packages
acme==2.10.0
annotated-types==0.7.0
appdirs==1.4.4
Arpeggio==2.0.2
attrs==24.2.0
azure-common==1.1.28
azure-core==1.31.0
azure-identity==1.18.0
azure-keyvault-certificates==4.8.0
azure-keyvault-keys==4.9.0
azure-keyvault-secrets==4.8.0
azure-mgmt-compute==33.0.0
azure-mgmt-containerinstance==10.1.0
azure-mgmt-core==1.4.0
azure-mgmt-dns==8.1.0
azure-mgmt-keyvault==10.3.1
azure-mgmt-msi==7.0.0
azure-mgmt-rdbms==10.1.0
azure-mgmt-resource==23.1.1
azure-mgmt-storage==21.2.1
azure-storage-blob==12.23.1
azure-storage-file-datalake==12.17.0
azure-storage-file-share==12.18.0
certifi==2024.8.30
cffi==1.17.1
charset-normalizer==3.3.2
chevron==0.14.0
click==8.1.7
cryptography==43.0.1
-e git+https://github.com/craddm/data-safe-haven.git@a6a6993ad7bbcc02e8f05629fe0fd5ab7154a900#egg=data_safe_haven
debugpy==1.8.6
dill==0.3.9
dnspython==2.6.1
fqdn==1.5.1
grpcio==1.60.2
idna==3.10
isodate==0.6.1
josepy==1.14.0
markdown-it-py==3.0.0
mdurl==0.1.2
msal==1.31.0
msal-extensions==1.2.0
msrest==0.7.1
oauthlib==3.2.2
parver==0.5
portalocker==2.10.1
protobuf==4.25.5
psycopg==3.2.3
pulumi==3.134.1
pulumi_azure_native==2.63.0
pulumi_random==4.16.6
pycparser==2.22
pydantic==2.9.2
pydantic_core==2.23.4
Pygments==2.18.0
PyJWT==2.9.0
pyOpenSSL==24.2.1
pyRFC3339==1.1
pytz==2024.2
PyYAML==6.0.2
requests==2.32.3
requests-oauthlib==2.0.0
rich==13.8.1
semver==2.13.0
setuptools==75.1.0
shellingham==1.5.4
simple_acme_dns==3.1.0
six==1.16.0
typer==0.12.5
typing_extensions==4.12.2
urllib3==2.2.3
validators==0.28.3
websocket-client==1.8.0

🚫 Describe the problem

Virtual machine sizes with local storage (mounted at /mnt) clash with NFS mounts defined in /etc/fstab.

🌳 Log messages

Relevant log messages

[Screenshot of log messages]

♻️ To reproduce

Deploy a workspace using a VM size with local storage.

I saw the same problem when we first tested #2092. @JimMadge : can you take a look?

Same missing setting you reported there: ldap is missing from group and passwd. [image]

Looks like the problem was you already have a disk mounted at /mnt. What size are you using?

@JimMadge : Is something automatically mounted at /mnt? Possibly the (badly documented) temp disk that lives on the same physical machine as the VM?

This might also explain why mounting at /shared was fine but /mnt/shared causes a problem.

Standard_D2s_v3

btw I literally just got it running by creating the /mnt subdirectories from the console. Maybe adding the -m flag to mount -fav, so it creates the directories if they don't exist, will fix it?

Hmm, I'm not sure. That size shouldn't have a temporary disk.

You can see in the fstab output though that there is a device /dev/disk/cloud/azure_resource-part1 at /mnt. That doesn't look right to me.
I haven't seen this problem in any deployments I've done recently.

Is it worth abandoning /mnt and putting our drives somewhere else?

Or alternatively, explicitly adding something like the following to /etc/fstab

/dev/disk/cloud/azure_resource-part1	/mnt/tmp	auto	defaults,nofail,_netdev	0	2

I'd rather understand what is happening here. It isn't a mount that we have defined and I'm not sure what it is.

If it is something Azure is adding we could make sure to remove entries like that from fstab (if that is the problem here).

The end of this section says "For Linux VMs, the temporary disk is /dev/sdb1 and is mounted at /mnt/resource or /mnt.", so I'm not sure we can disable this.

According to this outdated answer it might be possible to change this in waagent.conf. However, I'm not confident that we could use cloud-init to change that file since waagent is used to run cloud-init.

Otherwise, we should see whether adding an explicit mount point for /dev/disk/cloud/azure_resource-part1 will fix it (as above).

Oh wait, the Dvs3 series do have temporary disks.

Is there a reason to use such an old offering?

I think we should be robust against the use of VM SKUs that have temporary disks (which is most of them) regardless of whether Dvs3 is a sensible SKU to use.

It's just the one that was our default on the old codebase, so I still use it as a default out of habit, but as @jemrobinson says, we should be robust to stuff like this IMO. (I think the GPU VMs we currently recommend, e.g. Standard_NC6s_v3, also have local temp disks?)

(which is most of them)

I'm not sure if that is true, it tends to be the older offerings.
Looks like the high performance sizes and those with accelerators have local disks though, sometimes NVMe.

From here

Most VMs contain a temporary disk, which is not a managed disk.

I'm sure it has been true, but the trend in the current and new general purpose sizes is to not include local storage and to provide 'd' series variants for those that want it. Note that the line has been in the docs for quite a few years. That said, I don't think I want to enumerate all the available sizes or sizes*availability 😅.

It does seem common on sizes with accelerators though. That makes sense as the users would likely want fast storage, and the physical nodes in the data centre are more likely to have onboard storage.

We'll need to fix this to enable GPU/FPGA sizes.

Adding -o X-mount.mkdir fixed this on the Standard_D2s_v3 at least, although I'm still unable to log in

e.g. while (! mountpoint -q /mnt/input); do sleep 5; mount -o X-mount.mkdir /mnt/input; done

What does that option do?

I think I'd rather just not mount local storage at /mnt so that we are consistent across machines with and without local storage.

It creates the directory if it doesn't exist

Alias --mkdir might be more clear?

it would be, but for some reason that's not a supported option on the machine I tested
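For reference, here is a rough Python sketch of the retry loop above with an upper bound on attempts, so a mount that never succeeds can't block forever. The function names, the attempt count and the delay are all assumptions for illustration. The only real commands used are `mountpoint -q` and `mount -o X-mount.mkdir`, as in the one-liner.

```python
import subprocess
import time
from typing import Callable


def is_mounted(path: str) -> bool:
    # `mountpoint -q` exits with 0 only when `path` is a mount point.
    return subprocess.run(["mountpoint", "-q", path], check=False).returncode == 0


def try_mount(path: str) -> None:
    # X-mount.mkdir asks mount to create the target directory if it is missing.
    subprocess.run(["mount", "-o", "X-mount.mkdir", path], check=False)


def wait_for_mount(
    path: str,
    attempts: int = 12,  # assumed limit, not taken from the thread
    delay: float = 5.0,
    check: Callable[[str], bool] = is_mounted,
    mount: Callable[[str], None] = try_mount,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry mounting `path` until it is mounted or the attempts run out."""
    for _ in range(attempts):
        if check(path):
            return True
        sleep(delay)
        mount(path)
    return check(path)
```

The `check`, `mount` and `sleep` parameters are only there so the loop can be exercised without root; on a real VM the defaults would be used, e.g. `wait_for_mount("/mnt/input")`.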

This doesn't really deal with the question of what happens in these scenarios:

  • temp disk mounted at /mnt
  • data mounted at /mnt/ingress
  • temp disk unmounted

OR

  • data mounted at /mnt/ingress
  • temp disk mounted at /mnt

In either case, we'd lose access to /mnt/ingress. I think it's safer/better to overwrite where the temp disk mounts or to move our mounts outside /mnt.
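To make the clash concrete, here is a small sketch that scans fstab-style text and reports mount points nested beneath another mount point, which is the situation in both scenarios above. The helper names and the NFS line in the usage example are hypothetical; only the temp disk device path comes from this thread.

```python
from pathlib import PurePosixPath


def parse_mount_points(fstab_text: str) -> list[str]:
    """Return the mount point (second field) of each non-comment fstab line."""
    points = []
    for line in fstab_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and not fields[0].startswith("#"):
            points.append(fields[1])
    return points


def nested_mounts(points: list[str]) -> list[tuple[str, str]]:
    """Return (parent, child) pairs where child lies beneath parent."""
    pairs = []
    for parent in points:
        if parent == "/":
            continue  # everything is beneath the root filesystem
        for child in points:
            if child != parent and PurePosixPath(parent) in PurePosixPath(child).parents:
                pairs.append((parent, child))
    return pairs


example = (
    "/dev/disk/cloud/azure_resource-part1 /mnt auto defaults,nofail 0 2\n"
    "nfs.example:/ingress /mnt/ingress nfs defaults 0 0\n"  # hypothetical NFS entry
)
print(nested_mounts(parse_mount_points(example)))
```

With the temp disk moved to something like /mnt/scratch, the same check reports no nested pairs.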

I think the best solution would be that we standardise where our mounts and temp disk(s) are in all cases.

I like having our mounts at /mnt, it feels idiomatic. I would put local disks at something like /scratch, /var/scratch, /mnt/scratch.

We might need some logic like "if /dev/disk/cloud/... then ..."
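A minimal sketch of that kind of conditional, assuming the device path is the one from the fstab line suggested earlier in this thread and that /mnt/scratch is the chosen target. The function name is made up, and the mount options are simply copied from that earlier suggestion rather than verified.

```python
from pathlib import Path
from typing import Optional

# Device path as written in the fstab line proposed earlier in this thread.
RESOURCE_DEVICE = Path("/dev/disk/cloud/azure_resource-part1")


def scratch_fstab_line(
    device: Path = RESOURCE_DEVICE,
    mount_point: str = "/mnt/scratch",  # assumed target, one of the options floated above
) -> Optional[str]:
    """Return an fstab line for the temp disk, or None if the VM has no such device."""
    if not device.exists():
        return None
    return f"{device}\t{mount_point}\tauto\tdefaults,nofail,_netdev\t0\t2"
```

On a size without a temporary disk this returns None, so nothing would be added to /etc/fstab.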

Agreed that /var/scratch or /mnt/scratch makes sense for the temp disk. Have you tried the fstab line I put above?

Looks like it's possible to change the location of the temp disk in /etc/waagent.conf:

There's a field ResourceDisk.MountPoint which is set to /mnt on our VMs

https://learn.microsoft.com/en-us/azure/virtual-machines/extensions/agent-linux

We might be going in circles here, but since waagent is used to run cloud-init, we probably can't use a cloud-init command to change the configuration.

Can we give arguments to waagent when deploying the machine?

So, we can add ephemeral0 to the mounts: section of our cloud-init (see here).

[ephemeral0, null] effectively doesn't mount it
[ephemeral0, /mnt/resource] mounts it to /mnt/resource. So we could mount it to /mnt/scratch or whatever you'd prefer. Works whether the VM really has an ephemeral disk or not - just an empty folder when there is no real extra disk.
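As an illustration, the two variants above could be generated with a small helper like the one below. This is only a sketch: the function name is hypothetical, /mnt/scratch is just one of the locations suggested earlier, and it says nothing about how the project actually templates its cloud-init files.

```python
from typing import Optional


def render_ephemeral_mount(mount_point: Optional[str]) -> str:
    """Render a cloud-config `mounts` fragment for the ephemeral0 disk.

    Passing None yields `[ephemeral0, null]`, i.e. the disk is not mounted.
    """
    target = "null" if mount_point is None else mount_point
    return f"mounts:\n  - [ephemeral0, {target}]\n"


print(render_ephemeral_mount("/mnt/scratch"))
```

The printed fragment would sit alongside any other entries already in the mounts: section.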

At this point the desktop icons for input, output and shared are not working