cloudfoundry/bosh-agent

Device path resolution times out for aws and ali

Opened this issue · 25 comments

We found that the resolution of the device path by ID times out on ali and aws. This is due to a mismatch in the disk ID and the symlink in /dev/disk/by-id.
Please see below the agent logs for aws:

2023-05-03_16:01:00.86472 [File System] 2023/05/03 16:01:00 DEBUG - Glob '/dev/disk/by-id/*vol-0762ac95a0e2714e6'
2023-05-03_16:01:00.86486 [virtioDevicePathResolver] 2023/05/03 16:01:00 DEBUG - Failed to get device real path by disk ID: 'vol-0762ac95a0e2714e6'. Error: 'Timed out getting real device path for 'vol-0762ac95a0e2714e6'', timeout: 'true'
2023-05-03_16:01:00.86487 [virtioDevicePathResolver] 2023/05/03 16:01:00 DEBUG - Using mapped resolver to get device real path
2023-05-03_16:01:00.86487 [File System] 2023/05/03 16:01:00 DEBUG - Checking if file exists /dev/disk/by-id/nvme-Amazon_Elastic_Block_Store_vol0762ac95a0e2714e6 

As can be seen, the disk ID contains the "-" character, but the symlink does not (see the last log entry above).
The issue for ali is slightly different.
First the agent logs:

2023-05-03_16:05:38.53543 [File System] 2023/05/03 16:05:38 DEBUG - Glob '/dev/disk/by-id/*d-gw887i4b4bb3n9acrozc'
2023-05-03_16:05:38.53552 [virtioDevicePathResolver] 2023/05/03 16:05:38 DEBUG - Failed to get device real path by disk ID: 'd-gw887i4b4bb3n9acrozc'. Error: 'Timed out getting real device path for 'd-gw887i4b4bb3n9acrozc'', timeout: 'true'

and here the content of the /dev/disk/by-id folder:
(screenshot of the /dev/disk/by-id directory listing)

We propose the following change: #309

This PR removes everything from the DiskID before and including the "-" character, if it exists.
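The proposed stripping can be sketched like this; a minimal illustration of the described behavior, not the actual PR code (the function name is hypothetical):

```go
package main

import (
	"fmt"
	"strings"
)

// stripIDPrefix drops everything up to and including the first "-" in the
// disk ID, if present, before the by-id glob lookup is built.
func stripIDPrefix(diskID string) string {
	if i := strings.Index(diskID, "-"); i >= 0 {
		return diskID[i+1:]
	}
	return diskID
}

func main() {
	fmt.Println(stripIDPrefix("vol-0762ac95a0e2714e6"))  // AWS-style ID
	fmt.Println(stripIDPrefix("d-gw887i4b4bb3n9acrozc")) // AliCloud-style ID
	fmt.Println(stripIDPrefix("plainid"))                // no dash: unchanged
}
```

IDs without a dash pass through untouched, so IaaSes whose IDs already match the symlinks are unaffected.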

Why is the cpi passing down the d-? Wouldn't it be better if that would have been stripped in the cpi instead of doing this in the agent?

What's the actual problem here? Is there a long timeout that your log snippet doesn't reveal? Based on the timestamps:

2023-05-03_16:01:00.86472
2023-05-03_16:01:00.86486
2023-05-03_16:01:00.86487
2023-05-03_16:01:00.86487

it looks like the timeout is a millisecond, so perhaps there are more log lines that show the real length of the timeout?

Also, we use the Bosh Agent in tests that deploy on many IaaSes (including AWS). Those tests aren't failing, so it doesn't seem like the issue described here is causing deployment failures on AWS. Are you running into deployment failures?

Hi @klakin-pivotal, hi @rkoster,
We are running into deployment failures on AliCloud. If I understand it correctly, the bosh agent tries to find the disk by ID first. This works on some IaaSes, for example Azure or GCP. As you can see in the logs above, it fails on AWS and AliCloud:
Failed to get device real path by disk ID
There is then a fallback: the bosh agent tries to get the disk by path. On AWS this is very stable; we have never seen an issue there. That is also why we think the search by ID never worked for AWS. On AliCloud, however, we sometimes get a wrong path too. AliCloud is aware of this issue and will try to fix it, but we do not want to rely on that. It regularly causes trouble on our side.

With our PR we can improve stability by cutting the prefix from the glob search text. The ID itself stays untouched; we only remove the prefix when building the glob pattern. The glob will still find the correct disk. AWS, for example, removes the "-" from the ID in its disk mounts: you can see in the logs above that the symlink is named nvme-Amazon_Elastic_Block_Store_vol0762ac95a0e2714e6 while the ID is vol-0762ac95a0e2714e6. AliCloud removes the prefix "d-". With our change the agent will find the disk directly by ID and the fallback is not necessary.

beyhan commented

Regarding:

Why is the cpi passing down the d-? Wouldn't it be better if that would have been stripped in the cpi instead of doing this in the agent?

To my understanding this would have to happen in the CPI create_disk method, which is an API documented on bosh.io with its return value disk_cid. I also think that manipulating the disk ID in the CPI can have many more side effects than adjusting the search pattern in the agent. The AWS CPI, for example, also has sophisticated logic to identify the disk. I don't think we could handle this in the CPIs.

@rkoster @klakin-pivotal @nouseforaname

Do you agree with us? Can we bring in the change? It would be very important for us to get this in soon.

I still don't fully understand where this problem is coming from. Who changed what, and why is this suddenly an urgent issue to fix? And what makes SAP's environment different from the rest of the community, so that you all are affected more than others (or is everybody affected but do people not notice)?

The proposed solution feels hacky/brittle, as it is based on a naming convention that is not documented. But how will we know when it fails in the future? None of our tests capture the current issue so merging the proposed bandaid solution to me feels really temporary.

Hi Ruben,

the issue is not related to the SAP environment. To reproduce it on both AWS and Ali, the following steps can be performed:

  1. Create a Deployment with one VM
  2. Create a Volume (from the Console)
  3. Attach the Volume to the VM with the BOSH attach-disk command
  4. Check the agent logs; the error "Failed to get device real path by disk ID: 'vol-'. Error: 'Timed out getting real device path for ''', timeout: 'true'" can be observed

Why would I want to create a disk out of band and attach it to bosh using bosh attach-disk?
As in, is there no way to trigger the issue when just interacting via the bosh director?
Also, the steps above still don't clarify why this has suddenly become such a big issue. As in, who changed what, and why?

That was just to simplify the test case. It does not matter where the disk comes from (e.g. whether it was orphaned before or created from a snapshot). It was/is a big issue for us because, for some reason, Ali started returning the wrong device path in some cases. This led to a scenario where both ways of resolving failed (resulting in the mentioned error). Meanwhile they have adapted their CPI, and we will probably just create a new CPI release to (hopefully) fix the disk mounting by device path. However, this issue will still be present afterwards, meaning that on AWS and Ali the resolution via /by-id/ will not work and the agent error "Failed to get device real path by disk ID: 'vol-'. Error: 'Timed out getting real device path for ''', timeout: 'true'" will be logged every time a disk gets attached. Therefore we still believe the change is necessary, but (hopefully) it is not as urgent anymore ;).

beyhan commented

I will try to summarise what I know.

Context:
From the BOSH docs: "The Agent can currently identify attached disks based on either their device path, disk's ID, or SCSI volume ID". See e.g. the VirtioDevicePathResolver, which shows that resolution by ID takes precedence.
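The precedence described here, try by-ID first, then fall back to the mapped (device path) resolver, can be sketched roughly as follows. The interface and type names are simplified stand-ins, not the agent's actual types:

```go
package main

import (
	"errors"
	"fmt"
)

// devicePathResolver is a simplified stand-in for the agent's resolver API.
type devicePathResolver interface {
	GetRealDevicePath(diskID, diskPath string) (string, error)
}

// byIDResolver resolves via /dev/disk/by-id symlinks (modeled as a map here).
type byIDResolver struct{ paths map[string]string }

func (r byIDResolver) GetRealDevicePath(diskID, _ string) (string, error) {
	if p, ok := r.paths[diskID]; ok {
		return p, nil
	}
	return "", errors.New("timed out getting real device path for " + diskID)
}

// mappedResolver is the fallback that maps the IaaS-reported device path.
type mappedResolver struct{ path string }

func (r mappedResolver) GetRealDevicePath(_, _ string) (string, error) {
	return r.path, nil
}

// chainResolvers returns the first successful resolution, mirroring the
// "by ID first, then mapped" precedence.
func chainResolvers(diskID, diskPath string, resolvers ...devicePathResolver) (string, error) {
	err := errors.New("no resolvers configured")
	for _, r := range resolvers {
		var p string
		if p, err = r.GetRealDevicePath(diskID, diskPath); err == nil {
			return p, nil
		}
	}
	return "", err
}

func main() {
	// By-ID lookup fails (no matching symlink), so the mapped resolver wins.
	p, _ := chainResolvers("vol-0762ac95a0e2714e6", "/dev/sdc",
		byIDResolver{paths: map[string]string{}},
		mappedResolver{path: "/dev/vdb"},
	)
	fmt.Println(p)
}
```

On AWS the chain silently succeeds at the second step every time, which is why the by-ID failure goes unnoticed there.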

Impact:
On AWS and Ali the resolution by ID doesn't work, and the agent falls back to device path resolution. To verify this, just look for the log lines here. On AWS we don't observe any issues because the resolution by path works reliably. On Ali, since Jammy, there are mismatches between the device path returned by the Ali cloud API and the path known inside the kernel. E.g. in some situations the kernel will skip /dev/vdc and use /dev/vdd while the Ali cloud API returns /dev/xvdc as the path (the Ali CPI amends it here). Or, in rare situations, the system disk gets the kernel path /dev/vdb and the data disk /dev/vda. In these situations the resolution by path fails, and so does the deployment.

Why we see this:
The Service Fabrik has tests for backup and restore, where attached disks get replaced by backup disks. In this scenario the tests fail very often because, after the detach/attach operation, the backup disk gets a kernel path like /dev/vdd while Ali returns a path like /dev/xvdc, so the resolution by path doesn't work. In rare situations deployments with disks fail because the system disk gets the /dev/vdb path and the data disk /dev/vda.

Fixes:
The Ali CPI fix aims to calculate the right path for the attached disk so that resolution by path works.

This issue is about the resolution by ID done by the agent, which currently fails on Ali and AWS as described. The Ali CPI fix will unblock the situation, but we could still make the resolution by ID work on AWS and Ali. If the suggested fix is not optimal, what would be a better way to address this?

By the way, we observe the same type of failure with the OpenStack CPI: stemcell 1.95 was OK, stemcell 1.108 fails.

beyhan commented

Do you have any technical details on why the resolution by-id and by-path don't work? Looking into the bosh agent changes from stemcell 1.95 to 1.108, I can say that only this commit is suspicious.

beyhan commented

After discussing this during the FI WG the option we see to fix this is described here.

Hey, does this commit have anything to do with this issue? cloudfoundry/bosh-alicloud-cpi-release@032e05c (Nevermind, I see that Beyhan mentioned it on the 16th.)

It seems weird to me for the Agent to fix up device IDs coming from the CPI (which gets data from the IAAS, which I think should be the source of truth), but maybe that's because I don't have enough information about the expectations that the Agent has of Disk IDs coming from the CPI.

The current understanding is that these prefixes are not something the CPI passes in, so somewhere in the interplay between the IaaS and the OS the prefixes get added. Given that the prefixes are IaaS-specific, the best place to configure them is the stemcell builder, which controls the static per-IaaS agent config. This PR adds the ability to configure a prefix to strip from the IDs.
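The configurable stripping could look roughly like this; a hedged sketch of a regex-driven strip along the lines of the stripVolumeRegex mentioned later in this thread, where the stemcell builder ships an IaaS-specific pattern. The function name and the example patterns are assumptions for illustration, not the merged code:

```go
package main

import (
	"fmt"
	"regexp"
)

// stripWithRegex removes whatever the configured IaaS-specific pattern
// matches from the disk ID before the by-id glob is built. An empty
// pattern (the default) leaves the ID untouched.
func stripWithRegex(pattern, diskID string) string {
	if pattern == "" {
		return diskID
	}
	return regexp.MustCompile(pattern).ReplaceAllString(diskID, "")
}

func main() {
	fmt.Println(stripWithRegex("^d-", "d-gw887i4b4bb3n9acrozc"))  // AliCloud
	fmt.Println(stripWithRegex("^vol-", "vol-0762ac95a0e2714e6")) // AWS
	fmt.Println(stripWithRegex("", "80f10b24-4105-43a0-a34f"))    // no-op
}
```

Keeping the pattern per-stemcell means IaaSes whose symlinks already carry the full ID simply ship an empty pattern.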

Actually, I'm running into almost the same situation on OpenStack.
When I deploy after updating the stemcell from ubuntu-jammy 1.102 to 1.108 or newer, it shows:

Error: Action Failed get_task: Task ce868444-20c7-4913-6089-03e916de82f5 result: Adjusting persistent disk partitioning: Getting real device path: Resolving mapped device path: Timed out getting real device path for /dev/sdc

  • For the deployment running with stemcell ubuntu jammy version 1.102 runs fine, and the bosh-agent throws this debugging messages:

DEBUG - Glob '/dev/disk/by-id/*80f10b24-4105-43a0-a'
DEBUG - Checking if file exists /dev/vdb
DEBUG - Stat '/dev/vdb'
DEBUG - Resolved disk {ID:80f10b24-4105-43a0-a34f-9b69555afc20 DeviceID: VolumeID:/dev/sdc Lun: HostDeviceID: Path:/dev/sdc ISCSISettings:{InitiatorName: Username: Target: Password:} FileSystemType: MountOptions:[] Partitioner:} by ID as '/dev/vdb'

as it should be detected:
ls -ls /dev/disk/by-id/
...
virtio-80f10b24-4105-43a0-a -> ../../vdb
...

  • When I change it to jammy version 1.108 failed and the debug message throws this:

DEBUG - Adjusting size for persistent disk {ID:80f10b24-4105-43a0-a34f-9b69555afc20 DeviceID: VolumeID:/dev/sdc Lun: HostDeviceID: Path:/dev/sdc ISCSISettings:{InitiatorName: Username: Target: Password:} FileSystemType: MountOptions:[] Partitioner:}
DEBUG - Triggering UdevDevice
DEBUG - Running command 'udevadm trigger'
DEBUG - Stdout:
DEBUG - Stderr:
DEBUG - Successful: true (0)
DEBUG - Settling UdevDevice
DEBUG - Running command 'udevadm settle'
DEBUG - Stdout:
DEBUG - Stderr:
DEBUG - Successful: true (0)
DEBUG - Glob '/dev/disk/by-id/*80f10b24-4105-43a0-a34f-9b69555afc20'
DEBUG - Glob '/dev/disk/by-id/*80f10b24-4105-43a0-a34f-9b69555afc20'
DEBUG - Glob '/dev/disk/by-id/*80f10b24-4105-43a0-a34f-9b69555afc20'
DEBUG - Glob '/dev/disk/by-id/*80f10b24-4105-43a0-a34f-9b69555afc20'
DEBUG - Glob '/dev/disk/by-id/*80f10b24-4105-43a0-a34f-9b69555afc20'

DEBUG - Failed to get device real path by disk ID: '80f10b24-4105-43a0-a34f-9b69555afc20'. Error: 'Timed out getting real device path for '80f10b24-4105-43a0-a34f-9b69555afc20'', timeout: 'true'

Now my only conclusion to this is that the resolution of the disk changed like it was suggested here: #310 (comment)

According to this change: https://github.com/cloudfoundry/bosh-agent/pull/308/files
the:

diskID := diskSettings.ID[0:20] corresponds to the first 20 characters of the ID, which is what gets looked up in the /dev/disk/by-id directory: *80f10b24-4105-43a0-a

But this was replaced by diskID := diskSettings.ID, so the glob uses the full ID 80f10b24-4105-43a0-a34f-9b69555afc20, which doesn't exist in the directory.

Question to @rkoster or anybody:
Do you think the PR will also solve this problem? It seems to be related.

Thanks in advance.

The above mentioned PR has been merged. Now we need a set of IaaS specific PRs to bosh-linux-stemcell-builder to define a regex to strip the volume prefix.

Hi @rkoster,
maybe I'm not seeing the full picture, but I don't understand how the prefix stripping will fix the regression introduced by #308 as reported in #310 (comment). Are you sure about that?

beyhan commented

There is also a discussion in the bosh-jammy Slack channel about this. I also think that the prefix stripping won't help here. Is there OpenStack documentation/specification about the way it behaves?

@turtschan and @andinod, we're not seeing this behavior in our OpenStack tests that we run, but OpenStack is not exactly uniform in its behavior. What versions of OpenStack are you seeing this behavior with?

Hi @jpalermo,

I'm dealing with different versions of OpenStack, and with all of them I see the same results. The most recent version tested is OpenStack Train.

But let me clarify something: at the beginning it was not possible for me to upgrade the stemcell version because the bosh agent failed to detect the persistent disk for the reasons I explained before.
I also realized that this was not only happening on upgrade; it also happens when I completely delete the deployment and create a new one.

So I started to investigate whether I could work around this or whether I had an issue in my config, because I got feedback from others that they didn't see the same results as me, so I guessed this could be a different issue, or a bosh or OpenStack configuration problem.

After studying the code and comparing it with the output logs of the failed deployment's VM, I realized that it follows more or less this sequence for persistent disks:
1- The VM starts.
2- The bosh agent tries to detect the disks.
3- Disk detection fails, and the agent sends a message to the director about it.
4- Because it fails, the director, through the CPI, tries to detect or guess what the disk could be.
5- The director sends the resulting disk to the bosh agent, and the agent tries to detect it or guess which one it could be.

In this last step it also failed. The disk sent by the bosh director, according to its detection, was /dev/sdc, and the agent tried the possible options: xvdc, vdc, etc. Of course, the disk is not present under any of those names but as /dev/vdb, and the real device the director should have sent would be /dev/sdb or /dev/vdb.

Because of this, I then checked the logic of the disk detection from the CPI to the Director. OpenStack returned the expected /dev/vdb to the CPI, and the Director didn't modify anything, so I guessed that something happened inside the CPI, and I found it.

After tracing the error, I found that we had an option configured in our director's CPI that was not present in the others. There is a line of code that modifies the disk detection in case that option is present; it is in this function: first_device_name_letter(server).

We had the option config_drive set to 'disk' in the CPI config; for some reason it was present, and that made the difference. This was an error present in the CPI config from the beginning, but the deployments never got that far, because the bosh agent was smart enough to detect the persistent disk properly before.

So after we unset the option, everything started to work. The CPI sent the right value to the Director, the Director passed it to the agent, and the disk was detected in the end.

All of this is to let you know that many people's deployments are still working with the new stemcells, probably because they don't have that option set in their CPI. But that doesn't mean the problem of the bosh agent detecting the disk is not present; I can still see it in the logs, and that definitely belongs to the agent.

In fact, since that change happened, it was not possible for us to move to a new stemcell version. Why? Because we were probably using a version where the agent still had the good disk detection, but with the newer versions and the new changes, the disk was not detected and deployment failed for us.

I hope this can help others as well.

Thanks for all the details, that was a super helpful writeup. I confirmed what you found, that our test environment is falling back to the mappedDevicePathResolver, and working because the letters match up correctly.

It feels like we want to fix this, but I'm not sure how yet. I don't think we can use the new stripVolumeRegex because Go's regexp doesn't allow lookbehinds. We'll talk about it at the Foundation Infrastructure WG meeting next week and see how we want to go about it.

Given that the stripVolumeRegex functionality is not being used yet, does it make sense to change it to use match groups instead? So that, if provided, the first match group's result will be used?
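The match-group idea could look roughly like this. Since Go's regexp engine (RE2) has no lookbehind, the pattern would instead capture the part of the ID to keep, and the agent would use the first capture group when present. The function name and example patterns are illustrative assumptions:

```go
package main

import (
	"fmt"
	"regexp"
)

// extractID returns the first capture group of the configured pattern if it
// matches, and the unmodified disk ID otherwise.
func extractID(pattern, diskID string) string {
	re := regexp.MustCompile(pattern)
	if m := re.FindStringSubmatch(diskID); len(m) > 1 {
		return m[1]
	}
	return diskID
}

func main() {
	// Keep everything after an IaaS prefix such as "d-" or "vol-".
	fmt.Println(extractID(`^(?:d|vol)-(.*)$`, "vol-0762ac95a0e2714e6"))
	// Keep only the first 20 characters, OpenStack-truncation style.
	fmt.Println(extractID(`^(.{20})`, "80f10b24-4105-43a0-a34f-9b69555afc20"))
}
```

Capture groups can express both the prefix-strip and the truncation cases, which a pure "strip what the regex matches" approach cannot do without lookbehind.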

We looked at fixing this in the Agent by using or modifying the stripVolumeRegex, but the problem is that the way the Agent currently globs devices depends on the suffix of the glob matching exactly. So if we're not sure whether the device ID is supposed to be truncated or not, we don't know whether we're supposed to supply a transformation.

This means that to fix it this way we'd need a different OpenStack stemcell for OpenStack environments that truncate the disk ID vs. ones that do not (and we suspect newer OpenStack environments don't).

We also looked at doing this work in the CPI via a configuration that would let the operator control whether the disk ID gets trimmed. The problem with this is that the ID is the core disk ID, not part of the disk hints. So if the CPI trims it, BOSH will later use the trimmed value for the delete_disk action, and the CPI won't be able to find the disk.

Since the environments aren't currently broken, we don't think we need a complicated fix for Jammy stemcells. We created an issue in the wishlist for the 2024 stemcell to revisit the logic in the Agent's idDevicePathResolver and make it more robust, so it can take multiple strategies for locating the disk. We don't feel confident making this change in a patch version of Jammy, though.

We encountered a similar problem to this comment with OpenStack's Queens.

Tracing the error also matched.

I am also unable to upgrade the stemcell because of the stripVolumeRegex issue.
Have you decided on a specific course of action since then?