hashicorp/packer

[azure-arm] OS disk is preserved after packer build in versions 1.6.2 and above

miketimofeev opened this issue · 13 comments

Overview of the Issue

Since PR #9559, the OS disk created during a Packer build is no longer deleted at the end of the build, although according to the documentation it should be: https://www.packer.io/docs/builders/azure/arm

This unexpected behavior causes additional storage account costs that we were not aware of. As a result, we have 60 TB (!) of storage used by such temporary OS disk VHDs.
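To get a rough idea of how much storage such leftovers occupy, a blob listing can be filtered by name. The sketch below is only an illustration: it assumes the temporary OS disks follow the `pkros<suffix>.vhd` naming seen in the logs in this thread, and it works on plain name/size pairs rather than calling any Azure API. `Blob`, `leftover_os_disks` and `total_size_tib` are hypothetical names, not part of Packer.

```python
import re
from typing import Iterable, List, NamedTuple

# Assumption: temporary OS disks are named "pkros<suffix>.vhd",
# as in the log lines quoted in this issue.
TEMP_OS_DISK = re.compile(r"^pkros[a-z0-9]+\.vhd$")


class Blob(NamedTuple):
    name: str        # blob name inside the container
    size_bytes: int  # blob size as reported by the storage listing


def leftover_os_disks(blobs: Iterable[Blob]) -> List[Blob]:
    """Return blobs whose names look like Packer temporary OS disks."""
    return [b for b in blobs if TEMP_OS_DISK.match(b.name)]


def total_size_tib(blobs: Iterable[Blob]) -> float:
    """Sum the blob sizes and convert bytes to TiB."""
    return sum(b.size_bytes for b in blobs) / 2**40
```

Captured images are also stored as VHDs, so anything flagged this way should be checked before it is deleted.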

Reproduction Steps

Build with packer 1.5.5:

2020-12-22T11:31:52.6025731Z ==> azure-arm: Deleting resource group ...
2020-12-22T11:31:52.6029348Z ==> azure-arm:  -> ResourceGroupName : '15465_imagname.ubuntu20.0'
2020-12-22T11:31:52.6046953Z ==> azure-arm: 
2020-12-22T11:31:52.6048315Z ==> azure-arm: The resource group was created by Packer, deleting ...
2020-12-22T11:33:37.8882125Z ==> azure-arm: Deleting the temporary OS disk ...
2020-12-22T11:33:37.8887434Z ==> azure-arm:  -> OS Disk : 'https://storagname.blob.core.windows.net/images/pkrosecc9u3h1uh.vhd'
2020-12-22T11:33:38.0285142Z ==> azure-arm: Deleting the temporary Additional disk ...
2020-12-22T11:33:38.0288142Z ==> azure-arm: Removing the created Deployment object: 'pkrdpecc9u3h1uh'

Build with packer 1.6.2 and above:

2020-12-17T06:23:11.9415718Z ==> azure-arm: Deleting the temporary Additional disk ...
2020-12-17T06:23:11.9432558Z ==> azure-arm: Removing the created Deployment object: 'pkrdpeakhr4d0gh'
2020-12-17T06:23:27.2082893Z ==> azure-arm: 
2020-12-17T06:23:27.2099119Z ==> azure-arm: Cleanup requested, deleting resource group ...
2020-12-17T06:25:12.4393886Z ==> azure-arm: Resource group has been deleted.
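The difference between the two logs can be checked mechanically. This is a minimal sketch that only looks for the "Deleting the temporary OS disk" line printed by 1.5.5; the function name is hypothetical, and the check says nothing about versions that word their cleanup output differently.

```python
# Step message printed by packer 1.5.5 in the log above.
OS_DISK_STEP = "Deleting the temporary OS disk"


def ran_os_disk_cleanup(log_text: str) -> bool:
    """Return True if any log line mentions the OS disk cleanup step."""
    return any(OS_DISK_STEP in line for line in log_text.splitlines())
```

A missing line is only a hint; the contents of the storage account remain the authoritative check.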

Packer version

1.6.2 and above

Simplified Packer Buildfile

https://github.com/actions/virtual-environments/blob/main/images/linux/ubuntu2004.json

Operating system and Environment details

Any

Log Fragments and crash.log files

Provided in the Reproduction Steps

Looks like this cleanup logic was removed intentionally in #9559 😢

Hi there, we are aware of this issue and will be working on a fix in time for the next release. The persistence of the OS disk was not intentional.

Apologies for any inconvenience this has caused.

Related to #10268

Hi 👋 there's an open PR, #10713, with a potential fix for this issue. Would you mind pulling down one of the test binaries and confirming that the patch fixes the reported issue?

The test binaries can be downloaded from https://app.circleci.com/pipelines/github/hashicorp/packer/9492/workflows/fb3a2460-88f5-4788-affb-089a5a3ddd87/jobs/113627/artifacts

Cheers!

@nywilken thanks for the update! We'll give it a try and come back with the results

@nywilken @alexeldeib I've tested the Windows version of the provided binaries and, unfortunately, it doesn't do the job: the OS disk is still in place after the image generation.

Logs after generation with errors:

2021/03/10 11:34:08 ui: ==> azure-arm: Provisioning step had errors: Running the cleanup provisioner, if present...
2021/03/10 11:34:08 ui: ==> azure-arm: Removing the created Deployment object: 'pkrdpq9mrgdn6yc'
2021/03/10 11:34:23 ui: ==> azure-arm: ==> azure-arm: Cleanup requested, deleting resource group ...
2021/03/10 11:38:54 ui: ==> azure-arm: Resource group has been deleted.
2021/03/10 11:38:54 [INFO] (telemetry) ending azure-arm
2021/03/10 11:38:54 ui error: Build 'azure-arm' errored after 8 minutes 33 seconds: Script exited with non-zero exit status: 1.Allowed exit codes are: [0]
2021/03/10 11:38:54 ui: 
==> Wait completed after 8 minutes 33 seconds
2021/03/10 11:38:54 machine readable: error-count []string{"1"}
2021/03/10 11:38:54 ui error: 
==> Some builds didn't complete successfully and had errors:
2021/03/10 11:38:54 machine readable: azure-arm,error []string{"Script exited with non-zero exit status: 1.Allowed exit codes are: [0]"}
2021/03/10 11:38:54 ui error: --> azure-arm: Script exited with non-zero exit status: 1.Allowed exit codes are: [0]
2021/03/10 11:38:54 ui: 
==> Builds finished but no artifacts were created.

Logs after successful generation:

2021/03/10 12:22:19 ui: ==> vhd: Querying the machine's properties ...
2021/03/10 12:22:19 ui: ==> vhd:  -> ResourceGroupName : '15944470_AzP.20210310.win16.99'
2021/03/10 12:22:19 ui: ==> vhd:  -> ComputeName       : 'pkrvmj9y2hptl66'
2021/03/10 12:22:19 ui: ==> vhd:  -> OS Disk           : 'https://xxxxxxxx.blob.core.windows.net/images/pkrosj9y2hptl66.vhd'
2021/03/10 12:22:19 ui: ==> vhd: Querying the machine's additional disks properties ...
2021/03/10 12:22:19 ui: ==> vhd:  -> ResourceGroupName : '15944470_AzP.20210310.win16.99'
2021/03/10 12:22:19 ui: ==> vhd:  -> ComputeName       : 'pkrvmj9y2hptl66'
2021/03/10 12:22:19 ui: ==> vhd: Powering off machine ...
2021/03/10 12:22:19 ui: ==> vhd:  -> ResourceGroupName : '15944470_AzP.20210310.win16.99'
2021/03/10 12:22:19 ui: ==> vhd:  -> ComputeName       : 'pkrvmj9y2hptl66'
2021/03/10 12:23:56 ui: ==> vhd: Capturing image ...
2021/03/10 12:23:56 ui: ==> vhd:  -> Compute ResourceGroupName : '15944470_AzP.20210310.win16.99'
2021/03/10 12:23:56 ui: ==> vhd:  -> Compute Name              : 'pkrvmj9y2hptl66'
2021/03/10 12:23:56 ui: ==> vhd:  -> Compute Location          : 'centralus'
2021/03/10 12:24:56 ui: ==> vhd: Deleting the temporary Additional disk ...
2021/03/10 12:24:56 ui: ==> vhd: Removing the created Deployment object: 'pkrdpj9y2hptl66'
2021/03/10 12:25:11 ui: ==> vhd: Removing the created Deployment object: 'kvpkrdpj9y2hptl66'
2021/03/10 12:25:26 ui: ==> vhd: 
==> vhd: Cleanup requested, deleting resource group ...
2021/03/10 12:27:11 ui: ==> vhd: Resource group has been deleted.
2021/03/10 12:27:11 [INFO] (telemetry) ending azure-arm
2021/03/10 12:27:11 ui: Build 'vhd' finished after 28 minutes 38 seconds.

Interesting :/

I’d only tested the Linux ones so far, guess something must be off in the windows flow. I’ll try that today and see what’s up.

(update: there's no os-specific difference, looking at the DeployTemplate step and corresponding cleanup)

@miketimofeev on an initial attempt to repro on a successful Windows VHD build, cleanup seems to work for me. Did you verify the RG contents, besides the difference in log output? We did change the logic slightly from what was done before #9559, so the logs won't match.

It looks like your configuration uses temporary resource groups -- shouldn't the whole group be deleted? The disk deletion will be skipped in that case anyway per the logic in my PR, so don't expect a separate step to be executed. I'm trying to repro using your config but the temporary RG seems to be deleted fine.

@alexeldeib you're right, we use a temporary resource group during image generation (all our configurations can be found here: https://github.com/actions/virtual-environments/tree/main/images), and this group is deleted successfully, with no issues at all. But since we use an unmanaged disk, it is stored not in the resource group but in the specified storage account, and the OS disk VHD is not deleted from the storage account after the build.
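The distinction can be sketched in a few lines. This is one reading of the discussion, not the code from #10713: an unmanaged OS disk is addressed by storage account, container and blob name, so deleting the temporary resource group never touches it. `VhdLocation`, `parse_vhd_url` and `needs_explicit_os_disk_delete` are hypothetical names, and the managed-disk branch is an assumption.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlparse


class VhdLocation(NamedTuple):
    account: str    # storage account, first label of the host name
    container: str  # first path segment
    blob: str       # rest of the path


def parse_vhd_url(url: str) -> VhdLocation:
    """Split a blob URL such as the 'OS Disk' value in the logs into its parts."""
    parsed = urlparse(url)
    if not parsed.hostname:
        raise ValueError(f"not a blob URL: {url!r}")
    account = parsed.hostname.split(".")[0]
    container, _, blob = parsed.path.lstrip("/").partition("/")
    return VhdLocation(account, container, blob)


def needs_explicit_os_disk_delete(os_disk_vhd_url: Optional[str],
                                  temp_resource_group: bool) -> bool:
    """Decide whether the OS disk needs its own delete step.

    Unmanaged disk (a VHD URL is set): the blob lives in the storage
    account, so it must always be deleted explicitly.
    Managed disk (no VHD URL): assumed to disappear together with a
    temporary resource group, so a separate step is only needed otherwise.
    """
    if os_disk_vhd_url:
        return True
    return not temp_resource_group
```

With the URL from the 1.5.5 log, `parse_vhd_url` yields the account `storagname`, the container `images` and the blob `pkrosecc9u3h1uh.vhd`, none of which belong to the temporary resource group.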

Thanks for clarifying, and apologies, I'm a little rusty on unmanaged disks. I understand and got a repro, but it took a little time to figure out an appropriate place for the fix. I think the latest version of #10713 should work, albeit with some changes around template cleanup, and the artifacts from the latest build should bring back the expected behavior: https://app.circleci.com/pipelines/github/hashicorp/packer/9591/workflows/12365566-7a99-4cfa-b879-60430e29f0af/jobs/115023/artifacts

@alexeldeib Thank you! I'll try the new binaries and get back with the results.

@alexeldeib I tested the Windows binary in both cases, successful and failed, and everything worked like a charm! The disk was deleted along with the RG 🥳

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.