Error: Unsupported attribute when attempting to destroy
Describe the bug
I followed the tutorial video https://github.com/GoogleCloudPlatform/hpc-toolkit/tree/v1.9.0/docs/videos/build-your-own-blueprint and made only small adjustments, not including the Rocky Linux image (see the blueprint below).
After testing that the image was functioning, I attempted to run ./ghpc destroy <deployment_name_dir> and got the error shown under "Actual behavior" below.
Steps to reproduce
Steps to reproduce the behavior:
- ./ghpc create genoslurm_blueprint.yaml
- submit a test job (a hypothetical config sketch follows this list):
gcloud batch jobs submit batch-job-3ad191a4 --config=/path/to/batch-job.yaml --location=us-central1 --project=<project>
- ./ghpc destroy genoslurm-batch-us-central1
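For reference, the batch-job.yaml passed to gcloud is the job config produced by the batch-job-template module. The real generated file was not captured here, but a minimal hand-written equivalent would look roughly like this (hypothetical, for illustration only):

taskGroups:
- taskSpec:
    runnables:
    - script:
        text: echo 'hello world'   # matches the blueprint's runnable
allocationPolicy:
  instances:
  - policy:
      machineType: n2-standard-4   # matches the blueprint's machine_type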
Expected behavior
deployment properly shut down and destroyed
Actual behavior
the object does not have an attribute named "job_data"
Version (ghpc --version)
ghpc version v1.36.1
Built from 'main' branch.
Commit info: v1.36.1-0-g493308e7
Blueprint
blueprint_name: genoslurm-blueprint

vars:
  project_id: <project>
  deployment_name: genoslurm-batch-us-central1
  region: us-central1
  zone: us-central1-a

deployment_groups:
- group: primary
  modules:
  - id: genoslurm-network-us-central1
    source: modules/network/vpc
  - id: appfs
    source: modules/file-system/filestore
    use: [genoslurm-network-us-central1]
    settings:
      local_mount: /apps
  - id: lustrefs
    source: community/modules/file-system/DDN-EXAScaler
    use: [genoslurm-network-us-central1]
    settings: {local_mount: /scratch}
  - id: batch-job
    source: modules/scheduler/batch-job-template
    use: [genoslurm-network-us-central1, appfs, lustrefs]
    settings:
      runnable: "echo 'hello world'"
      machine_type: n2-standard-4
    outputs: [instructions]
  - id: batch-login
    source: modules/scheduler/batch-login-node
    use: [batch-job]
    outputs: [instructions]
Expanded Blueprint
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
blueprint_name: genoslurm-blueprint
ghpc_version: v1.36.1-0-g493308e7
vars:
  deployment_name: genoslurm-batch-us-central1
  labels:
    ghpc_blueprint: genoslurm-blueprint
    ghpc_deployment: ((var.deployment_name))
  project_id: <project>
  region: us-central1
  zone: us-central1-a
deployment_groups:
- group: primary
  terraform_providers:
    google:
      source: hashicorp/google
      version: '>= 4.84.0, < 5.32.0'
      configuration:
        project: ((var.project_id))
        region: ((var.region))
        zone: ((var.zone))
    google-beta:
      source: hashicorp/google-beta
      version: '>= 4.84.0, < 5.32.0'
      configuration:
        project: ((var.project_id))
        region: ((var.region))
        zone: ((var.zone))
  modules:
  - source: modules/network/vpc
    kind: terraform
    id: genoslurm-network-us-central1
    settings:
      deployment_name: ((var.deployment_name))
      project_id: ((var.project_id))
      region: ((var.region))
  - source: modules/file-system/filestore
    kind: terraform
    id: appfs
    use:
    - genoslurm-network-us-central1
    settings:
      deployment_name: ((var.deployment_name))
      labels: ((var.labels))
      local_mount: /apps
      network_id: ((module.genoslurm-network-us-central1.network_id))
      project_id: ((var.project_id))
      region: ((var.region))
      zone: ((var.zone))
  - source: community/modules/file-system/DDN-EXAScaler
    kind: terraform
    id: lustrefs
    use:
    - genoslurm-network-us-central1
    settings:
      labels: ((var.labels))
      local_mount: /scratch
      network_self_link: ((module.genoslurm-network-us-central1.network_self_link))
      project_id: ((var.project_id))
      subnetwork_address: ((module.genoslurm-network-us-central1.subnetwork_address))
      subnetwork_self_link: ((module.genoslurm-network-us-central1.subnetwork_self_link))
      zone: ((var.zone))
  - source: modules/scheduler/batch-job-template
    kind: terraform
    id: batch-job
    use:
    - genoslurm-network-us-central1
    - appfs
    - lustrefs
    outputs:
    - name: instructions
    settings:
      deployment_name: ((var.deployment_name))
      job_id: batch-job
      labels: ((var.labels))
      machine_type: n2-standard-4
      network_storage: ((flatten([module.lustrefs.network_storage, flatten([module.appfs.network_storage])])))
      project_id: ((var.project_id))
      region: ((var.region))
      runnable: echo 'hello world'
      subnetwork: ((module.genoslurm-network-us-central1.subnetwork))
  - source: modules/scheduler/batch-login-node
    kind: terraform
    id: batch-login
    use:
    - batch-job
    outputs:
    - name: instructions
    settings:
      deployment_name: ((var.deployment_name))
      gcloud_version: ((module.batch-job.gcloud_version))
      instance_template: ((module.batch-job.instance_template))
      job_data: ((flatten([module.batch-job.job_data])))
      labels: ((var.labels))
      network_storage: ((flatten([module.batch-job.network_storage])))
      project_id: ((var.project_id))
      region: ((var.region))
      startup_script: ((module.batch-job.startup_script))
      zone: ((var.zone))
Output and logs
Testing if deployment group genoslurm-batch-us-central1/primary requires destroying cloud infrastructure
failed to destroy group "primary":

Error: exit status 1

Error: Unsupported attribute

  on main.tf line 64, in module "batch-login":
  64:   job_data = flatten([module.batch-job.job_data])
    ├────────────────
    │ module.batch-job is object with 3 attributes

This object does not have an attribute named "job_data".

Hint: terraform plan for deployment group genoslurm-batch-us-central1/primary failed
destruction of "genoslurm-batch-us-central1" failed
Execution environment
- OS: macOS
- Shell (to find this, run ps -p $$): /bin/sh
- go version: go version go1.22.5 darwin/arm64
Additional context
Apologies if this is a simple misunderstanding rather than a bug!
I'll take a look and try to reproduce the issue.
Did you deploy the blueprint after you created it?
./ghpc deploy genoslurm-batch-us-central1 --auto-approve
Yes, apologies, I forgot to include that in the steps above. It is currently deployed and I have successfully run jobs.
I haven't been able to reproduce the issue, but I will continue trying. In the meantime, could you try removing the resources in your project by hand, recreating the deployment folder (perhaps with a different deployment name) and trying to deploy and destroy again?
Another place to look is genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-job-template/outputs.tf: check whether the job_data output exists there. That's what the error you posted is complaining about.
Done. job_data exists in outputs.tf, as it did in the previous test:

output "job_data" {
  description = "All data associated with the defined job, typically provided as input to clout-batch-login-node."
  value = {
    template_contents = local.job_template_contents,
    filename          = local.job_filename,
    id                = local.submit_job_id
  }
}

Does this need to be a list?
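For context, the expanded blueprint passes this output to the login node wrapped in a list: job_data: ((flatten([module.batch-job.job_data]))). A sketch of what I assume the consuming side looks like (the variable name matches the expanded blueprint; its exact type is my assumption, not verified):

variable "job_data" {
  # in batch-login-node/variables.tf; list type assumed
  description = "Job data from one or more batch-job-template modules"
  type        = list(any)
}

# This would explain the generated call site from the error message:
#   job_data = flatten([module.batch-job.job_data])
# where flatten([...]) wraps the single output object into a one-element list.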
Thanks, one more quick question before I try and dive deeper. Could you try cloning a clean copy of the repository, building ghpc, and running the same steps with the original blueprint? The only thing you should change is the project name.
I took a look at the code and nothing obvious stood out. This message seems a bit odd to me: "module.batch-job is object with 3 attributes". batch-job is the sub-module and should have more than 3 attributes; really it is module.batch-job.job_data that should have 3 attributes. But maybe this is just Terraform producing a slightly odd error message.
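To illustrate what that wording means: in a Terraform expression, module.<name> evaluates to an object whose attributes are exactly the outputs declared in that module's outputs.tf. So if, hypothetically, the embedded copy of the module declared only three outputs (names below are purely illustrative):

output "instance_template" { value = "..." }
output "gcloud_version"    { value = "..." }
output "startup_script"    { value = "..." }

then module.batch-job would evaluate to an object with 3 attributes, and module.batch-job.job_data would fail exactly like this. That is why checking the embedded outputs.tf matters.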
@dvitale199, could you update your blueprint to print the job_data output? To do this, add job_data to the outputs list of the batch-job module (as shown below). Then, when you call ghpc deploy, it should print an output at the end containing the contents of job_data. There might be something interesting in there, like a null value that should be populated.
- id: batch-job
  source: modules/scheduler/batch-job-template
  use: [genoslurm-network-us-central1, appfs, lustrefs]
  settings:
    runnable: "echo 'hello world'"
    machine_type: n2-standard-4
  outputs: [instructions, job_data]
The other interesting thing is that the error message says Hint: terraform plan for deployment group genoslurm-batch-us-central1/primary failed. I am not sure whether it is possible to print out the plan; it may not be, since the message says it is failing to generate a plan in the first place. To try, call ghpc destroy (without --auto-approve) and, when prompted, select the display option (d). There might be a clue in there.
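Concretely (commands only; the exact prompt text may vary by version):

./ghpc destroy genoslurm-batch-us-central1   # note: no --auto-approve
# at the proposed-changes prompt, enter "d" to display the plan before deciding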
OK, I've done this, but when I use ghpc destroy it does not prompt me to display; it just fails with the same error. I've attached the display output from the deploy command below:
genoslurm-batch-test-create-display.txt
I've also tried this using the gcluster command instead of ghpc; I'm unsure if there is a difference, but I got the same result. I'm going to try one of the example configs and see if there's any difference.
Is there a potential path issue here? I store my .yaml in cluster-toolkit/ and run everything with ./ghpc from cluster-toolkit/.
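For reference, my layout is roughly (paths from memory, not exact):

cluster-toolkit/
├── ghpc                            # binary built at the repo root
├── genoslurm_blueprint.yaml        # blueprint stored at the repo root
└── genoslurm-batch-us-central1/    # deployment folder created by ghpc create
    └── primary/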
I'm very sorry for the trouble. I believe everything is working alright besides the destroy, which is only a minor inconvenience for testing. If I figure anything out I will comment back. I appreciate the help.
No worries about the trouble. I'll keep looking into this.
Which of the suggestions from @nick-stroud's and my responses did you try?
Another quick question is: which version of Terraform are you using?
I tried adding job_data to the outputs of batch-job, for which I attached the output above. I also tried running ghpc destroy without --auto-approve, but it fails before the option to display is given.
I have Terraform v1.3.7 on darwin_arm64
@dvitale199 A few questions that might lead to some clues:
- When recreating the deployment folder in your second test, did you use the -w flag, or did you delete the deployment folder before calling ghpc create? (The two paths are sketched after this list.)
- When you re-created the deployment folder, which version of ghpc did you use?
- Do you see the job_data variable in genoslurm-batch-us-central1/primary/modules/embedded/modules/scheduler/batch-login-node/variables.tf in your deployment folder?
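A sketch of the two recreate paths I mean (the -w flag usage here is an assumption about your invocation, not a prescription):

# path 1: overwrite the existing deployment folder in place
./ghpc create genoslurm_blueprint.yaml -w

# path 2: delete the folder first, then create from scratch
rm -rf genoslurm-batch-us-central1
./ghpc create genoslurm_blueprint.yaml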
- I deleted the previous directory and recreated it with ghpc create
- the latest version from GitHub
- yes, job_data was there. I cannot share it because I deleted it by mistake while doing further testing
Since my last comment, I've pulled a fresh clone and tested with examples/hpc-slurm.yaml, and I have not run into any issues creating and destroying that config.
I'm wondering if passing use: [genoslurm-network-us-central1, appfs, lustrefs] to the batch nodeset had anything to do with it? Are

use: [genoslurm-network-us-central1, appfs, lustrefs]

and

use:
- genoslurm-network-us-central1
- appfs
- lustrefs

synonymous? Thanks for all your help.
That's great to hear you were able to get the expected behavior from a fresh clone!
Regarding the use block: yes, the two forms in your examples are synonymous. They are just two different ways to represent a list in YAML.
I will close this issue since it appears to be resolved, but please feel free to reopen or create a new issue if you encounter other problems!