civo/terraform-provider-civo

[BUG] Using a wrong cidr_v4 does not get validated

Closed this issue · 5 comments


Using a completely wrong cidr_v4 causes Terraform to keep trying to provision this resource for two minutes rather than failing straight away:

resource "civo_network" "volume-issue" {
  label = "blah-network"
  cidr_v4 = "sfdsdfsdf"

}

This should fail straight away as it does on the API, CLI, etc.

Acceptance Criteria

  • Have this resource fail straight away when entering the wrong value.

This ticket should be used as a reference for a bigger piece of work revamping the way we handle errors in the Terraform provider. The default behaviour of our provider seems to be to keep retrying, and it does NOT return the error the API gives, which is a problem.

We need to go through the entire provider and change this behaviour everywhere: instead of writing error messages in the Terraform provider, we should return the errors from the API.
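
For reference, the acceptance criterion above could also be met with a plan-time check using the SDK's validation helpers. This is only a sketch; the package name, helper name, and schema flags are assumptions, not the provider's current schema:

package civo

import (
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/schema"
	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/validation"
)

// cidrV4Schema sketches how the cidr_v4 field could reject invalid input at
// plan time instead of waiting for the API to do so.
func cidrV4Schema() *schema.Schema {
	return &schema.Schema{
		Type:     schema.TypeString,
		Optional: true,
		// Reject anything that is not a valid CIDR network with a prefix
		// between /22 and /28 (the range the API error message mentions).
		ValidateFunc: validation.IsCIDRNetwork(22, 28),
	}
}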

Hi team,

I've been working on this issue and would like to share my approach for your review.

Outcome:

  1. Improved error handling.
  2. This resolves issues 269, 263, 256, and 249. (Screenshots attached below)

Approach:

  1. Retry Mechanism: I have implemented a centralized retry mechanism in our utils using the Terraform Plugin SDK's retry package, so it can be used across the provider (a minimal sketch is included below).

  2. Error Classification: The Civo API returns more than 240 different types of errors, so I have categorized them into Retryable and Non-Retryable groups. This is not a perfectly accurate categorization, but I have tried to be lenient to strike a balance. The approach is extensible: errors can be added to or removed from a group at any time without affecting the core functionality.

  3. Error Message Parsing: We get errors from the Civo API in the following form:

Error: Unknown error response - status: 400 Bad Request, code: 400, reason: {"code":"invalid_v4_cidr_provided","reason":"The provided v4 CIDR must be a valid CIDR with a subnet mask comprehended between /22 and /28"}

So I have tried to extract the JSON part of it and return only the code and reason. That way the user has enough information about what went wrong, and the error message doesn't look cluttered.
Is this okay, or should I return the complete error?

The complete code can be seen in the gist here.
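
For context, a minimal sketch of what these utils could look like is below. The real implementation is in the gist; the ParseErrorResponse signature follows the usage later in this thread, while the apiError type, the contents of nonRetryableCodes, and the RetryUntilSuccessOrTimeout wrapper are illustrative assumptions, not the actual categorization:

package utils

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"strings"
	"time"

	"github.com/hashicorp/terraform-plugin-sdk/v2/helper/retry"
)

// apiError mirrors the JSON payload embedded in Civo API error strings,
// e.g. {"code":"invalid_v4_cidr_provided","reason":"..."}.
type apiError struct {
	Code   string `json:"code"`
	Reason string `json:"reason"`
}

// ParseErrorResponse extracts the JSON part of a raw Civo API error string
// and returns a cleaner error containing only the code and reason.
func ParseErrorResponse(raw string) (error, error) {
	start := strings.Index(raw, "{")
	end := strings.LastIndex(raw, "}")
	if start == -1 || end <= start {
		return nil, errors.New("no JSON payload found in error message")
	}

	var parsed apiError
	if err := json.Unmarshal([]byte(raw[start:end+1]), &parsed); err != nil {
		return nil, fmt.Errorf("failed to parse error payload: %w", err)
	}
	return fmt.Errorf("%s: %s", parsed.Code, parsed.Reason), nil
}

// nonRetryableCodes holds the Non-Retryable group; only one example code is
// shown here, and codes can be added or removed without touching the retry logic.
var nonRetryableCodes = map[string]struct{}{
	"invalid_v4_cidr_provided": {},
}

// isRetryable reports whether an API error belongs to the Retryable group.
func isRetryable(err error) bool {
	for code := range nonRetryableCodes {
		if strings.Contains(err.Error(), code) {
			return false
		}
	}
	return true
}

// RetryUntilSuccessOrTimeout wraps the SDK's retry helper so every resource
// can share the same retry policy instead of re-implementing it.
func RetryUntilSuccessOrTimeout(ctx context.Context, timeout time.Duration, fn func() error) error {
	return retry.RetryContext(ctx, timeout, func() *retry.RetryError {
		if err := fn(); err != nil {
			if isRetryable(err) {
				return retry.RetryableError(err)
			}
			return retry.NonRetryableError(err)
		}
		return nil
	})
}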


Regarding Error Message Rewriting:

We need to go through the entire provider and change this behaviour everywhere: instead of writing error messages in the Terraform provider, we should return the errors from the API.

I am a bit confused: would you like to refactor error handling throughout the provider as shown below?

Current:

kubeCluster, err := apiClient.FindKubernetesCluster(name.(string))
if err != nil {
	return diag.Errorf("[ERR] failed to retrive kubernetes cluster: %s", err)
}

Proposed:

kubeCluster, err := apiClient.FindKubernetesCluster(name.(string))
if err != nil {
	return diag.FromErr(err)
}
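
For illustration, and assuming a parsing helper like the ParseErrorResponse sketched earlier, the two could also be combined so the user sees the API's own code and reason rather than a hand-written message:

kubeCluster, err := apiClient.FindKubernetesCluster(name.(string))
if err != nil {
	// Surface the cleaned-up API error; fall back to the raw error if parsing fails.
	if customErr, parseErr := utils.ParseErrorResponse(err.Error()); parseErr == nil {
		err = customErr
	}
	return diag.FromErr(err)
}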


Screenshots for issues 263, 256, 249 & 269:

You can see them erroring out. This prevents ghost resource creation and quota exhaustion.

Issue 263:

Screenshot 2024-07-22 155133

Issue 256:

Screenshot 2024-07-22 164114

Issue 249:

Screenshot 2024-07-22 153032

Issue 269:
Screenshot 2024-07-22 181416

Regards,
Praveen

Hi @Praveen005
Your approach looks fine to me, but one small correction: we have decided to remove the retry for all of the resources. So you just have to remove the retry mechanism wherever it is implemented and return the error we get from the API to the user in a clean format.

Hi @uzaxirr,

Thanks for the input.

We currently have retries only in resource_instance.go and resource_network.go.
I will replace the relevant parts with the following:

instance, err := apiClient.CreateInstance(config)
if err != nil {
	// Extract the code/reason from the raw API error; fall back to the
	// original error if parsing fails.
	customErr, parseErr := utils.ParseErrorResponse(err.Error())
	if parseErr == nil {
		err = customErr
	}
	return diag.Errorf("[ERR] failed to create an instance: %s", err)
}

In resource_kubernetes_cluster_nodepool.go we have:

// Add retry logic here to delete the node pool
err = retry.RetryContext(ctx, d.Timeout(schema.TimeoutDelete)-time.Minute, func() *retry.RetryError {
	_, err := apiClient.GetKubernetesClusterPool(getKubernetesCluster.ID, d.Id())
	if err != nil {
		if errors.Is(err, civogo.DatabaseClusterPoolNotFoundError) {
			log.Printf("[INFO] kubernetes node pool %s deleted", d.Id())
			return nil
		}
		log.Printf("[INFO] error trying to read kubernetes cluster pool: %s", err)
		return retry.NonRetryableError(fmt.Errorf("error waiting for Kubernetes node pool to be deleted: %s", err))
	}
	log.Printf("[INFO] kubernetes node pool %s still exists", d.Id())
	return retry.RetryableError(fmt.Errorf("kubernetes node pool still exists"))
})
if err != nil {
	return diag.FromErr(err)
}

I feel we shouldn't delete this; many times it errors out because of its dependency on other resources. Or should we?

Also, is the error format below okay?

Screenshot 2024-07-22 181416

Is there anything else I should take care of?

@Praveen005 The above seems fine, can you raise a PR?

Hi @uzaxirr,

I have raised the PR. When you have a moment, could you please review it?

Regards,
Praveen