Possible issue with multiple services using the alert definition in the terraform modules

Question

Closed this issue 3 years ago · 13 comments

My Terraform setup looks like this:

logzio
    - modules
        - alerts
            main.tf
            outputs.tf
            README.md
            undesired_exception.tf
            variables.tf
    myservice1.tf
    myservice2.tf
    main.tf
    outputs.tf
    tfstate-backend-config.tf
    variables.tf

logzio/modules/alerts/main.tf

It's empty as of now.

logzio/modules/alerts/outputs.tf

output "alert_undesired_exception" {
  # The resource uses count, so its instances are referenced with a splat expression.
  value       = logzio_alert.undesired_exception[*].id
  description = "IDs of the undesired exception alerts created by this module"
}

logzio/modules/alerts/undesired_exception.tf

resource "logzio_alert" "undesired_exception" {
  count                          = var.alert_undesired_exception_enabled ? 1 : 0
  title                          = "${var.region_name}.${var.env}.${var.service_group_name}.${var.service_name}-undesired-exception"
  description                    = "Triggers alert if unhandled exceptions are logged"
  query_string                   = "some query"
  filter                         = "some filter"
  operation                      = "GREATER_THAN"
  notification_emails            = [var.target_warn_email_id]
  search_timeframe_minutes       = 5
  value_aggregation_type         = "NONE"
  alert_notification_endpoints   = [var.target_on_call_id]
  suppress_notifications_minutes = 5
  severity_threshold_tiers {
    severity  = "HIGH"
    threshold = var.alert_undesired_exception_warn_min_threshold_value
  }
}

logzio/modules/alerts/variables.tf

variable "alert_undesired_exception_enabled" {
  description = "A flag to enable alert related to exceptions logged multiple times"
  type = bool
  default = true
}
..
..
..

logzio/myservice1.tf

module "myservice1-alerts" {
  source = "./modules/alerts"

  env = var.environment_name
  region_name = var.aws_region
  service_group_name = var.service_group_name
  service_name = "myservice1"
  target_on_call_id = logzio_endpoint.target_on_call.id
  target_warn_email_id = var.warn_email_recepient

  alert_undesired_exception_enabled = var.alert_undesired_exception_enabled
  alert_undesired_exception_warn_min_threshold_value = var.alert_undesired_exception_warn_min_threshold_value
}

logzio/myservice2.tf

module "myservice2-alerts" {
  source = "./modules/alerts"

  env = var.environment_name
  region_name = var.aws_region
  service_group_name = var.service_group_name
  service_name = "myservice2"
  target_on_call_id = logzio_endpoint.target_on_call.id
  target_warn_email_id = var.warn_email_recepient

  alert_undesired_exception_enabled = var.alert_undesired_exception_enabled
  alert_undesired_exception_warn_min_threshold_value = var.alert_undesired_exception_warn_min_threshold_value
}

We are trying to create alerts using Terraform modules.
The Terraform state is stored in AWS S3.
I won't go into the S3 configuration in this issue unless you need it. I do see that the state is successfully saved and retrieved.

When I run terraform init, plan, and apply, I get the following error.

The error log is:

▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
logzio_endpoint.target_on_call: Creating...
logzio_endpoint.target_on_call: Creation complete after 0s [id=11506]
module.myservice1-alerts.logzio_alert.undesired_exception[0]: Creating...
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Creating...
module.myservice1-alerts.logzio_alert.undesired_exception[0]: Creation complete after 1s [id=637482]

Error: API call GetAlert failed with status code 404, data: "no alert id 637480 found for account 122848"

  on modules/alerts/undesired_exception.tf line 1, in resource "logzio_alert" "undesired_exception":
   1: resource "logzio_alert" "undesired_exception" {


▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲
Failed To Run Terraform Apply!
2020/04/07 17:40:10 Apply Error: Failed to run Terraform command: exit status 1

Interestingly, even though the apply failed, it actually created the required alert definitions for both services, as well as the alert endpoint, in Logz.io.
It failed when doing the GET for one of them.
The Terraform state in S3 clearly has the information for both alert definitions (along with their IDs) and the alert endpoint.

Re-running the same Terraform commands succeeds, as shown below:

▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Destroying... [id=637486]
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Destruction complete after 1s
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Creating...
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Creation complete after 0s [id=637505]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed.
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲

Please let me know if you need more data.

Answer 1 · 2020-04-24T18:56:18.000Z

@jonboydell ... any inputs?

Answer 2 · 2020-04-28T05:58:11.000Z

@jonboydell ... I did some troubleshooting, and the issue is intermittent.
In my example above, we are creating 3 resources in parallel.

Although the Logz.io API responds that the resource has been created (as shown in the snippet below), a GET for the alert definition fired immediately afterwards intermittently returns 404 (maybe the alert is not operable yet).

In the file resource_alert.go:

func resourceAlertCreate(d *schema.ResourceData, m interface{}) error {
  ...
  ...
  a, err := client.CreateAlert(createAlert)
  ...
  ...
  return resourceAlertRead(d, m)
}

In my example, I had 3 services using the same Terraform module to create alerts.
I wonder why this was not reported in the past. Am I the first to encounter it?

I also tried the following:

  provisioner "local-exec" {
    command = "echo 'Resource created ${var.service_name}' && sleep 5"
  }

But I later realized that the above block is executed only after resourceAlertCreate finishes its work, so it cannot delay the read that fails.
I then altered the Go code to add a sleep just before the call to resourceAlertRead. Since then, the apply has succeeded.

Answer 3 · 2020-04-28T06:01:04.000Z

@pengux and @islomar ... any inputs from your end for the above finding?

Answer 4 · 2020-04-28T08:49:58.000Z

No, I’ve seen that issue before, I just didn’t know how to fix it and didn’t have anyone at Logz.io to talk to about it. There’s no indication in the API docs that resources are “lazy” created. Ta Jon

Answer 5 · 2020-04-28T21:24:40.000Z

I raised a ticket with logzio support. I will wait for their response.

Answer 6 · 2020-05-18T14:36:05.000Z

The Logz.io team is changing their API to return 202 Accepted instead of 200 OK.
Their enhancement ticket is https://logzio.atlassian.net/browse/DEV-19912. I don't have access to this ticket either.

Answer 7 · 2020-06-17T14:54:08.000Z

Encountering the same issue. It would be better if the Terraform module polled until the alert is actually created, or something similar.

Answer 8 · 2020-12-08T23:16:27.000Z

I am also encountering this issue. After running tf apply several times, it eventually works.

Answer 9 · 2020-12-13T10:41:53.000Z

We are debating how to approach this problem and will soon start work on a fix. Anyone who wants to help move this forward is more than welcome to share their opinion here and contribute to the solution.

Answer 10 · 2020-12-13T19:38:46.000Z

@yyyogev - does the logz.io API create resources (alerts, in this case) asynchronously? It may be that Get Alert needs to poll the logz.io API endpoint until the TF provider gets an answer. That would delay creating the alert, but it would be consistent with the logz.io back end. I think this is how the AWS provider creates CloudFront resources...

Answer 11 · 2021-05-25T15:45:29.000Z

Hi @nischit7, the latest provider version (v1.2) has support for Logzio's Alerts V2 API, which should solve your issue.
Please let us know if you're having further issues.

Answer 12 · 2021-06-06T16:27:17.000Z

Hi @nischit7, @amarine7882, @dhoepelman,
Did the latest version fix this issue for you?

Thanks

Answer 13 · 2021-06-07T18:21:14.000Z

This issue has been resolved for my team by using the new V2 alert. Thanks!