logzio/terraform-provider-logzio

Possible issue with multiple services using the alert definition in the terraform modules

Closed this issue · 13 comments

My Terraform setup looks like this:

logzio
    - modules
        - alerts
            main.tf
            outputs.tf
            README.md
            undesired_exception.tf
            variables.tf
    myservice1.tf
    myservice2.tf
    main.tf
    outputs.tf
    tfstate-backend-config.tf
    variables.tf

logzio/modules/alerts/main.tf

It's empty as of now.

logzio/modules/alerts/outputs.tf

output "alert_undesired_exception" {
  value       = logzio_alert.undesired_exception[*].id
  description = "This is the reference to all the undesired exception alert created"
}

logzio/modules/alerts/undesired_exception.tf

resource "logzio_alert" "undesired_exception" {
  count = var.alert_undesired_exception_enabled ? 1 : 0
  title = "${var.region_name}.${var.env}.${var.service_group_name}.${var.service_name}-undesired-exception"
  description = "Triggers alert if unhandled exceptions are logged"
  query_string = "some query"
  filter = "some filter"
  operation = "GREATER_THAN"
  notification_emails = ["${var.target_warn_email_id}"]
  search_timeframe_minutes = 5
  value_aggregation_type = "NONE"
  alert_notification_endpoints = ["${var.target_on_call_id}"]
  suppress_notifications_minutes = 5
  severity_threshold_tiers {
    severity = "HIGH"
    threshold = var.alert_undesired_exception_warn_min_threshold_value
  }
}

logzio/modules/alerts/variables.tf

variable "alert_undesired_exception_enabled" {
  description = "A flag to enable alert related to exceptions logged multiple times"
  type = bool
  default = true
}
..
..
..

logzio/myservice1.tf

module "myservice1-alerts" {
  source = "./modules/alerts"

  env = var.environment_name
  region_name = var.aws_region
  service_group_name = var.service_group_name
  service_name = "myservice1"
  target_on_call_id = logzio_endpoint.target_on_call.id
  target_warn_email_id = var.warn_email_recepient

  alert_undesired_exception_enabled = var.alert_undesired_exception_enabled
  alert_undesired_exception_warn_min_threshold_value = var.alert_undesired_exception_warn_min_threshold_value
}

logzio/myservice2.tf

module "myservice2-alerts" {
  source = "./modules/alerts"

  env = var.environment_name
  region_name = var.aws_region
  service_group_name = var.service_group_name
  service_name = "myservice2"
  target_on_call_id = logzio_endpoint.target_on_call.id
  target_warn_email_id = var.warn_email_recepient

  alert_undesired_exception_enabled = var.alert_undesired_exception_enabled
  alert_undesired_exception_warn_min_threshold_value = var.alert_undesired_exception_warn_min_threshold_value
}

We are trying to create alerts using Terraform modules.
The Terraform state is stored in AWS S3.
I won't go into the S3 configuration in this issue unless you need it. I can see that the state is saved and retrieved successfully.

When I run terraform init, plan, and apply, the apply fails.

The error log is:

▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
logzio_endpoint.target_on_call: Creating...
logzio_endpoint.target_on_call: Creation complete after 0s [id=11506]
module.myservice1-alerts.logzio_alert.undesired_exception[0]: Creating...
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Creating...
module.myservice1-alerts.logzio_alert.undesired_exception[0]: Creation complete after 1s [id=637482]

Error: API call GetAlert failed with status code 404, data: "no alert id 637480 found for account 122848"

  on modules/alerts/undesired_exception.tf line 1, in resource "logzio_alert" "undesired_exception":
   1: resource "logzio_alert" "undesired_exception" {


▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲
Failed To Run Terraform Apply!
2020/04/07 17:40:10 Apply Error: Failed to run Terraform command: exit status 1

Interestingly, even though the apply failed, it actually created the required alert definitions for both services, as well as the alert endpoint, in Logz.io.
It failed when doing the GET for one of them.
The Terraform state in S3 clearly contains both alert definitions (along with their IDs) and the alert endpoint.

When I re-run the same Terraform commands, the apply succeeds:

▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ Terraform Apply ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼ ▼
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Destroying... [id=637486]
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Destruction complete after 1s
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Creating...
module.myservice2-alerts.logzio_alert.undesired_exception[0]: Creation complete after 0s [id=637505]

Apply complete! Resources: 1 added, 0 changed, 1 destroyed.
▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ Terraform Apply ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲ ▲

Please let me know if you need more data.

@jonboydell ... any inputs?

@jonboydell ... I did some troubleshooting, and the issue is intermittent.
In my example above, we are creating 3 resources in parallel.

Although the Logz.io API responds that the resource has been created (see the snippet below), a GET for the alert definition fired immediately afterwards intermittently returns 404 (maybe the alert is not readable yet).

In the file resource_alert.go:

func resourceAlertCreate(d *schema.ResourceData, m interface{}) error {
  ...
  ...
  a, err := client.CreateAlert(createAlert)
  ...
  ...
  return resourceAlertRead(d, m)
}

In my example, I had 3 services using the same terraform module to create alerts.
I wonder why this has not been reported before. Am I the first to encounter it?

I also tried the following:

  provisioner "local-exec" {
    command = "echo 'Resource created ${var.service_name}' && sleep 5"
  }

But I later realized that the above block is executed only after resourceAlertCreate has finished its work, so it cannot delay the read.
I then altered the Go code and added a sleep just before the call to resourceAlertRead. Since then, the apply succeeds.

@pengux and @islomar ... any inputs from your end for the above finding?

I raised a ticket with Logz.io support and will wait for their response.

The Logz.io team is changing their API to return 202 Accepted instead of 200 OK.
Their enhancement ticket is https://logzio.atlassian.net/browse/DEV-19912. I don't have access to this ticket.

I'm encountering the same issue. It would be better if the Terraform module polled until the alert is actually created, or something similar.

I am also encountering this issue. After running terraform apply several times, it eventually works.

We are discussing how to approach this problem and will start work on a fix soon. Anyone who wants to help move this forward is more than welcome to share their opinion here and contribute to the solution.

@yyyogev - does the logz.io API create resources (alerts, in this case) asynchronously? It may be that Get Alert needs to poll the logz.io API endpoint until the TF provider gets an answer. That would delay creating the alert but would be consistent with the logz.io back end. I think this is how the AWS provider creates CloudFront resources...

Hi @nischit7, the latest provider version (v1.2) has support for Logzio's Alerts V2 API, which should solve your issue.
Please let us know if you're having further issues.

Hi @nischit7, @amarine7882, @dhoepelman,
did the latest version fix this issue for you?

Thanks

This issue has been resolved for my team by using the new V2 alert. Thanks!