cloudfoundry/csb-brokerpak-azure

[BUG] Plan upgrades for Azure SQL Failover group fail or partially complete

claassen opened this issue ยท 18 comments

Describe the bug
When upgrading Azure SQL failover group service instance plans we noticed 2 cases where things do not complete as expected:

  1. Upgrading from Basic tier plan to Standard tier plan always fails with error:

Error: waiting for creation of MsSql Database "csb-fog-db-5d4a9d17-2252-4561-b800-7c7817503682" (MsSql Server Name "azugessqlprod" / Resource Group "MFC-CAC-PCF-PSB"): Code="SourceDatabaseEditionCouldNotBeUpgraded" Message="The source database
'azugessqlprod.csb-fog-db-5d4a9d17-2252-4561-b800-7c7817503682' cannot have higher edition than the target database 'azugessqlprodcae.csb-fog-db-5d4a9d17-2252-4561-b800-7c7817503682'. Upgrade the edition on the target before upgrading source." on main.tf line
15, in resource "azurerm_mssql_database" "primary_db": 15: resource "azurerm_mssql_database" "primary_db" { exit status 1

  1. Upgrading within the same tier plan completes successfully, and updates the primary database instance, but does not update the secondary database instance.

To Reproduce
Steps to reproduce the behavior:

  1. cf create-service csb-azure-mssql-db-failover-group Basic basic-db

  2. cf update-service basic-db -p StandardS0

  3. cf create-service csb-azure-mssql-db-failover-group StandardS0 standard-db

  4. cf update-service standard-db -p StandardS1

  5. Verify in Azure that both databases have been updated

Expected behavior
Plans updates should complete successfully and both databases in the failover group should be updated to the specified plan in Azure.

Additional context
BROKER_VERSION=0.2.3
BROKERPAK_VERSION=1.0.0-rc.38

According to Microsoft, when upgrading the service plan for databases in failover groups, the secondary database should be upgraded first, followed by the primary database. Possible the broker is not following this order leading to these issues.

For the case where the update fails completely when upgrading from Basic to Standard tier plan, we were able to successfully update the plans directly in Azure afterwords by updating the secondary DB first.

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/177493831

The labels on this github issue will be updated when the story is started.

Some more information:

If I create a new FOG DB service instance and try to update the plan, it will only update the plan for the primary DB in Azure. If I then manually update the plan for the secondary DB to match, and then create a new subsume instance for the existing DB, a subsequent plan update works successfully and updates the plans for both DBs in Azure.

I'm not sure how this could be accomplished in terraform - during creation, the primary has to be created first, then the secondary is created referencing the primary. During update, that order would have to be reversed which seems impossible.

Also, be careful updating across sku families, they are destructive in some cases.

We'll do some investigation and see how this can be improved.

@erniebilling A similar mechanism is implemented in Terraform for Postgres here

I don't see any way to make this work in terraform. The problem only occurs when you try to upgrade across DTU families (basic to standard for instance.) If I try to make the secondary database depend on the primary (so it updates first) terraform fails with a circular dependency. Best option at this point is to only offer plans within the same family, so upgrades can function. Other option would be for azurerm provider to be aware of what is going on and do things in the right order, but I'm not sure if that is possible.

@erniebilling the easiest way to fix this would be to implement it in the azure terraform provider. As mentioned before, I've implemented a similar pattern for Postgres!

Please file an issue like this one, I am interested in fixing it.

@erniebilling Even plan updates within the same family don't work as expected though, it only updates the primary DB.

@claassen a bug in the azurerm provider prevented updating the secondary db size, it looks like it has been fixed and will be included in the next release.

blgm commented

Issue hashicorp/terraform-provider-azurerm#11282 has been closed, so I've taken a look to see whether this moves things forward.

TLDR: it's fixed

I was able to re-create hashicorp/terraform-provider-azurerm#11282 using the following commands. There seems to have been a change in the names of the plans since this issue was raised. Also the error message that I re-created was the same as in hashicorp/terraform-provider-azurerm#11282, but not precisely the same as at the top of this issue.

$ cf create-service csb-azure-mssql-failover-group small foggy -b csb-root # OK
$ cf update-service foggy -p medium # eventually failed
$ cf service foggy
Showing info of service foggy in org pivotal / space acceptance-test as admin...

name:             foggy
service:          csb-azure-mssql-failover-group
tags:
plan:             small
description:      Manages auto failover group for managed Azure SQL on the Azure Platform
documentation:    https://docs.microsoft.com/en-us/azure/sql-database/sql-database-auto-failover-group/
dashboard:
service broker:   csb-root

Showing status of last operation from service foggy...

status:    update failed
message:   Error: waiting for creation of MsSql Database "csb-db" (MsSql Server Name "csb-azsql-fog-ea6aa16b-b1dd-464d-a735-5e133be3eb9e-secondary" / Resource Group
           "broker-cf-test"): Code="InvalidOperationForDatabaseInReplicationRelationship" Message="The operation cannot be performed since the database 'csb-db' is in a
           replication relationship."  on brokertemplate/definition.tf line 152, in resource "azurerm_mssql_database" "secondary_azure_sql_db": 152: resource
           "azurerm_mssql_database" "secondary_azure_sql_db" { exit status 1
started:   2021-04-30T11:26:07Z
updated:   2021-04-30T11:28:13Z

There are no bound apps for this service.

Upgrades are not supported by this broker.

In a development environment I updated the Terraform provider (azurerm) to version 2.57.0 (which included the fix for hashicorp/terraform-provider-azurerm#11282). I also had to remove the Terraform line that sets the size of the secondary DB. This is because the size of the secondary DB should be computed, and the Terraform provider 2.57.0 now fails when this is specified. Line removal diff:

@@ -155,7 +155,6 @@ resource "azurerm_mssql_database" "secondary_azure_sql_db" {
   sku_name            = local.sku_name
   tags                = var.labels
   create_mode         = "Secondary"
-  max_size_gb         = var.max_storage_gb
   creation_source_database_id  = azurerm_mssql_database.azure_sql_db.id
 }

After building the brokerpak and deploying, I was unable to reproduce the issue using the steps documented above. Previously the error occurred 3 times in 10 updates, and now it failed 0 times in 10 updates. This is suggestive that the latest version of the Terraform provider has fixed the issue. The next steps will be to release a new version of the brokerpak.

Thanks @blgm! Nice to see that the work payed off ๐Ÿ˜„

@blgm are the azure provider 2.57, and the other changes merged?

blgm commented

Hi @alex-tw-lam, they are not yet.

@blgm can we keep the issue open? We want to be notified when a fix is merged.

Upgrading within the same plan is now fixed in terms of actually updating both the primary and secondary DB instance, but still seeing issues upgrading from Basic to Standard tier with the latest release (1.0.0-rc39):

status:    update failed
message:   Error: waiting for creation of MsSql Database "csb-fog-db-bf834bab-cca9-4c3d-bcc6-1370cd561773" (MsSql Server Name "azugessqlsandbox" / Resource Group "MFC-USE-PCF-PSB"): Code="SourceDatabaseEditionCouldNotBeUpgraded" Message="The source database
           'azugessqlsandbox.csb-fog-db-bf834bab-cca9-4c3d-bcc6-1370cd561773' cannot have higher edition than the target database 'azugessqlsandboxdr.csb-fog-db-bf834bab-cca9-4c3d-bcc6-1370cd561773'. Upgrade the edition on the target before upgrading source."  on main.tf
           line 15, in resource "azurerm_mssql_database" "primary_db":  15: resource "azurerm_mssql_database" "primary_db" { exit status 1

Yeah, tested that as well and didn't implement the solution yet, as it's not clear whether Azure supports it: https://docs.microsoft.com/en-us/azure/azure-sql/database/active-geo-replication-overview#upgrading-or-downgrading-primary-database.

If you have a working example how to do it with az cli, I might spend some time on implementing it.

@aristosvo I'm not familiar with the internal workings of Terraform to know if how feasible this is or not, but it seems like the solution should be as simple as just ensuring the secondary DB is updated before the primary. Our workaround has been to manually update the secondary DB in Azure, followed by the primary in order to do plan updates across tiers.

Okay, looks like I have a solution here, but I need to test this extensively as this may contain a few edge cases.

I'm especially not experienced with elastic pools, so if you have experience scaling from single database to pools in FOG, let me know.

blgm commented

I've created a new issue #47 to continue the conversation since this issue was closed.