Azure/azure-postgresql

az postgres flexible-server upgrade InternalServerError

Opened this issue · 35 comments

Hello,

I am trying to upgrade a postgres flexible-server from v11 to a newer version, but I always get: (InternalServerError) An unexpected error occured while processing the request. Tracking ID: ''

More specifically:

  • I have created a v11 postgres flexible-server
  • Activated the PG_BUFFERCACHE, PG_STAT_STATEMENTS, PLPGSQL and POSTGIS extensions
  • Migrated from a v11 postgres single-server
  • Made attempts to upgrade the flexible server from 11 to 12, 13, 14 and 15 from the portal and from the az-cli but all attempts failed with the InternalServerError error.

Same issue here. Support has been terribly bad at handling this case for us.

Hello @charalamm, I have the same issue upgrading a flexible-server from v11 to v14. Azure support has not been able to help me.

@charalamm have you managed to get some info about this?

@fsismondi I have talked with support and they said it is because of a bug on their end. They said they will fix it for the next release around mid November

Hello, any update on this?

Having the same issues upgrading from R12 to R15. Takes ages to resolve.

> @fsismondi I have talked with support and they said it is because of a bug on their end. They said they will fix it for the next release around mid November

Did they say which year?

It's now April 2024, I just tried to upgrade 14 to 16:

{
  "code": "InternalServerError",
  "message": "An unexpected error occured while processing the request. Tracking ID: '3b54f416-f0a2-40a3-83ad-e9aa736f08ed'"
}
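
For anyone filing a support ticket, it may be worth quoting the tracking ID from the failed request. The following is only a sketch of how to pull it out of the error body; the helper name `extract_tracking_id` is made up for illustration, and the regular expression assumes the message format shown above.

```python
import json
import re
from typing import Optional

# Matches the quoted ID in the error message; the ID may be empty ('').
_TRACKING_ID = re.compile(r"Tracking ID: '([^']*)'")


def extract_tracking_id(error_body: str) -> Optional[str]:
    """Return the tracking ID from an error JSON body, or None if absent or empty."""
    payload = json.loads(error_body)
    # Some responses carry code/message at the top level, others nest them under "error".
    error = payload.get("error", payload)
    match = _TRACKING_ID.search(error.get("message", ""))
    if match is None or not match.group(1):
        return None
    return match.group(1)


example = """
{
  "code": "InternalServerError",
  "message": "An unexpected error occured while processing the request. Tracking ID: '3b54f416-f0a2-40a3-83ad-e9aa736f08ed'"
}
"""
print(extract_tracking_id(example))
```

An empty tracking ID, as in the first report in this thread, comes back as `None`, so a ticket template can tell "no ID returned" apart from a real one.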

I managed to do the upgrade today, after 38 days with support... Incredible service.

We are in a situation where we need to upgrade 10+ servers to v14. Creating new ones and doing a backup/restore would mean a lot of man-hours.

Disabling extensions does not seem to be enough to make the procedure work.

Could somebody from @microsoftopensource, maybe @ramnov @rachel-msft @ambrahma, provide some info on what is happening here?
Support has been useless regarding this issue.

@charalamm, @fsismondi we have a known issue where major version upgrades fail due to timeouts when the server has a large number of databases and/or schemas. We are working on a fix as a priority. Sorry for the inconvenience caused.
Can you raise a support ticket for your servers and share it here? I will personally follow up to make sure they are addressed.

Mine is 2402280050000780. I would like to know what happened and whether it has been resolved in a way that it will not happen again.

Ours is 2312180050002992, this ticket was closed though.
Support was -put plainly- useless.
We are interested in knowing a safe and reproducible procedure we can follow to start migrating all our servers.

Yeah, I'm currently doing manual migrations because of this, and it's a PITA

Same issue here 👍

Don't have a support ticket, but I have 2 postgres instances which I'm unable to dump to 16. Any ETA on the fix?

Apparently they are working on it at the moment. https://ruby.social/@clairegiordano@hachyderm.io/112254606338198662

Still not fixed after 8 months

After contacting Azure Premium Support, we were told they will fix this at the end of April. It's a problem on Microsoft's side 👍

This is a high-level error with different underlying issues. I would ask others to raise a support ticket as well so we can address them.
Add your ASC ticket here if you don't get traction and I will prioritize it.

I just tried to do an upgrade again and am still receiving the same error. Have there been any updates on the underlying issue?

We contacted support and they did something to the databases that had this issue, which allowed us to upgrade. So I guess it's solvable via support.

Just tested here and it worked.

We just ran into this tonight trying to upgrade 13 -> 16 and 13 -> 14, trying 13 -> 15 because we really need to fix a bug that was fixed in Postgres v14.

As a heads-up: I ran into this issue last week while trying to upgrade from Postgres 11 to 13. Azure support told me to, quote:

We request you to try the upgrade operation after following the below steps:
1. Please drop the extension “hypopg” in database level.
2. Then disable the “hypopg” extension in server. To disable follow the below:
Server parameter-->azure.extensions-->hypopg

After performing the above two steps, please try to upgrade the server and let us know if you face any issues.

And this worked fine. I was also able to re-install hypopg afterwards.
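
For reference, the second step can also be done from the az-cli instead of the portal. Below is a rough sketch that computes the new `azure.extensions` value without hypopg and builds the matching `az postgres flexible-server parameter set` call. The allowlist contents, resource group and server name are placeholders, not values from anyone's real setup.

```python
from typing import List


def without_extension(allowlist: str, extension: str) -> str:
    """Remove one extension (case-insensitive) from a comma-separated allowlist."""
    kept = [
        item.strip()
        for item in allowlist.split(",")
        if item.strip() and item.strip().lower() != extension.lower()
    ]
    return ",".join(kept)


def parameter_set_command(resource_group: str, server: str, value: str) -> List[str]:
    """Build the az CLI call that writes the azure.extensions server parameter."""
    return [
        "az", "postgres", "flexible-server", "parameter", "set",
        "--resource-group", resource_group,
        "--server-name", server,
        "--name", "azure.extensions",
        "--value", value,
    ]


# Hypothetical current allowlist and resource names, for illustration only.
current = "HYPOPG,PG_BUFFERCACHE,PG_STAT_STATEMENTS,POSTGIS"
new_value = without_extension(current, "hypopg")

# Step 1 from support: drop the extension in every database that has it.
print("DROP EXTENSION IF EXISTS hypopg;")
# Step 2 from support: remove it from the server-level allowlist.
print(" ".join(parameter_set_command("my-resource-group", "my-server", new_value)))
```

The command is returned as an argument list so it can be passed to `subprocess.run` unchanged. If hypopg is the only allowlisted extension, the resulting value is empty, and I have not checked how the CLI handles that case.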

We also encountered this problem. We use the pgrouting extension which cannot always be upgraded. We had to do the following steps for the upgrade:

  1. Drop the pgrouting extension on databases
  2. Perform the database upgrade in Azure Portal
  3. Re-add the extension to the databases
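
The three steps above could be scripted roughly as follows. This is only a sketch that prints an ordered plan; the database names, resource group and server name are placeholders, and the SQL lines still have to be run against each database (for example with psql).

```python
from typing import List


def extension_upgrade_plan(
    databases: List[str],
    extension: str,
    resource_group: str,
    server: str,
    target_version: str,
) -> List[str]:
    """Return the ordered steps: drop per database, upgrade the server, re-create per database."""
    # Without CASCADE, DROP EXTENSION fails if other objects still depend on the
    # extension, which is safer than silently dropping them.
    steps = [f'[{db}] DROP EXTENSION IF EXISTS "{extension}";' for db in databases]
    steps.append(
        "az postgres flexible-server upgrade"
        f" --resource-group {resource_group} --name {server} --version {target_version}"
    )
    steps += [f'[{db}] CREATE EXTENSION IF NOT EXISTS "{extension}";' for db in databases]
    return steps


# Placeholder names, for illustration only.
for step in extension_upgrade_plan(["app", "gis"], "pgrouting", "my-resource-group", "my-server", "16"):
    print(step)
```

Dropping an extension can also remove data stored in its types or tables, so it is worth checking what depends on it before running the first step.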

Same issue here, even when testing with a brand new 12.19 psql database without any extensions. Tried multiple times, upgrade to 13, 14, 15 and 16 all failed.

I have a new case open for 2 servers (one v14, one v15) which both refuse to upgrade.
They are currently escalating to the product team. We have enabled upgrade logs and get

The source cluster was not shut down cleanly. Failure, exiting

There are no extensions installed

FYI: we successfully migrated with the Terraform Provider from PG 11 to PG 16 this weekend 😄

I just tried it using Terraform as well, without success. Below is the full terraform code I used.
Applied it first with create_mode Default and version 12, made the indicated changes and applied again.

Terraform

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~>3.116.0"
    }
    random = {
      source  = "hashicorp/random"
      version = "~>3.0"
    }
  }
}

provider "azurerm" {
  features {}
}


data "azurerm_resource_group" "rg" {
  name = "rg-wds-dev-weu-001"
}

resource "azurerm_postgresql_flexible_server" "psql_upgrade_test" {
  name                = "psql-wds-upgrade-test-terraform-001"
  resource_group_name = data.azurerm_resource_group.rg.name
  location            = "westeurope"

  backup_retention_days        = 7
  geo_redundant_backup_enabled = false
  create_mode                  = "Update" # was "Default" on the first apply
  version                      = "16"     # was "12" on the first apply
  storage_mb                   = 32768
  sku_name                     = "B_Standard_B1ms"
  zone                         = 2

  administrator_login    = "thisismyadmin"
  administrator_password = "super-secret-password"

  public_network_access_enabled = true
}

resource "azurerm_postgresql_flexible_server_database" "psqldb_testdatabase" {
  name      = "testdatabase"
  server_id = azurerm_postgresql_flexible_server.psql_upgrade_test.id
  collation = "en_US.utf8"
  charset   = "utf8"
}

resource "azurerm_postgresql_flexible_server_firewall_rule" "psqlfr_azure_services" {
  name             = "Allow-public-azure-service-access"
  server_id        = azurerm_postgresql_flexible_server.psql_upgrade_test.id
  start_ip_address = "0.0.0.0"
  end_ip_address   = "0.0.0.0"
}

Output

╷
│ Error: updating Flexible Server (Subscription: "<redacted>"
│ Resource Group Name: "rg-wds-dev-weu-001"
│ Flexible Server Name: "psql-wds-upgrade-test-terraform-001"): polling after Update: polling failed: the Azure API returned the following error:
│
│ Status: "InternalServerError"
│ Code: ""
│ Message: "An unexpected error occured while processing the request. Tracking ID: 'd00608f4-b633-41a2-af69-557cd4ee258c'"
│ Activity Id: ""
│
│ ---
│
│ API Response:
│
│ ----[start]----
│ {"name":"f3a142f9-aacd-4eb2-a172-f84dd899a991","status":"Failed","startTime":"2024-08-27T09:00:24.43Z","error":{"code":"InternalServerError","message":"An unexpected error occured while processing the request. Tracking ID: 'd00608f4-b633-41a2-af69-557cd4ee258c'"}}
│ -----[end]-----
│
│
│   with azurerm_postgresql_flexible_server.psql_smartlab_api,
│   on main.tf line 23, in resource "azurerm_postgresql_flexible_server" "psql_upgrade_test":
│   23: resource "azurerm_postgresql_flexible_server" "psql_upgrade_test" {
│
╵
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~>3.0"
    }
  }
}

This is pretty old - we used "~> 3.116.0" - maybe that helps?

Just tried it with 3.116.0 as well but the issue is the same. I have reached out to Microsoft support as well and they are currently investigating the issue.


Still getting this error while trying to upgrade PG flexible server via the portal

We had an issue with the upgrade recently. Restarting the server and retrying the upgrade the next day resolved it.

After a lot of communication with Microsoft Support, they were finally able to upgrade my instance. Here is the feedback I received from them:

-> Initially you experienced an MVU (Major Version Upgrade) failure because the pending_restart parameter was set to true. This means the server needed a restart before the upgrade could proceed.
-> An engineer restarted the container, allowing you to try the MVU again.
-> During the retry, the MVU failed again due to insufficient disk space in the /tmp directory. This directory didn't have enough space to handle the upgrade process.
-> Memory Issue: The B1ms SKU (a specific server configuration) has less than 1 GB of memory available for the cluster, which can cause MVU failures if the memory is nearly full.
-> We have addressed this in an upcoming release which removes the dependency for MVU.
-> For the third MVU attempt, our engineer initiated the upgrade from the backend and the upgrade proceeded without issue.

This matches what some users are reporting here: simply restarting their instance fixed the problem.
Until they have shipped the new release which removes the dependency on available memory, temporarily upscaling your instance to one with more disk space and/or more memory might fix the problem as well.
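
To make that concrete, here is a rough sketch of the sequence as az-cli calls: restart, optionally scale up, upgrade, then scale back down. The SKU names, tiers, resource group and server name in the example are assumptions for illustration, so adjust them to your own setup. The SQL constant is a plain PostgreSQL query against `pg_settings` that can be run first to see whether any setting is still waiting for a restart.

```python
from typing import List, Optional, Tuple

# Lists settings whose changed value only takes effect after a restart.
PENDING_RESTART_SQL = "SELECT name FROM pg_settings WHERE pending_restart;"

Sku = Tuple[str, str]  # (sku name, tier)


def az(*args: str) -> List[str]:
    """Prefix the arguments with the flexible-server command group."""
    return ["az", "postgres", "flexible-server", *args]


def upgrade_with_workarounds(
    resource_group: str,
    server: str,
    target_version: str,
    original: Optional[Sku] = None,
    temporary: Optional[Sku] = None,
) -> List[List[str]]:
    """Build the commands: restart, optional scale-up, upgrade, optional scale-down."""
    target = ["--resource-group", resource_group, "--name", server]
    commands = [az("restart", *target)]
    scale = original is not None and temporary is not None
    if scale:
        commands.append(az("update", *target, "--sku-name", temporary[0], "--tier", temporary[1]))
    commands.append(az("upgrade", *target, "--version", target_version))
    if scale:
        commands.append(az("update", *target, "--sku-name", original[0], "--tier", original[1]))
    return commands


# Placeholder names and SKUs, for illustration only.
plan = upgrade_with_workarounds(
    "my-resource-group", "my-server", "16",
    original=("Standard_B1ms", "Burstable"),
    temporary=("Standard_D2s_v3", "GeneralPurpose"),
)
print(PENDING_RESTART_SQL)
for command in plan:
    print(" ".join(command))
```

Leaving out `original` and `temporary` gives just the restart followed by the upgrade, which is the variant some people in this thread reported as sufficient.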