conductor-oss/conductor

workflow_index.update_time NOT NULL contraint breaking upgrade

lbestatlas opened this issue · 3 comments

Describe the bug
This bug has several aspects:

  1. When running the postgres v13 flyway upgrade we have found some workflows without update_time in the json_data column, which causes update_time to be null when the data is backfilled in the v13 script. The subsequent application of the NOT NULL constraint fails
  2. As we are running old and new versions of conductor concurrently as we rollout, workflows executed in the old version have failures indexing workflows as workflow_index.update_time is NULL
  3. The new code does not set the update_time in the case of an update only an insert, so is unlikely to fix the issue it was intended to fix.
  4. The backfill script on the workflow_index can lock up the table for minutes in environments where there is a large number of workflows.

We detected these issues when upgrade our service running v3.20.0 to v3.21.3. We run conductor with the following relevant properties as per the defaults
ConductorProperties.asyncIndexingEnabled = false
PostgresProperties.onlyIndexOnStatusChange = false

Details
Conductor version: 3.21.3
Persistence implementation: Postgres
Queue implementation: dynomite
Lock: redis
Workflow definition: n/a
Task definition: n/a
Event handler definition: n/a

To Reproduce
Run the flyway upgrade with running workflows

Expected behavior
The upgrade succeeds and no errors occur during the rollout where new and old versions are executing workflows

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
This is an examples of the error from the flyway upgrade

SQL State  : 23502
Error Code : 0
Message    : ERROR: column "update_time" of relation "workflow_index" contains null values
Location   : db/migration_postgres/V13__workflow_index_columns.sql (/opt/service/nested:/opt/service/service.jar/!BOOT-INF/lib/conductor-postgres-persistence-3.21.3.jar!/db/migration_postgres/V13__workflow_index_columns.sql)
Line       : 7
Statement  : Run Flyway with -X option to see the actual statement causing the problem

Our proposed fix is to:

  1. Remove the v13 migration
  2. Add a v14 migration which adds update_time with a default of 'epoch' if the column doesn't exist, and sets the default of 'epoch' if the column does exist
  3. Fix the PostgresIndexDao.indexWorkflow to also set the update_time during an UPDATE
  4. Remove the backfill entirely as this data does not really require to be backfilled for the purpose of this change

Hi, I meet the similar issue when I upgrade from v3.18.0 to v3.21.5
It seems that conductor oss detect the v3.18.0 db schema version is 12 and need do db migration. Then db migration from v12 to v13 failed.

7831 [main] INFO org.flywaydb.core.internal.command.DbValidate [] - Successfully validated 13 migrations (execution time 00:00.060s)
7857 [main] INFO org.flywaydb.core.internal.command.DbMigrate [] - Current version of schema "public": 12
7857 [main] WARN org.flywaydb.core.internal.command.DbMigrate [] - outOfOrder mode is active. Migration of schema "public" may not be reproducible.
7880 [main] INFO org.flywaydb.core.internal.command.DbMigrate [] - Migrating schema "public" to version "13 - workflow index columns"
8703 [main] ERROR org.flywaydb.core.internal.command.DbMigrate [] - Migration of schema "public" to version "13 - workflow index columns" failed! Changes successfully rolled back.
8713 [main] WARN org.springframework.boot.web.servlet.context.AnnotationConfigServletWebServerApplicationContext [] - Exception encountered during context initialization - cancelling refresh attempt: org.springframework.beans.factory.UnsatisfiedDependencyException: Error creating bean with name 'taskResource' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-rest.jar!/com/netflix/conductor/rest/controllers/TaskResource.class]: Unsatisfied dependency expressed through constructor parameter 0: Error creating bean with name 'taskServiceImpl' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-core.jar!/com/netflix/conductor/service/TaskServiceImpl.class]: Unsatisfied dependency expressed through constructor parameter 0: Error creating bean with name 'executionService' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-core.jar!/com/netflix/conductor/service/ExecutionService.class]: Unsatisfied dependency expressed through constructor parameter 0: Error creating bean with name 'workflowExecutorOps' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-core.jar!/com/netflix/conductor/core/execution/WorkflowExecutorOps.class]: Unsatisfied dependency expressed through constructor parameter 0: Error creating bean with name 'deciderService' defined in URL [jar:nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-core.jar!/com/netflix/conductor/core/execution/DeciderService.class]: Unsatisfied dependency expressed through constructor parameter 2: Error creating bean with name 'flywayForPrimaryDb' defined in class path resource [com/netflix/conductor/postgres/config/PostgresConfiguration.class]: Migration V13__workflow_index_columns.sql failed

SQL State : 23502
Error Code : 0
Message : ERROR: column "update_time" of relation "workflow_index" contains null values
Location : db/migration_postgres/V13__workflow_index_columns.sql (/app/libs/nested:/app/libs/conductor-server.jar/!BOOT-INF/lib/conductor-postgres-persistence.jar!/db/migration_postgres/V13__workflow_index_columns.sql)
Line : 7
Statement : ALTER TABLE workflow_index
ALTER COLUMN update_time SET NOT NULL

8714 [main] INFO com.zaxxer.hikari.HikariDataSource [] - HikariPool-1 - Shutdown initiated...

Found a workaround - we just query all workflows with NULL updateTime and all of them are in running state. We terminate each of the workflow so now they have updateTime. Now upgrade succeeds.