cafienne/cafienne-engine

Case Engine reported unhealthy after duplicate key issue and turned healthy after more than 2 hours

Closed this issue · 2 comments

cbos commented

On our test server we saw that the Cafienne Engine was unable to handle requests.

The response to requests was:

 {"id":null,"status":"500","code":"Domain","title":"Refusing request, because Case System is not healthy","detail":null}

We tried to find out the cause, but the entries below are all we see in the logging.
The engine recovered by itself after more than 2 hours.

Questions we have:

  • Why did this issue occur?
  • Why is this blocking all actions?
  • Why was it valid to unblock these actions after 2 hours and 20 minutes, without any manual intervention? If releasing is valid, why could that not happen after 1 minute, or even after 1 second? And why was the block needed at all?

This issue caused unavailability of the service. Fortunately this was not on production; there it would have been a big problem.

Info from the logging:

2022-09-23 11:37:40,953|ERROR org.cafienne.cmmn.instance.Case - Failure in Case 11c4f1d8_858a_492d_9bd6_b549563484fa during persistence of event 31 of type akka.persistence.journal.Tagged. Stopping instance.
java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint 'PK__journal__9C6BF03F7FE66488'. Cannot insert duplicate key in object 'dbo.journal'. The duplicate key value is (11c4f1d8_858a_492d_9bd6_b549563484fa, 31).
        at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeBatch(SQLServerPreparedStatement.java:2085)
        at com.zaxxer.hikari.pool.ProxyStatement.executeBatch(ProxyStatement.java:128)
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeBatch(HikariProxyPreparedStatement.java)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.$anonfun$run$18(JdbcActionComponent.scala:542)
        at slick.jdbc.JdbcBackend$SessionDef.withPreparedStatement(JdbcBackend.scala:425)
        at slick.jdbc.JdbcBackend$SessionDef.withPreparedStatement$(JdbcBackend.scala:420)
        at slick.jdbc.JdbcBackend$BaseSession.withPreparedStatement(JdbcBackend.scala:489)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl.preparedInsert(JdbcActionComponent.scala:511)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.run(JdbcActionComponent.scala:536)
        at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:28)
        at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:25)
        at slick.dbio.SynchronousDatabaseAction$FusedAndThenAction.$anonfun$run$4(DBIOAction.scala:533)
        at slick.dbio.SynchronousDatabaseAction$FusedAndThenAction.$anonfun$run$4$adapted(DBIOAction.scala:533)
        at scala.collection.immutable.Vector.foreach(Vector.scala:1856)
        at slick.dbio.SynchronousDatabaseAction$FusedAndThenAction.run(DBIOAction.scala:533)
        at slick.dbio.SynchronousDatabaseAction$$anon$11.run(DBIOAction.scala:570)
        at slick.basic.BasicBackend$DatabaseDef$$anon$3.liftedTree1$1(BasicBackend.scala:276)
        at slick.basic.BasicBackend$DatabaseDef$$anon$3.run(BasicBackend.scala:276)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2022-09-23 11:37:40,954|ERROR o.c.system.health.HealthMeasurePoint - write-journal reported bad health
java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint 'PK__journal__9C6BF03F7FE66488'. Cannot insert duplicate key in object 'dbo.journal'. The duplicate key value is (11c4f1d8_858a_492d_9bd6_b549563484fa, 31).
        at com.microsoft.sqlserver.jdbc.SQLServerPreparedStatement.executeBatch(SQLServerPreparedStatement.java:2085)
        at com.zaxxer.hikari.pool.ProxyStatement.executeBatch(ProxyStatement.java:128)
        at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.executeBatch(HikariProxyPreparedStatement.java)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.$anonfun$run$18(JdbcActionComponent.scala:542)
        at slick.jdbc.JdbcBackend$SessionDef.withPreparedStatement(JdbcBackend.scala:425)
        at slick.jdbc.JdbcBackend$SessionDef.withPreparedStatement$(JdbcBackend.scala:420)
        at slick.jdbc.JdbcBackend$BaseSession.withPreparedStatement(JdbcBackend.scala:489)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl.preparedInsert(JdbcActionComponent.scala:511)
        at slick.jdbc.JdbcActionComponent$InsertActionComposerImpl$MultiInsertAction.run(JdbcActionComponent.scala:536)
        at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:28)
        at slick.jdbc.JdbcActionComponent$SimpleJdbcProfileAction.run(JdbcActionComponent.scala:25)
        at slick.dbio.SynchronousDatabaseAction$FusedAndThenAction.$anonfun$run$4(DBIOAction.scala:533)
        at slick.dbio.SynchronousDatabaseAction$FusedAndThenAction.$anonfun$run$4$adapted(DBIOAction.scala:533)
        at scala.collection.immutable.Vector.foreach(Vector.scala:1856)
        at slick.dbio.SynchronousDatabaseAction$FusedAndThenAction.run(DBIOAction.scala:533)
        at slick.dbio.SynchronousDatabaseAction$$anon$11.run(DBIOAction.scala:570)
        at slick.basic.BasicBackend$DatabaseDef$$anon$3.liftedTree1$1(BasicBackend.scala:276)
        at slick.basic.BasicBackend$DatabaseDef$$anon$3.run(BasicBackend.scala:276)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2022-09-23 11:37:40,956|WARN  org.cafienne.timerservice.TimerJob - Could not trigger timer [8d432095_f44e_4239_8963_36e7af8cae2d - 2022-09-23T11:37:40.862245151Z] in case 11c4f1d8_858a_492d_9bd6_b549563484fa:{
  "className" : "java.lang.Exception",
  "message" : "Handling the request resulted in a system failure. Check the server logs for more information.",
  "lines" : [ "org.cafienne.actormodel.StagingArea.handlePersistFailure(StagingArea.java:184)", "org.cafienne.actormodel.Warehouse.handlePersistFailure(Warehouse.java:45)", "org.cafienne.actormodel.ModelActor.handlePersistFailure(ModelActor.java:345)", "org.cafienne.actormodel.ModelActor.onPersistFailure(ModelActor.java:352)", "akka.persistence.Eventsourced.akka$persistence$Eventsourced$$writeEventFailed(Eventsourced.scala:840)", "akka.persistence.Eventsourced$ProcessingState$$anonfun$2.applyOrElse(Eventsourced.scala:875)", "scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:35)", "akka.persistence.Eventsourced$$anon$5.stateReceive(Eventsourced.scala:945)", "akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:245)", "akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:244)", "akka.persistence.AbstractPersistentActor.aroundReceive(PersistentActor.scala:295)", "akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)", "akka.actor.ActorCell.invoke(ActorCell.scala:548)", "akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)", "akka.dispatch.Mailbox.run(Mailbox.scala:231)", "akka.dispatch.Mailbox.exec(Mailbox.scala:243)", "java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)", "java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)", "java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)", "java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)", "java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)" ],
  "cause" : { }
}
2022-09-23 13:54:58,537|ERROR o.c.system.health.HealthMeasurePoint - write-journal is healthy health again

Info from the health check

curl http://localhost:2027/health
{
  "Status" : "NOK",
  "Description" : "Health indication of the Case Engine is currently NOK",
  "measure-points" : [ {
    "query-db" : {
      "Status" : "OK",
      "Description" : "query-db is healthy "
    }
  }, {
    "idp" : {
      "Status" : "OK",
      "Description" : "idp is healthy "
    }
  }, {
    "write-journal" : {
      "Status" : "NOK",
      "Description" : "Failure: Violation of PRIMARY KEY constraint 'PK__journal__9C6BF03F7FE66488'. Cannot insert duplicate key in object 'dbo.journal'. The duplicate key value is (11c4f1d8_858a_492d_9bd6_b549563484fa, 31).",
      "changed-at" : "2022-09-23T11:37:40.955196774Z"
    }
  }, {
    "read-journal" : {
      "Status" : "OK",
      "Description" : "read-journal is healthy "
    }
  } ]
}

The reason for the failure is unclear. It would be great if we could reproduce this situation, but with the current information that is not possible.

The reason for the 2-hour lockdown is (most probably) the current HealthMonitor implementation.
There is no internal trigger that checks whether health has improved. Probably something like that should be added.
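Such a trigger could look roughly like the sketch below (plain Java; the class and method names are hypothetical and not the actual Cafienne HealthMonitor API): a measure point that, once flagged unhealthy, periodically re-probes its backing resource instead of waiting for an unrelated write to happen to succeed.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: a measure point that re-probes itself on a timer,
// so recovery does not depend on some other internal action succeeding.
public class SelfProbingMeasurePoint {
    private volatile boolean healthy = true;
    private final BooleanSupplier probe; // e.g. a cheap "SELECT 1" against the journal db

    public SelfProbingMeasurePoint(BooleanSupplier probe) {
        this.probe = probe;
    }

    public void reportFailure() {
        healthy = false;
    }

    public boolean isHealthy() {
        return healthy;
    }

    // One probe attempt; flips back to healthy as soon as the probe succeeds.
    public void reprobeOnce() {
        if (!healthy && probe.getAsBoolean()) {
            healthy = true;
        }
    }

    // Schedule the probe while unhealthy, e.g. every few seconds.
    public void startReprobing(ScheduledExecutorService scheduler, long intervalSeconds) {
        scheduler.scheduleAtFixedRate(this::reprobeOnce,
                intervalSeconds, intervalSeconds, TimeUnit.SECONDS);
    }
}
```

With something like this, the 2-hour window would shrink to whatever probe interval is configured, rather than depending on the next timer or request that happens to write successfully.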

The fact that "all actions" are blocked is caused by the HealthMonitor closing all doors upon any issue.
We can consider changes to this algorithm, but that requires a further conversation. E.g., can we expose query APIs on case instances if the read journal is offline but the query db is online? And the other way around: should we encourage creating more case instances if the query db is offline?
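The per-subsystem alternative being discussed could be sketched as follows (illustrative Java only; the enum values mirror the measure points from the health endpoint above, but the class and its API are hypothetical): each request category is gated only on the subsystems it actually depends on.

```java
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: gate each request kind on its own dependencies
// instead of refusing every request when any measure point is NOK.
public class SelectiveHealthGate {
    public enum MeasurePoint { QUERY_DB, IDP, WRITE_JOURNAL, READ_JOURNAL }
    public enum RequestKind { QUERY_CASE, CREATE_CASE }

    private final Map<MeasurePoint, Boolean> status = new ConcurrentHashMap<>();

    public SelectiveHealthGate() {
        for (MeasurePoint mp : MeasurePoint.values()) status.put(mp, true);
    }

    public void report(MeasurePoint mp, boolean healthy) {
        status.put(mp, healthy);
    }

    // Queries only need the query db; creating cases needs the write journal.
    private Set<MeasurePoint> dependenciesOf(RequestKind kind) {
        switch (kind) {
            case QUERY_CASE:  return EnumSet.of(MeasurePoint.QUERY_DB);
            case CREATE_CASE: return EnumSet.of(MeasurePoint.WRITE_JOURNAL);
            default:          return EnumSet.allOf(MeasurePoint.class);
        }
    }

    public boolean allows(RequestKind kind) {
        return dependenciesOf(kind).stream().allMatch(status::get);
    }
}
```

Under this model the incident in this issue (write-journal NOK, query-db OK) would have kept the query APIs available while blocking only commands that need to persist events.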

cbos commented

"close all doors upon any issue"

Apparently this is not entirely the case: the engine was restored by an (internal) action, and that action was apparently not blocked.
So there is activity that can still continue even in block-mode.

In this case it was not a connectivity issue, but an exception that is a 'user error' in the usage of the database.
A duplicate insert is not a connectivity problem.
Yet it caused the system to think the database is unavailable.

If the database really is offline, you cannot do anything, that is true.
But in that case the engine should (actively) try to recover somehow.
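One way to separate the two situations is the SQLState of the `SQLException` (the `BatchUpdateException` in the logs above is a subclass): per the SQL standard, states starting with "08" are connection exceptions and states starting with "23" are integrity constraint violations (SQL Server's duplicate-key error 2627 reports SQLState 23000). A sketch of such a classifier, not the engine's actual code:

```java
import java.sql.SQLException;

// Hypothetical sketch: distinguish a 'user error' such as a duplicate-key
// insert from a real connectivity failure, using standard SQLState classes.
// Only a connectivity failure would justify marking the journal unhealthy;
// a constraint violation is a problem of one case instance, not of the db.
public class JournalFailureClassifier {

    public static boolean isConnectivityProblem(SQLException e) {
        String state = e.getSQLState();
        return state != null && state.startsWith("08"); // connection exception class
    }

    public static boolean isConstraintViolation(SQLException e) {
        String state = e.getSQLState();
        return state != null && state.startsWith("23"); // integrity constraint class
    }

    public static boolean shouldMarkUnhealthy(SQLException e) {
        // Stop the affected case instance on a duplicate key, but do not
        // declare the whole database unavailable.
        return isConnectivityProblem(e);
    }
}
```

With a check like this, the duplicate-key failure on case `11c4f1d8_858a_492d_9bd6_b549563484fa` would have stopped that one instance without flipping the engine-wide health to NOK.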