0.6.0 alpha-server constantly throw NPE in "Global Tx Timeout Situation"
Richard-Yi opened this issue · 6 comments
Testing scenarios
I was testing how ServiceComb behave when the global Tx is timed out.
There were 3 services:
- order service
- account service'
- storage service
The order service calls the account service and the storage service in turn.
- The order service creates an order locally
- The account service deducts the corresponding amount of money of the corresponding user
- The storage service deducts the inventory of the corresponding count of the corresponding product
The testing scenarios is as below:
-
Order Service set the global Tx timout to
10s
via@SagaStart(timeout = 10)
-
storage service will simulate a timeout of 10 seconds via
Thread.sleep()
code
order serivce
@SagaStart(timeout = 10)
@Transactional(rollbackFor = Exception.class)
public boolean create(Order order) {
// 1. 创建订单
orderLocalService.createLocal(order);
// 扣钱
String result = accountClient.decrease(order.getUserId(), order.getMoney());
if (Objects.equals(result, "fail")) {
log.error("扣库存失败,回滚本地事务");
throw new RuntimeException("扣钱失败");
}
// 扣库存
result = storageClient.decrease(order.getProductId(), order.getCount());
if (Objects.equals(result, "fail")) {
log.error("扣库存失败,回滚本地事务");
throw new RuntimeException("扣库存失败");
}
return true;
}
account service
@Override
@Transactional(rollbackFor = Exception.class)
@Compensable(compensationMethod = "compensate")
public void decrease(Long userId, BigDecimal money) {
accountMapper.decrease(userId, money);
}
storage service
@Transactional(rollbackFor = Exception.class)
@Compensable(compensationMethod = "compensate")
public void decrease(Long productId, Integer count) {
log.info("执行 storage 正常流程, 全局事务ID: {}, 分支事务ID: {}", omegaContext.globalTxId(), omegaContext.localTxId());
storageMapper.decrease(productId, count);
// 11s time out
try {
Thread.sleep(11000);
} catch (InterruptedException e) {
e.printStackTrace();
}
}
Problem Description
In the above test case, when the global transaction times out, alpha-server will continuously throw NPE exceptions. The exception log is as follows:
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor : recovery SagaStartedEvent{timeout=10} {"@type":"SagaStartedEvent","serviceName":"order-server","instanceId":"order-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","createTime":1612851846898,"timeout":10,"type":"SagaStartedEvent"}
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor : recovery {"@type":"TxStartedEvent","serviceName":"account-server","instanceId":"account-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"dc5e50b7-260e-4f8a-9e42-ead972d6ab28","createTime":1612851846912,"compensationMethod":"public void com.ricstudio.account.service.AccountServiceImpl.compensate(java.lang.Long,java.math.BigDecimal)","payloads":"AQMJAgEAamF2YS5tYXRoLkJpZ0RlY2ltYewBAmQA","retryMethod":"","forwardRetries":0,"forwardTimeout":0,"reverseRetries":0,"reverseTimeout":0,"retryDelayInMilliseconds":5,"type":"TxStartedEvent"}
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor : recovery {"@type":"TxEndedEvent","serviceName":"account-server","instanceId":"account-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"dc5e50b7-260e-4f8a-9e42-ead972d6ab28","createTime":1612851846944,"type":"TxEndedEvent"}
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-12] o.a.s.p.a.f.SagaActor : recovery {"@type":"TxStartedEvent","serviceName":"storage-server","instanceId":"storage-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"2d815537-efa5-43ca-a2e4-3078679c8903","createTime":1612851846950,"compensationMethod":"public void com.ricstudio.storage.service.StorageServiceImpl.compensate(java.lang.Long,java.lang.Integer)","payloads":"AQMJAgIU","retryMethod":"public void com.ricstudio.storage.service.StorageServiceImpl.decrease(java.lang.Long,java.lang.Integer)","forwardRetries":3,"forwardTimeout":0,"reverseRetries":0,"reverseTimeout":0,"retryDelayInMilliseconds":5,"type":"TxStartedEvent"}
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-12] o.a.s.p.a.f.SagaActor : recovery null
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor : recovery completed [20fe99d8-1339-4bda-9a50-285b66eeab3f] state=SUSPENDED
2021-02-09 14:24:36.871 ERROR ruichadeMacBook.local --- [t-dispatcher-14] o.a.s.p.a.f.SagaActor : stop [20fe99d8-1339-4bda-9a50-285b66eeab3f] fail
2021-02-09 14:24:36.871 ERROR ruichadeMacBook.local --- [t-dispatcher-14] a.a.OneForOneStrategy : null
java.lang.NullPointerException
at org.apache.servicecomb.pack.alpha.core.fsm.repository.model.SagaSubTransaction$Builder.build(SagaSubTransaction.java:103)
at org.apache.servicecomb.pack.alpha.fsm.spring.integration.akka.SagaDataExtension$SagaDataExt.lambda$stopSagaData$0(SagaDataExtension.java:76)
at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
at org.apache.servicecomb.pack.alpha.fsm.model.TxEntities.forEach(TxEntities.java:32)
at org.apache.servicecomb.pack.alpha.fsm.spring.integration.akka.SagaDataExtension$SagaDataExt.stopSagaData(SagaDataExtension.java:69)
at org.apache.servicecomb.pack.alpha.fsm.SagaActor.beforeStop(SagaActor.java:387)
at org.apache.servicecomb.pack.alpha.fsm.SagaActor.lambda$new$31(SagaActor.java:325)
at akka.persistence.fsm.japi.pf.FSMStateFunctionBuilder$2.apply(FSMStateFunctionBuilder.java:84)
at akka.persistence.fsm.japi.pf.FSMStateFunctionBuilder$2.apply(FSMStateFunctionBuilder.java:81)
at akka.japi.pf.CaseStatement.apply(CaseStatements.scala:18)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.CaseStatement.applyOrElse(CaseStatements.scala:13)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
at akka.persistence.fsm.PersistentFSMBase.processEvent(PersistentFSMBase.scala:459)
at akka.persistence.fsm.PersistentFSMBase.processEvent$(PersistentFSMBase.scala:456)
at akka.persistence.fsm.AbstractPersistentFSMBase.processEvent(PersistentFSMBase.scala:639)
at akka.persistence.fsm.PersistentFSMBase.akka$persistence$fsm$PersistentFSMBase$$processMsg(PersistentFSMBase.scala:453)
at akka.persistence.fsm.PersistentFSMBase$$anonfun$receive$1.applyOrElse(PersistentFSMBase.scala:448)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.persistence.fsm.AbstractPersistentFSM.akka$persistence$Eventsourced$$super$aroundReceive(PersistentFSM.scala:441)
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:743)
at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:222)
at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:221)
at akka.persistence.fsm.AbstractPersistentFSM.aroundReceive(PersistentFSM.scala:441)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
at akka.actor.ActorCell.invoke(ActorCell.scala:557)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor : recovery SagaStartedEvent{timeout=10} {"@type":"SagaStartedEvent","serviceName":"order-server","instanceId":"order-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","createTime":1612851846898,"timeout":10,"type":"SagaStartedEvent"}
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor : recovery {"@type":"TxStartedEvent","serviceName":"account-server","instanceId":"account-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"dc5e50b7-260e-4f8a-9e42-ead972d6ab28","createTime":1612851846912,"compensationMethod":"public void com.ricstudio.account.service.AccountServiceImpl.compensate(java.lang.Long,java.math.BigDecimal)","payloads":"AQMJAgEAamF2YS5tYXRoLkJpZ0RlY2ltYewBAmQA","retryMethod":"","forwardRetries":0,"forwardTimeout":0,"reverseRetries":0,"reverseTimeout":0,"retryDelayInMilliseconds":5,"type":"TxStartedEvent"}
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor : recovery {"@type":"TxEndedEvent","serviceName":"account-server","instanceId":"account-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"dc5e50b7-260e-4f8a-9e42-ead972d6ab28","createTime":1612851846944,"type":"TxEndedEvent"}
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor : recovery {"@type":"TxStartedEvent","serviceName":"storage-server","instanceId":"storage-server-172.20.160.108","globalTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","parentTxId":"20fe99d8-1339-4bda-9a50-285b66eeab3f","localTxId":"2d815537-efa5-43ca-a2e4-3078679c8903","createTime":1612851846950,"compensationMethod":"public void com.ricstudio.storage.service.StorageServiceImpl.compensate(java.lang.Long,java.lang.Integer)","payloads":"AQMJAgIU","retryMethod":"public void com.ricstudio.storage.service.StorageServiceImpl.decrease(java.lang.Long,java.lang.Integer)","forwardRetries":3,"forwardTimeout":0,"reverseRetries":0,"reverseTimeout":0,"retryDelayInMilliseconds":5,"type":"TxStartedEvent"}
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor : recovery null
2021-02-09 14:24:36.871 INFO ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor : recovery completed [20fe99d8-1339-4bda-9a50-285b66eeab3f] state=SUSPENDED
2021-02-09 14:24:36.872 ERROR ruichadeMacBook.local --- [t-dispatcher-29] o.a.s.p.a.f.SagaActor : stop [20fe99d8-1339-4bda-9a50-285b66eeab3f] fail
2021-02-09 14:24:36.872 ERROR ruichadeMacBook.local --- [t-dispatcher-29] a.a.OneForOneStrategy : null
java.lang.NullPointerException
at org.apache.servicecomb.pack.alpha.core.fsm.repository.model.SagaSubTransaction$Builder.build(SagaSubTransaction.java:103)
at org.apache.servicecomb.pack.alpha.fsm.spring.integration.akka.SagaDataExtension$SagaDataExt.lambda$stopSagaData$0(SagaDataExtension.java:76)
at java.util.LinkedHashMap.forEach(LinkedHashMap.java:684)
at org.apache.servicecomb.pack.alpha.fsm.model.TxEntities.forEach(TxEntities.java:32)
at org.apache.servicecomb.pack.alpha.fsm.spring.integration.akka.SagaDataExtension$SagaDataExt.stopSagaData(SagaDataExtension.java:69)
at org.apache.servicecomb.pack.alpha.fsm.SagaActor.beforeStop(SagaActor.java:387)
at org.apache.servicecomb.pack.alpha.fsm.SagaActor.lambda$new$31(SagaActor.java:325)
at akka.persistence.fsm.japi.pf.FSMStateFunctionBuilder$2.apply(FSMStateFunctionBuilder.java:84)
at akka.persistence.fsm.japi.pf.FSMStateFunctionBuilder$2.apply(FSMStateFunctionBuilder.java:81)
at akka.japi.pf.CaseStatement.apply(CaseStatements.scala:18)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.CaseStatement.applyOrElse(CaseStatements.scala:13)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:168)
at akka.persistence.fsm.PersistentFSMBase.processEvent(PersistentFSMBase.scala:459)
at akka.persistence.fsm.PersistentFSMBase.processEvent$(PersistentFSMBase.scala:456)
at akka.persistence.fsm.AbstractPersistentFSMBase.processEvent(PersistentFSMBase.scala:639)
at akka.persistence.fsm.PersistentFSMBase.akka$persistence$fsm$PersistentFSMBase$$processMsg(PersistentFSMBase.scala:453)
at akka.persistence.fsm.PersistentFSMBase$$anonfun$receive$1.applyOrElse(PersistentFSMBase.scala:448)
at akka.actor.Actor.aroundReceive(Actor.scala:517)
at akka.actor.Actor.aroundReceive$(Actor.scala:515)
at akka.persistence.fsm.AbstractPersistentFSM.akka$persistence$Eventsourced$$super$aroundReceive(PersistentFSM.scala:441)
at akka.persistence.Eventsourced$$anon$1.stateReceive(Eventsourced.scala:743)
at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:222)
at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:221)
at akka.persistence.fsm.AbstractPersistentFSM.aroundReceive(PersistentFSM.scala:441)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:588)
at akka.actor.ActorCell.invoke(ActorCell.scala:557)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
at akka.dispatch.Mailbox.run(Mailbox.scala:225)
at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
The alpha UI show the suspended tx reaches 45.46K+...
My exploration of the Problem
I debug the alpha in my local env and make the break point according to the error stacktrace. Then I found what cause the probleam.
when alpha found the global tx is timed out, the global tx state will be set to SUSPEND
,and will stop the sagaActor.
before stopping, it will save the GlobalTransaction to the es. The NPE happens when building the subTransaction.
although the global tx is SUSPEND
, the sub tx is still active and its endtime is null, and then cause NPE here.
Thank you, I need some time to deal with this problem
This bug same as the JIRA issue SCB-2004 and has been fixed in #667
sincerely Thanks for the reply.
@coolbeevip But sorry, I didn’t figure out that these two are the same problem. Could you please explain more about it. And is the problem fixed in 0.0.7-SNAPSHOT
?
This bug same as the JIRA issue SCB-2004 and has been fixed in #667
sincerely Thanks for the reply.
@coolbeevip But sorry, I didn’t figure out that these two are the same problem. Could you please explain more about it. And is the problem fixed in0.0.7-SNAPSHOT
?
You can try build 0.7.0-SNAPSHOT with branch master
This bug same as the JIRA issue SCB-2004 and has been fixed in #667
sincerely Thanks for the reply.
@coolbeevip But sorry, I didn’t figure out that these two are the same problem. Could you please explain more about it. And is the problem fixed in0.0.7-SNAPSHOT
?You can try build 0.7.0-SNAPSHOT with branch master
@coolbeevip well, I tried 0.7.0-SNAPSHOT with branch master
, the problem still exists and the exception stack trace is the same.
This bug same as the JIRA issue SCB-2004 and has been fixed in #667
sincerely Thanks for the reply.
@coolbeevip But sorry, I didn’t figure out that these two are the same problem. Could you please explain more about it. And is the problem fixed in0.0.7-SNAPSHOT
?You can try build 0.7.0-SNAPSHOT with branch master
@coolbeevip well, I tried
0.7.0-SNAPSHOT with branch master
, the problem still exists and the exception stack trace is the same.
Hmm. I recently released docker https://hub.docker.com/repository/docker/coolbeevip/servicecomb-pack/tags?page=1&ordering=last_updated, can you try this again?
docker run \
-p 8080:8080 \
-p 8090:8090 \
-p 5432:5432 \
coolbeevip/servicecomb-pack:0.7.0-all-in-one