spring-projects/spring-batch

#{stepExecutionContext['']} value is the same among partitioned threads on batch restart/resubmit

revewo opened this issue · 0 comments

Bug description
When a batch has 3 partitions, we expect each partition to have a unique execution context, both on the first run and on a restart. Instead, on a restart (same job parameters), two different partitions receive the same execution context.

Environment
spring-boot-starter-parent - 3.2.0
Java 17

Steps to reproduce
I created a small POC that reproduces the issue. The steps are listed below.

Expected behavior
Given a batch with 3 partitions (each with a unique stepExecutionContext[''] value) where 2 partitions fail with runtime exceptions on the first run, the restart run must execute only those 2 failed partitions, and the stepExecutionContext[''] value must be unique between those 2 partitions.
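The expectation above can be written down as a small plain-Java model. This is only a sketch of the semantics we expect, not Spring Batch code: the class, record, and partition names (RestartExpectation, PartitionRecord, partition0..partition2) are hypothetical, and which two partitions fail is an assumption made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java model of the expected restart behavior.
// None of these types are Spring Batch classes; all names are hypothetical.
final class RestartExpectation {

    enum Status { COMPLETED, FAILED }

    // One partition as persisted after the first run:
    // the value stored in its own step execution context, and its final status.
    record PartitionRecord(String uniqueToPartition, Status status) {}

    // On restart, only FAILED partitions run again,
    // and each one sees the value that was stored for it on the first run.
    static Map<String, String> contextsToRerun(Map<String, PartitionRecord> firstRun) {
        Map<String, String> rerun = new LinkedHashMap<>();
        firstRun.forEach((name, rec) -> {
            if (rec.status() == Status.FAILED) {
                rerun.put(name, rec.uniqueToPartition());
            }
        });
        return rerun;
    }

    public static void main(String[] args) {
        // Assumption for illustration: the partitions holding value1 and value3 failed.
        Map<String, PartitionRecord> firstRun = new LinkedHashMap<>();
        firstRun.put("partition0", new PartitionRecord("value1", Status.FAILED));
        firstRun.put("partition1", new PartitionRecord("value2", Status.COMPLETED));
        firstRun.put("partition2", new PartitionRecord("value3", Status.FAILED));

        // Expected: two entries, each with its own distinct value.
        System.out.println(contextsToRerun(firstRun));
    }
}
```

In this model the restart yields two partitions with two different values. The behavior reported in this issue corresponds to both rerun partitions seeing the same value instead.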

Minimal Complete Reproducible example

POC available at https://github.com/revewo/spring-batch-partition-issue-poc.

Configuration required in POC to replicate the issue:

  1. In src/main/resources/application.properties, you can change db.url based on the type of operating system you are running this code on. The database file will be created on the first batch run. We need persistence to simulate the restart issue and hence we can't use an in-memory database.
  2. Comment or uncomment lines 27 to 29 in com.example.batchprocessing.ThirdTasklet to replicate the issue.

Details / Steps to reproduce the issue

  1. On a successful execution, everything runs fine (leave the 'RuntimeException' commented out inside com.example.batchprocessing.ThirdTasklet#execute and run the batch). You will notice that 'Inside ThirdTasklet. unique-to-partition is {}' is printed 3 times in the logs, once from each partition, each with a unique value for unique-to-partition.

Now, to simulate the issue:

  2. Suppose the batch hits a runtime problem such that the business logic in 2 of the 3 partitions fails because of a database issue. This database issue concerns our business tables and is unrelated to the Spring Batch tables. To simulate it, uncomment the code in com.example.batchprocessing.ThirdTasklet#execute that throws the exception, and provide a new runIdentifier in com.example.batchprocessing.BatchProcessingApplication#run when running the batch.

In this failed run, the log line 'Inside ThirdTasklet. unique-to-partition is {}' is printed to the console 3 times, and the 3 lines end with the unique values 1, 2, and 3, just like in the successful run in step 1 above.

  3. Assume our database expert has fixed the issue in the database. On resubmit, we need only the two failed partitions to resume, and the value of the key 'unique-to-partition' must be unique between those 2 partitions (our business logic requires each partition to provide this unique information to the database).
    So we resubmit the batch with the same runIdentifier that we provided in the failed run in step 2.
    The batch now completes, but the log line "Inside ThirdTasklet. unique-to-partition is {}" is printed twice, and both lines end with the same value (either 1 or 3) instead of two unique values.

These 2 partitions must have unique values for unique-to-partition on resubmit, just as they do on a run without any issue/exception. Because the values are not unique on resubmit, our business logic behaves differently during that run.

I have also attached sample logs in case you don't want to try out the example code.

On a run where the exception is thrown

2024-06-17T11:34:38.104+02:00  INFO 11452 --- [   TPTETHREAD-3] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value2
2024-06-17T11:34:39.399+02:00  INFO 11452 --- [   TPTETHREAD-2] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value3
2024-06-17T11:34:39.823+02:00  INFO 11452 --- [   TPTETHREAD-1] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value1

On the resubmit run (after fixing that exception)

2024-06-17T11:35:13.079+02:00  INFO 16936 --- [   TPTETHREAD-2] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value3
2024-06-17T11:35:13.487+02:00  INFO 16936 --- [   TPTETHREAD-1] c.example.batchprocessing.ThirdTasklet   : Inside ThirdTasklet. unique-to-partition is value3
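
For anyone comparing their own logs against the ones above, the duplicate can be checked mechanically. The following is a minimal sketch, assuming the log message format shown above; the class PartitionLogCheck and its methods are hypothetical helpers that are not part of the POC, and the sample lines are the message portions of the logs above with the timestamp and thread prefix omitted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of the POC): extracts the logged
// unique-to-partition values and reports whether they are all distinct.
final class PartitionLogCheck {

    // Matches the value printed after "unique-to-partition is ".
    private static final Pattern VALUE = Pattern.compile("unique-to-partition is (\\S+)");

    // Collects the logged value from every line that contains the message.
    static List<String> extractValues(List<String> logLines) {
        List<String> values = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = VALUE.matcher(line);
            if (m.find()) {
                values.add(m.group(1));
            }
        }
        return values;
    }

    // True when no value appears more than once.
    static boolean allUnique(List<String> values) {
        return new HashSet<>(values).size() == values.size();
    }

    public static void main(String[] args) {
        // Message portions of the log lines above (prefixes omitted).
        List<String> failedRun = List.of(
                "Inside ThirdTasklet. unique-to-partition is value2",
                "Inside ThirdTasklet. unique-to-partition is value3",
                "Inside ThirdTasklet. unique-to-partition is value1");
        List<String> resubmitRun = List.of(
                "Inside ThirdTasklet. unique-to-partition is value3",
                "Inside ThirdTasklet. unique-to-partition is value3");

        System.out.println("failed run unique: " + allUnique(extractValues(failedRun)));
        System.out.println("resubmit run unique: " + allUnique(extractValues(resubmitRun)));
    }
}
```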