JSON files are randomly not well formatted when a multi-threaded step is used
glelarge opened this issue · 9 comments
Bug description
Spring Batch offers the possibility to configure a chunk-oriented step as multi-threaded by configuring a TaskExecutor in the Step configuration.
When configured as multi-threaded, the produced JSON is sometimes not correctly formatted.
Detected with Spring Batch 5.1.2 and Spring Framework 6.1.13.
Wrong result
Sometimes the produced JSON is misformatted:
[
,
{"code":10002,"ref":"B2C3D4E5F6G7","type":11,"nature":6,"etat":2,"ref2":"B2C3D4E5F6G7"} {"code":10001,"ref":"A1B2C3D4E5F6","type":10,"nature":5,"etat":1,"ref2":"A1B2C3D4E5F6"},
{"code":10003,"ref":"C3D4E5F6G7H8","type":12,"nature":7,"etat":3,"ref2":"C3D4E5F6G7H8"},
{"code":10004,"ref":"D4E5F6G7H8I9","type":13,"nature":8,"etat":4,"ref2":"D4E5F6G7H8I9"}
]
Expected result
[
{"code":10001,"ref":"A1B2C3D4E5F6","type":10,"nature":5,"etat":1,"ref2":"A1B2C3D4E5F6"},
{"code":10002,"ref":"B2C3D4E5F6G7","type":11,"nature":6,"etat":2,"ref2":"B2C3D4E5F6G7"},
{"code":10003,"ref":"C3D4E5F6G7H8","type":12,"nature":7,"etat":3,"ref2":"C3D4E5F6G7H8"},
{"code":10004,"ref":"D4E5F6G7H8I9","type":13,"nature":8,"etat":4,"ref2":"D4E5F6G7H8I9"}
]
When the file is not well formatted, the error is always the same:
- a single comma appears on the first line
- the second line contains two JSON records without a separator
Minimal Complete Reproducible example
https://github.com/glelarge/spring-batch-json-multithread
git clone https://github.com/glelarge/spring-batch-json-multithread
cd spring-batch-json-multithread
mvn package
./run.sh
If the issue is related to the configuration, would it be possible to update the repo to fix the configuration?
Thank you for raising this issue. This is actually due to the fact that JsonFileItemWriter is not thread-safe (mentioned in its javadoc), so it is incorrect to use it as is in a multi-threaded step.
What you can do is wrap it in a SynchronizedItemStreamWriter. This decorator is useful when using a non thread-safe item writer in a multi-threaded step (typical delegate examples are the JsonFileItemWriter and StaxEventItemWriter). This is mentioned in the javadoc of the decorator as well.
Here is an example: https://gist.github.com/fmbenhassine/2ed050f69d5472c6857ab9690e3715ab. The example is for the xml writer, but the same idea applies to the json writer as well.
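For the JSON case, a minimal sketch of this wrapping; jsonFileItemWriter is assumed to be an existing (non thread-safe) JsonFileItemWriter, and MyData is a placeholder item type:

    import org.springframework.batch.item.support.SynchronizedItemStreamWriter;
    import org.springframework.batch.item.support.builder.SynchronizedItemStreamWriterBuilder;

    // Wrap the non thread-safe delegate; write() calls are then
    // serialized across the worker threads of the step.
    SynchronizedItemStreamWriter<MyData> writer =
            new SynchronizedItemStreamWriterBuilder<MyData>()
                    .delegate(jsonFileItemWriter)
                    .build();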
Can you try this and share your feedback? TIA
Thank you for your suggestion; as far as I can tell, I already use the SynchronizedItemStreamWriter in the writer definition:
https://github.com/glelarge/spring-batch-json-multithread/blob/79a4e9796e510b105774d0c7c01e08a8cc0b66ed/src/main/java/org/springframework/batch/MyBatchJobConfiguration.java#L139C1-L187C6
I was also using the stream() method, but I just removed it as I don't know exactly how to use it or what its impact is.
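For context, a hedged sketch of what the stream() registration does in the fluent step builder; the names here (buildStep, myStep, reader, synchronizedWriter, delegateWriter) are placeholders, not code from the repository. Note that SynchronizedItemStreamWriter forwards the ItemStream callbacks (open/update/close) to its delegate, which may be why removing stream() had no visible effect:

    import org.springframework.batch.core.Step;
    import org.springframework.batch.core.repository.JobRepository;
    import org.springframework.batch.core.step.builder.StepBuilder;
    import org.springframework.batch.item.ItemReader;
    import org.springframework.batch.item.ItemStream;
    import org.springframework.batch.item.ItemStreamWriter;
    import org.springframework.core.task.TaskExecutor;
    import org.springframework.transaction.PlatformTransactionManager;

    Step buildStep(JobRepository jobRepository, PlatformTransactionManager transactionManager,
            ItemReader<MyData> reader, ItemStreamWriter<MyData> synchronizedWriter,
            ItemStream delegateWriter, TaskExecutor taskExecutor) {
        return new StepBuilder("myStep", jobRepository)
                .<MyData, MyData>chunk(4, transactionManager)
                .reader(reader)
                .writer(synchronizedWriter)
                // Registers the delegate so its open/update/close callbacks run;
                // this is often redundant when the synchronized wrapper itself is
                // the step's writer, since the wrapper forwards those calls.
                .stream(delegateWriter)
                .taskExecutor(taskExecutor)
                .build();
    }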
My test uses a chunk size of 1 and only 2 threads in parallel.
I have updated the chunk size to 4 to see the difference between writing a chunk and writing an item.
So now the result is
[
,
{"code":10002,"ref":"B2C3D4E5F6G7","type":11,"nature":6,"etat":2,"ref2":"B2C3D4E5F6G7"},
{"code":10004,"ref":"D4E5F6G7H8I9","type":13,"nature":8,"etat":4,"ref2":"D4E5F6G7H8I9"},
{"code":10006,"ref":"F6G7H8I9J0K1","type":15,"nature":10,"etat":6,"ref2":"F6G7H8I9J0K1"},
{"code":10008,"ref":"H8I9J0K1L2M3","type":17,"nature":12,"etat":8,"ref2":"H8I9J0K1L2M3"} {"code":10001,"ref":"A1B2C3D4E5F6","type":10,"nature":5,"etat":1,"ref2":"A1B2C3D4E5F6"},
{"code":10003,"ref":"C3D4E5F6G7H8","type":12,"nature":7,"etat":3,"ref2":"C3D4E5F6G7H8"},
{"code":10005,"ref":"E5F6G7H8I9J0","type":14,"nature":9,"etat":5,"ref2":"E5F6G7H8I9J0"},
{"code":10007,"ref":"G7H8I9J0K1L2","type":16,"nature":11,"etat":7,"ref2":"G7H8I9J0K1L2"},
{"code":10009,"ref":"I9J0K1L2M3N4","type":18,"nature":13,"etat":9,"ref2":"I9J0K1L2M3N4"},
{"code":10010,"ref":"J0K1L2M3N4O5","type":19,"nature":14,"etat":10,"ref2":"J0K1L2M3N4O5"},
{"code":10011,"ref":"K1L2M3N4O5P6","type":20,"nature":15,"etat":11,"ref2":"K1L2M3N4O5P6"},
{"code":10012,"ref":"L2M3N4O5P6Q7","type":21,"nature":16,"etat":12,"ref2":"L2M3N4O5P6Q7"},
{"code":10013,"ref":"M3N4O5P6Q7R8","type":22,"nature":17,"etat":13,"ref2":"M3N4O5P6Q7R8"},
{"code":10014,"ref":"N4O5P6Q7R8S9","type":23,"nature":18,"etat":14,"ref2":"N4O5P6Q7R8S9"},
{"code":10015,"ref":"O5P6Q7R8S9T0","type":24,"nature":19,"etat":15,"ref2":"O5P6Q7R8S9T0"},
{"code":10016,"ref":"P6Q7R8S9T0U1","type":25,"nature":20,"etat":16,"ref2":"P6Q7R8S9T0U1"},
{"code":10017,"ref":"Q7R8S9T0U1V2","type":26,"nature":21,"etat":17,"ref2":"Q7R8S9T0U1V2"},
{"code":10018,"ref":"R8S9T0U1V2W3","type":27,"nature":22,"etat":18,"ref2":"R8S9T0U1V2W3"},
{"code":10019,"ref":"S9T0U1V2W3X4","type":28,"nature":23,"etat":19,"ref2":"S9T0U1V2W3X4"}
]
where it should be
[
{"code":10001,"ref":"A1B2C3D4E5F6","type":10,"nature":5,"etat":1,"ref2":"A1B2C3D4E5F6"},
{"code":10003,"ref":"C3D4E5F6G7H8","type":12,"nature":7,"etat":3,"ref2":"C3D4E5F6G7H8"},
{"code":10005,"ref":"E5F6G7H8I9J0","type":14,"nature":9,"etat":5,"ref2":"E5F6G7H8I9J0"},
{"code":10007,"ref":"G7H8I9J0K1L2","type":16,"nature":11,"etat":7,"ref2":"G7H8I9J0K1L2"},
{"code":10002,"ref":"B2C3D4E5F6G7","type":11,"nature":6,"etat":2,"ref2":"B2C3D4E5F6G7"},
{"code":10004,"ref":"D4E5F6G7H8I9","type":13,"nature":8,"etat":4,"ref2":"D4E5F6G7H8I9"},
{"code":10006,"ref":"F6G7H8I9J0K1","type":15,"nature":10,"etat":6,"ref2":"F6G7H8I9J0K1"},
{"code":10008,"ref":"H8I9J0K1L2M3","type":17,"nature":12,"etat":8,"ref2":"H8I9J0K1L2M3"},
{"code":10009,"ref":"I9J0K1L2M3N4","type":18,"nature":13,"etat":9,"ref2":"I9J0K1L2M3N4"},
{"code":10010,"ref":"J0K1L2M3N4O5","type":19,"nature":14,"etat":10,"ref2":"J0K1L2M3N4O5"},
{"code":10011,"ref":"K1L2M3N4O5P6","type":20,"nature":15,"etat":11,"ref2":"K1L2M3N4O5P6"},
{"code":10012,"ref":"L2M3N4O5P6Q7","type":21,"nature":16,"etat":12,"ref2":"L2M3N4O5P6Q7"},
{"code":10013,"ref":"M3N4O5P6Q7R8","type":22,"nature":17,"etat":13,"ref2":"M3N4O5P6Q7R8"},
{"code":10014,"ref":"N4O5P6Q7R8S9","type":23,"nature":18,"etat":14,"ref2":"N4O5P6Q7R8S9"},
{"code":10015,"ref":"O5P6Q7R8S9T0","type":24,"nature":19,"etat":15,"ref2":"O5P6Q7R8S9T0"},
{"code":10016,"ref":"P6Q7R8S9T0U1","type":25,"nature":20,"etat":16,"ref2":"P6Q7R8S9T0U1"},
{"code":10017,"ref":"Q7R8S9T0U1V2","type":26,"nature":21,"etat":17,"ref2":"Q7R8S9T0U1V2"},
{"code":10018,"ref":"R8S9T0U1V2W3","type":27,"nature":22,"etat":18,"ref2":"R8S9T0U1V2W3"},
{"code":10019,"ref":"S9T0U1V2W3X4","type":28,"nature":23,"etat":19,"ref2":"S9T0U1V2W3X4"}
]
So my opinion is that the first thread must race to be the first to acquire the semaphore in TaskletStep.
As the JSON format has a separator, and this separator is written in front of each item, I suspect an issue between the synchronization and the semaphore acquisition, at least for the first chunk.
Other formats (XML, CSV, flat files) do not have a separator, so it's enough to write the records regardless of their order.
With a separator, either the first or the last record must be handled carefully to decide whether or not to add the separator.
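To illustrate the suspected race, a simplified sketch of separator handling of this kind; this is illustrative pseudologic, not the actual JsonFileItemWriter code:

    // Simplified illustration of the race, not the actual writer code.
    class RacyJsonAppender {
        private final StringBuilder buffer = new StringBuilder();
        private long linesWritten = 0;

        // Not synchronized on purpose: two threads can both observe
        // linesWritten == 0, both skip the separator, and the first
        // comma/record pair ends up scrambled in the output.
        void append(String jsonRecord) {
            if (linesWritten > 0) {
                buffer.append(",\n"); // separator before every record except the first
            }
            buffer.append(jsonRecord);
            linesWritten++;
        }
    }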
I am starting to think about a way to identify the thread that writes the first chunk; it should be possible based on state.getLinesWritten() and a new chunk attribute indicating that this is the first chunk.
Oh sorry, I did not check the code you shared, my bad.
In that case, I need to dig deeper. I will add a comment here when done.
To "dig deeper", you're sooooo right !
After having thought about multiple possibilities (in fact more workarounds than fixes), I just pushed a PR with a minimal impact to enforce the first chunk to be processed by the first thread.
Based on a CountDownLatch
, other threads are created after the first thread is completed.
An improvement could be to make this latch optional (by default set to 0 to deactivate it) as this is not necessary for output formats without record separator (XML, CSV).
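For readers, a rough sketch of the gating idea; this is not the actual PR code, and the class and field names are made up. It assumes that, in a multi-threaded TaskletStep, each task submitted to the executor processes roughly one chunk, so releasing the latch when the first task finishes lets the remaining workers start after the first record has been written:

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicBoolean;
    import org.springframework.core.task.TaskExecutor;

    // Hypothetical decorator, only to illustrate the latch idea.
    public class FirstChunkGatingTaskExecutor implements TaskExecutor {

        private final TaskExecutor delegate;
        private final CountDownLatch firstChunkDone = new CountDownLatch(1);
        private final AtomicBoolean firstTaskSeen = new AtomicBoolean(false);

        public FirstChunkGatingTaskExecutor(TaskExecutor delegate) {
            this.delegate = delegate;
        }

        @Override
        public void execute(Runnable task) {
            if (firstTaskSeen.compareAndSet(false, true)) {
                // First task: process the first chunk, then release the others.
                delegate.execute(() -> {
                    try {
                        task.run();
                    } finally {
                        firstChunkDone.countDown();
                    }
                });
            } else {
                // Later tasks: wait until the first chunk is fully written.
                delegate.execute(() -> {
                    try {
                        firstChunkDone.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return;
                    }
                    task.run();
                });
            }
        }
    }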
To "dig deeper", you're sooooo right !
๐คฃ Welcome to the deepest rabbit hole of Spring Batch ๐
> After having thought about multiple possibilities (in fact more workarounds than fixes), I just pushed a PR with a minimal impact: enforcing the first chunk to be processed by the first thread.
I really appreciate the time and effort you put into opening a workaround PR, but I just wanted to let you know that we are working on a complete rewrite of the implementation, see #3950. We have patched things around for years and this has to stop at some point. For this reason, we are not going to invest time in working around issues anymore. Rather, we are focusing on testing the initial prototype of the new implementation. If you want to help, I would appreciate it if you gave that a try and shared your feedback.
Oh no, I was hoping for at least a 5.1.3 or 5.2.
We are starting a project to develop more than 50 batch jobs requiring multi-threading and wanted to rely on official releases, but OK, we'll embed the patch in our sources to fix the JSON writing issue.
The execution path is so hard to follow, as you unfolded in #3950, that I can only understand and approve the need for a rework; sometimes it's better to restart from a blank page.
I have started to read the new implementation, will try to test it when I have free time on the project, and will come back to you with feedback, and probably questions, notably about the differences between ChunkOrientedStep and ConcurrentChunkOrientedStep (but here is not the place).
While I agree with much of the assessment of the current implementation of multi-threaded steps, I don't think it's actually at fault here.
@glelarge I think your analysis of the problem is correct: the root cause is that the TransactionAwareBufferedWriter underneath the writer lacks synchronization of the delayed writes between the threads. And the SynchronizedItemStreamWriter cannot address this, as the transactions around the write calls of the writer are wider than the scope of its synchronization.
But I believe the solution is to just disable the TransactionAwareBufferedWriter like this:

    new JsonFileItemWriterBuilder<MyData>()
            ...
            .transactional(false)
            .build();
The writer will then flush within the scope of the synchronization of SynchronizedItemStreamWriter, and everything should work as expected. The boolean flag forceSync can be set to either value and doesn't have an impact on the issue at hand.
Disabling the TransactionAwareBufferedWriter for a multi-threaded writer is fine in my opinion, as it's only beneficial for the restartability of jobs. But multi-threaded steps lack restartability in general, so nothing is lost by disabling it.
@hpoettker Thank you!
I had seen the transactional flag in the TransactionAwareBufferedWriter but did not pull the thread all the way to this setting in JsonFileItemWriterBuilder, and did not see nor imagine this kind of setting in the API.
With transactional set to false, the JSON files are correctly generated.
Thank you for the feedback @glelarge, and thank you for the follow-up on this @hpoettker!
Based on the last comments, I am closing this issue and the corresponding PR.