revelc/formatter-maven-plugin

On-the-fly hash before and after to replace current caching system files.

Opened this issue · 1 comment

Today, the caching system relies on creating a cache file. After spending a great deal of time with this over the last year or so, saving the cache across multiple platforms and working with it at scale, it is very apparent we don't need the cache file at all. It's in fact a negative feature, especially if you try to store it in source control as an override: it requires extra git attributes to ignore conflicts on the cache, and that is hard to control on cloud build systems that talk directly to git via pull requests (such as Bitbucket) due to limited permissions. That just results in lots of merge conflicts on the remote git instance.

Under the 'clean' usage that most people likely have in a default setup, the cache file would be empty anyway, so the plugin is always going to read the file, format it, hash it, then write it. If a stored cache is used cross-platform it ends up doubled, and nothing ever cleans out entries for removed files. Even in the case of keeping the cache only under target, it is only useful if 'clean' is not run. The cache file can grow rather large, is buffered in full into a Properties object, and is held for the entire duration of processing. At best the cache avoids formatting / hashing / writing, but the original performance problem was most likely the writing, not the formatting or hashing.

So the proposal is that we entirely deprecate the current file-based caching. Instead, the real way to handle this is rather straightforward and probably has no significant impact on performance; I suspect it improves it. Other plugins, such as spotless, take a similar approach from what I have read. The basic adjustment is that we 'read' the file, immediately 'hash' it and keep that hash in memory, 'format' the file, 'hash' the newly formatted content, and then, as already done today, check whether the hash changed before we decide to 'write' the file.

This amounts to zero reads of a cache file (which already abuses the expectations of the properties format by sorting entries and stripping timestamps). Instead, it adds one extra hashing call per file and keeps far less in memory. Today the cache file that gets read is held entirely in memory, which could be a problem; with this change we hold only two short-lived hashes and never read a properties file.
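To make the proposed flow concrete, here is a minimal sketch of it in Java. This is not the plugin's actual code: the formatSourceCode() hook, the SHA-512 choice, and the encoding handling are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

/**
 * Minimal sketch of the proposed read -> hash -> format -> hash -> write flow.
 * Not the plugin's actual code: formatSourceCode(), the SHA-512 choice, and
 * the encoding handling are illustrative assumptions.
 */
abstract class HashBeforeAfterSketch {

    /** Stand-in for the plugin's real formatter invocation. */
    protected abstract String formatSourceCode(String source);

    void formatFile(Path file, Charset encoding) throws IOException {
        // Read the original file once.
        String original = new String(Files.readAllBytes(file), encoding);

        // Hash it immediately; only this short-lived string is kept.
        String originalHash = hash(original);

        // Always format; there is no cache lookup, so config changes are never masked.
        String formatted = formatSourceCode(original);

        // Hash the formatted result.
        String formattedHash = hash(formatted);

        // Write only when the content actually changed.
        if (!originalHash.equals(formattedHash)) {
            Files.write(file, formatted.getBytes(encoding));
        }
    }

    private static String hash(String content) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-512");
            return Base64.getEncoder()
                    .encodeToString(digest.digest(content.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-512 unavailable", e);
        }
    }
}
```

The point of the sketch is that the only per-file state is the two hash strings; nothing persists across files or across builds.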

I'm assigning this ticket to myself and expect to do it before year end. The expectation here is to discuss the proposal and any edge cases I'm not thinking of. We could deprecate the old way if needed, but I feel this will be a drop-in replacement and only a benefit, and it should mostly turn off the old way; we could play that by ear by adding the new behavior and leaving the old one in place. I don't think we should be cleaning up any cache files that users may have saved off, either; rather, if we detect the old configuration while the new behavior is in use, we may want to emit a warning that the old configuration was detected and can be removed. I further think this method of attack is more portable to other plugins like impsort, license, and many others, without introducing further abuse of a properties file that likely doesn't really save any time.

Now, all that said, here is the biggest kicker: we are already doing essentially all of this while saving little. We probably only involve a file at all because it was there from the original days, when the cache was found not to be working properly at all.

See FormatterMojo#doFormatFile for the actual logic; it reads roughly like this (a simplified sketch follows the list):

  • The property file of hashes is passed in (read in full, and even doubled if multi-platform).
  • Read the original file.
  • Hash the original file (what I'm proposing simply removes the need for a cache storage file).
  • At this point we claim that if the computed hash is already in the cache file, we don't need to format. There is already an open ticket about this being a problem (#453), and I've seen it far too many times myself: if the configuration changed, how would we really know? We didn't even try to format yet, and we know formatting isn't that hard of a hit historically. The file would then be dropped from processing, which is a bit confusing to read as it stands (no comments).
  • Now we format the file, which may include removing trailing line endings and, in some cases, making sure the last line is empty as git prefers.
  • If the file was not determined to be skipped, we now hash again.
  • If the hashes don't match, we write the file.
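For contrast, here is a rough paraphrase of the cache-file flow described in the bullets above. It is not the real doFormatFile code; hashCache, formatSourceCode(), and hash() are illustrative stand-ins, and cache bookkeeping is omitted.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/**
 * Rough paraphrase of the current cache-file flow described above.
 * Not the actual FormatterMojo code; names are illustrative stand-ins.
 */
abstract class CacheFileFlowSketch {

    /** In-memory copy of the cache properties file, held for the whole run. */
    protected Properties hashCache;

    protected abstract String formatSourceCode(String source);

    protected abstract String hash(String content);

    void formatFile(Path file, Charset encoding) throws IOException {
        // Read the original file and hash it.
        String original = new String(Files.readAllBytes(file), encoding);
        String originalHash = hash(original);

        // Cache hit: skip formatting entirely. This is the step that can mask
        // configuration changes (#453), because the formatter never even runs.
        if (originalHash.equals(hashCache.getProperty(file.toString()))) {
            return;
        }

        // Format (may also normalize line endings / trailing newline),
        // hash again, and write only if the content changed.
        String formatted = formatSourceCode(original);
        String formattedHash = hash(formatted);
        if (!originalHash.equals(formattedHash)) {
            Files.write(file, formatted.getBytes(encoding));
        }
    }
}
```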

The cost here has always been writing the file. So the tradeoff would be that we run the formatter every time. It goes to say that configuration changes are then never an issue: when someone changes the config, there is no cache to invalidate, it just does its thing. If other plugins make other changes to files, there is still potential for further runs that formatting needs to address, but otherwise this should mostly be seen as a no-op.

Considerations

  • Would need a large repo to see what kind of impact, if any, this would have.

I plan to try to code up the solution sometime in the last week of the year. Depending on the thoughts here, I'll base that work on what we decide regarding deprecation and so on. If we deprecate, there will be some support overhead for a while, but once the old path is removed I think support will be easier.

If the hypothesis is correct that reading/writing the files is the slow part and that always formatting isn't harmful to performance, then this seems like a good improvement.

I think it could still be useful to allow forcing writing formatted files, regardless of whether the before/after hashes match... for the extremely rare cases where there is a hash collision or the user wants to be absolutely sure things were formatted. Such a flag could even be a hidden flag, for development purposes... like using a Java system property rather than something set in the Maven configuration.
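As a rough sketch of what such a hidden flag could look like, extending the write step from the earlier sketch: the formatter.forceWrite property name is purely hypothetical, not an existing option.

```java
// Hypothetical developer-only escape hatch; the property name is illustrative,
// not an existing plugin option. Boolean.getBoolean(...) returns true only when
// e.g. -Dformatter.forceWrite=true is passed, so normal builds are unaffected.
boolean forceWrite = Boolean.getBoolean("formatter.forceWrite");

// In the write step of the earlier sketch:
if (forceWrite || !originalHash.equals(formattedHash)) {
    Files.write(file, formatted.getBytes(encoding));
}
```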

I would be curious to know what the impact would be on a large build, for both the already-formatted case, and the not-formatted case.

If this change works better, the old cache-file option could be not just deprecated, but turned into a no-op to be removed later.