Trivadis/plsql-formatter-settings

Wrong default file encoding on Windows

jgebal opened this issue · 10 comments

We store all source code as UTF8 files in our repo.
I'm using Windows 10 machine with git for windows installed.
I am using a standalone tvdformat.jar to run the formatting task.
The pre-commit hook works like a charm except for situation where a file contains special characters (like German accent letters).
In this case the letters are getting messed-up by the pre-commit hook.

As a workaround I've added a -Dfile.encoding=UTF-8 to the java -jar jvdformat.jar....
I'm not sure if this is the correct solution but it definitely helped in my case.
Probably a more sophisticated solution would be needed.

On a side note. The only thing I do not like about the pre-commit hook is that when you are using it, your changes are blended with formatter changes in one commit. Ideally I'd see two separate commits. One with manual changes and one with automated changes (formatter).

When reading a file the default encoding of the platform (OS) is used. In case of Windows 10 it is not UTF-8. In Germany it's probably Windows-1252. Nowadays on Windows we have files in UTF-8 character set and in the default Windows character set. Finding out which is which is quite difficult. The best approach I know is using the CharsetDetector of Apache Tika. It basically will use the character set with the highest probability, which is good enough.

It would probably make sense to extend the functionality of the format.js accordingly. However, this JavaScript is designed to work with SQLcl and the provided Java libraries. Apache Tika is not part of them. So, we need to find another solution to detect the character set. Maybe via java.nio.charset.CharsetDecoder. This should be doable with a list of character sets to try and using the platform character set if the character set cannot be identified.

In the meantime using -Dfile.encoding is probably the best option.

BTW: How do you deploy the code into the database? How is the character set identified there?

The following lines should be changed in format.js:

Both operation should use the same character set. The character set should be one of the following:

  • default (platform/OS specific as today
  • auto (detect character set based on content)
  • according parameter

This would be the most flexible solution. I'd go with "auto" as the new default.

On a side note. The only thing I do not like about the pre-commit hook is that when you are using it, your changes are blended with formatter changes in one commit. Ideally I'd see two separate commits. One with manual changes and one with automated changes (formatter).

Yes, if the formatter is applied the first time for a file, the complete file is formatted. This way you cannot distinguish between the original change and re-formatting of other unrelated code.

When you decide to use the Git hook in a project, I recommend formatting the complete code base with the formatter. You can do that with the configured settings by running the following in the root folder of your project (in GitBash on Windows):

.git/hooks/pre-commit .

Then you can check in the changes with the comment "formatted with the new formatter settings".

From that point on the scope of the committed changes should not be affected by the formatter.

We are a combination of SQLCL and Flyway do orchestrate deployments.
Now that I have checked I se that we always explicitly set the below before calling various operations from command line.

$env:java_tool_options="-Dfile.encoding=UTF-8 -Duser.language=en -Duser.country=EN"
chcp 65001

When you decide to use the Git hook in a project, I recommend formatting the complete code base with the formatter.

That approach also has 2 sides. but then your entire DB code is different than your version control.

Well, whatever you do, there will be some place of inconsistency unless you re-deploy entire DB sources.

I need to think and discuss which is worse :)

We are a combination of SQLCL and Flyway do orchestrate deployments. Now that I have checked I se that we always explicitly set the below before calling various operations from command line.

$env:java_tool_options="-Dfile.encoding=UTF-8 -Duser.language=en -Duser.country=EN"
chcp 65001

Thanks for the update, Jacek. I expected something like that. This just confirms that the format.js behaves as SQLcl. Hence I do not consider this a bug.

Nevertheless, I think it would be a good idea to try to detect the character set and add an option to override the default behavior (beside using -Dfile.encoding).

When you decide to use the Git hook in a project, I recommend formatting the complete code base with the formatter.

That approach also has 2 sides. but then your entire DB code is different than your version control.

Well, whatever you do, there will be some place of inconsistency unless you re-deploy entire DB sources.

I need to think and discuss which is worse :)

I understand. However, the code in the database and the deployment process is for sure out of scope. If you would like to ensure that there are two commits, one calling the formatter without the change and one for the formatted change then you need to implement that with an own shell script. The pre-commit Git hook is not suited for something like that. However, needing to run an own script will complicate things if you use Git from an IDE.

Maybe not using the formatter is the best option in your case.

I think the best case is to do as you suggested: Format entire repository in one commit. I only need to make sure there are no significant changes in code (as I saw with UTF8 files and non-UTF8 default console encoding).

Thank you Philipp

closed via #231

This commit is related: 7cb32ba