utPLSQL/utPLSQL-cli

Encoding of the output file

felipebz opened this issue · 22 comments

Hi!

We're starting to use utPLSQL-cli to export test execution reports for SonarQube, but we've found a bug when the output is saved to a file and contains "special" characters.

Since utPLSQL itself doesn't specify an XML declaration in the XML output, SonarQube expects a UTF-8 file, but the file written by utPLSQL-cli uses the default encoding (from the environment).
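
(For illustration only, not the actual cli code: this is the classic platform-default-charset trap in Java. The file names are made up.)

import java.io.FileOutputStream;
import java.io.FileWriter;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
  public static void main(String[] args) throws Exception {
    String report = "<failure>Acentuação</failure>";

    // Platform-dependent: a FileWriter without an explicit charset uses the
    // JVM default (windows-1252 on a pt-BR Windows box with older JDKs),
    // which a UTF-8 consumer such as SonarQube cannot decode.
    try (FileWriter out = new FileWriter("tests-default.xml")) {
      out.write(report);
    }

    // Explicit charset: the bytes are UTF-8 regardless of the environment.
    try (OutputStreamWriter out = new OutputStreamWriter(
        new FileOutputStream("tests-utf8.xml"), StandardCharsets.UTF_8)) {
      out.write(report);
    }
  }
}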

A simple test case:

create or replace package test_failure is
  --%suite
  
  --%test
  procedure should_fail;
end test_failure;
/
create or replace package body test_failure is
  procedure should_fail is
  begin
    ut.fail('Acentuação');
  end;
end test_failure;
/

I'm generating the report using: utplsql run <db info> -p=test_failure -f=ut_sonar_test_reporter -o=tests.xml

The output, as expected, is a file with this content:

<testExecutions version="1">
<file path="test_failure">
<testCase name="should_fail" duration="6" >
<failure message="some expectations have failed">
<![CDATA[
Acentuação
]]>
</failure>
</testCase>
</file>
</testExecutions>

Running a SonarQube analysis with -Dsonar.testExecutionReportPaths=tests.xml causes the exception:

ERROR: Error during SonarQube Scanner execution
Error during parsing of generic test execution report '<path>\tests.xml'. Look at the SonarQube documentation to know the expected XML format.
Caused by: com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 middle byte 0xe3 (at char #172, byte #-1)

Converting the file to UTF-8 or adding a <?xml version="1.0" encoding="windows-1252"?> declaration fixes the problem.
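
For anyone hitting the same wall, the conversion workaround can be scripted; a minimal sketch, assuming the file really was written as windows-1252:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class FixReportEncoding {
  public static void main(String[] args) throws Exception {
    Path report = Paths.get("tests.xml");
    // Decode with the encoding the cli actually used...
    String content = new String(Files.readAllBytes(report),
        Charset.forName("windows-1252"));
    // ...and write it back as the UTF-8 that SonarQube expects.
    Files.write(report, content.getBytes(StandardCharsets.UTF_8));
  }
}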

Some info about the environment:

  • utPLSQL-cli v3.1.0 is running on Windows 10, language pt-BR
  • utPLSQL v3.0.4.1372 on Oracle 11.2.0.4.0 (on Linux, but I think it doesn't matter)
  • NLS_CHARACTERSET is defined as WE8ISO8859P1

I'm not sure if this should be reported here (to generate the file using the UTF-8 encoding) or if it should be handled in the utPLSQL project (by including an XML declaration in the output). I could send a PR for this, but I wanted to make sure that you are aware of it first. ;-)

Thanks for reporting this issue.
Honestly, I'm not sure if the encoding of the content is defined at the DB or client (session) level.
As you say, this could be fixed at either of the two ends.
I'm just not sure which one would be more valuable and accurate.

Hi @jgebal.

I've made some tests and I think it's reasonably safe to assume that the Oracle driver will handle the database encoding correctly in all situations, as long as orai18n.jar is on the CLASSPATH (for compatibility with all charsets).

The Java String class documentation states that "A String represents a string in the UTF-16 format (...)". Therefore we could say that, regardless of the environment, a Java string is not affected by the database encoding, and we could use any encoding to write it to a file stream.
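
A tiny demo of that point, using the string from the test case above (illustrative only):

import java.nio.charset.StandardCharsets;

public class StringEncodingDemo {
  public static void main(String[] args) {
    // The in-memory String is UTF-16; a charset only matters when the
    // string is turned into bytes (or back).
    String s = "Acentuação";
    byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
    String roundTripped = new String(utf8, StandardCharsets.UTF_8);
    System.out.println(s.equals(roundTripped)); // true
  }
}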

Using UTF-8 to write the files seems like a safe option to me. I've sent PR #79 with this change. :-)

pesse commented

Thanks for the PR.
I also think the change should be done in the cli. I'm just not sure if we should output UTF-8 by default. By default, Java will use the encoding "normal" to the environment, which will be some Western ISO encoding for a European Windows version but something completely different for a Chinese one.

I don't like including this as a hard change because it might break someone else's current chain. I can imagine introducing a new command-line parameter to control the charset, or using environment variables.

Yes, I understand. A reporter parameter would be better, then. Something like -f=ut_sonar_test_reporter -o=output.xml -e=utf-8?

pesse commented

Yes. Maybe also support the charset part of the LC_ALL environment variable.
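
Something along these lines, perhaps; a sketch only, where parseCharsetFromLocale and the fallback to the JVM default charset are my assumptions, not existing cli code:

import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

public class LocaleCharset {
  // Hypothetical helper: extracts the codeset from a POSIX locale string
  // such as "pt_BR.UTF-8" or "de_DE.ISO-8859-15@euro".
  static Charset parseCharsetFromLocale(String locale) {
    if (locale != null && locale.contains(".")) {
      String codeset = locale.substring(locale.indexOf('.') + 1);
      int at = codeset.indexOf('@'); // strip a possible "@modifier" suffix
      if (at >= 0) {
        codeset = codeset.substring(0, at);
      }
      try {
        if (Charset.isSupported(codeset)) {
          return Charset.forName(codeset);
        }
      } catch (IllegalCharsetNameException ignored) {
        // fall through to the default
      }
    }
    return Charset.defaultCharset(); // assumed fallback
  }

  public static void main(String[] args) {
    System.out.println(parseCharsetFromLocale(System.getenv("LC_ALL")));
  }
}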

@jgebal what's your opinion on that?

It would be best if we can make it work for everyone without exposing additional complexity to the users.

Sonar expects UTF-8 encoding if no encoding is specified in the XML.
I think we can try a few things:

  1. Fix the XML reporters so that we put the session charset into the XML header.
  2. Convert everything to UTF-8 in the cli.
  3. Give control to the users.

With option 3 I would still prefer it to be needed only in "specific cases".

If option 1 actually works, it might make sense to use it.

pesse commented

Yes, you are right. We should first of all write the encoding used to the output header.
I still think giving users control over the output encoding in the CLI would be a benefit, not only for Sonar output.

@felipebz, @pesse
Have a look at the analysis I did in utPLSQL/utPLSQL#676

I think it would make sense to keep the cli reporter-agnostic and keep the logic in one place.
Maybe we should add an additional parameter, a_encoding, to the ut_runner.run procedure?
It could be passed from the cli and by default take its value from the client's NLS_LANG.

pesse commented

I tried a lot of things, but I guess I'll have to set up a database with this specific character set configuration so I can get a feeling for where in the chain the problem occurs.
The thing is: if the locales are set properly, JDBC will take care of the database -> Java (UTF-16) conversion.
It might even have more to do with the Java Locale, and we should use Charset.defaultCharset() to inject an encoding attribute into the XML.
The question is whether to add an additional parameter to ut_runner.run (logic in the database, but framework-dependent) or to inject it in the cli/java-api (logic not in the framework, but downwards-compatible).

pesse commented

@jgebal What do you think about the following approach:
I'll add handling for that in cli/java-api, injecting Charset.defaultCharset() into ut_sonar_test_reporter in the develop branch, and @felipebz can maybe check if it solves his problem? We can still decide to shift the logic into the database at a later point.

Oh, I keep forgetting that.
Adding a new parameter is backward compatible from the utPLSQL perspective, but it makes it hard to keep the java-api backward compatible.

If we go with a cli/api-only change, how will you know which reporters' outputs need to be modified (without hardcoding it)?

pesse commented

One idea:

If the output file ends with *.xml, the output will start with the following line:

<?xml version="1.0" encoding="Charset.defaultCharset()" ?>

We could even go so far as to check whether the output already starts with an XML tag and only add one if necessary.
That would also have the benefit of producing "valid" XML files.
For normal text files we have no way to indicate which encoding is used.
So the logic would be something like this:

if (hasOutputFile && outputFileName.endsWith(".xml")) {
  if (hasXmlTag) {
    if (xmlTagContainsEncoding)
      updateEncoding(Charset.defaultCharset());
    else
      addEncoding(Charset.defaultCharset());
  } else {
    addXmlTagWithEncoding(Charset.defaultCharset());
  }
}
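
In real Java that could look roughly like this; still a sketch, with the helpers from the pseudocode above collapsed into one hypothetical method and deliberately naive XML handling:

import java.nio.charset.Charset;

public class XmlDeclarationInjector {
  // Ensures the first line of the output carries an encoding declaration.
  static String withEncodingDeclaration(String firstLine, Charset charset) {
    if (firstLine.startsWith("<?xml")) {
      if (firstLine.contains("encoding=")) {
        // Replace whatever encoding is declared with the one actually used.
        return firstLine.replaceAll("encoding=\"[^\"]*\"",
            "encoding=\"" + charset.name() + "\"");
      }
      // Declaration present but no encoding: add one before the closing "?>".
      return firstLine.replace("?>", " encoding=\"" + charset.name() + "\"?>");
    }
    // No declaration at all: prepend one.
    return "<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n"
        + firstLine;
  }

  public static void main(String[] args) {
    System.out.println(withEncodingDeclaration(
        "<testExecutions version=\"1\">", Charset.defaultCharset()));
  }
}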

Nice.
Two notes:

  • How about checking whether the content is actually XML rather than relying on the user providing a valid extension? Would it be too heavy for large files (100 MB and more)?
  • Can we do the same for HTML?

pesse commented

How do you check that the content is XML? I wouldn't want to wait until the whole file is created; that would be a possible memory problem.
What we could check is whether the first line is one of these:

  • <?xml ...
  • <anything>

We might get a problem if content starts with a comment, though.
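
The first-line check itself would be cheap; a sketch (which also shows exactly that comment caveat):

import java.util.regex.Pattern;

public class XmlSniffer {
  // Matches "<?xml ..." or "<rootElement ...>" at the start of the first
  // line. A file starting with "<!-- comment -->" is NOT matched, which is
  // the caveat mentioned above.
  private static final Pattern XML_START =
      Pattern.compile("^\\s*<(\\?xml|[A-Za-z])");

  static boolean looksLikeXml(String firstLine) {
    return XML_START.matcher(firstLine).find();
  }

  public static void main(String[] args) {
    System.out.println(looksLikeXml("<?xml version=\"1.0\"?>"));        // true
    System.out.println(looksLikeXml("<testExecutions version=\"1\">")); // true
    System.out.println(looksLikeXml("<!-- a comment -->"));             // false
  }
}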

HTML is a bit more problematic because the charset information is not on the first line, as it is with XML, but in a <meta> tag somewhere inside the header, so some buffering would be needed.
It should be possible, though.

And how much effort is it to add a new parameter to ut_runner.run while keeping the api backward compatible?

pesse commented

I already introduced compatibility layers, so it shouldn't be a big problem. Do you prefer doing the logic inside the reporters themselves? It would be much clearer, yes.

pesse commented

Yes, after thinking about it, we should make the reporters produce valid output instead of injecting it in a post-processing step.
Creating a new compatibility API layer is not a problem. What's great is that it won't even need a change in the API towards cli/maven-plugin; it could all be done in the java-api. If the utPLSQL version is 3.1.2 or higher, just fill in the new parameter to ut_runner.run with Charset.defaultCharset() as the value.
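
On the java-api side that could look roughly like this; a sketch under assumptions only: the a_encoding name and the 3.1.2 gate come from this thread, the single-varchar2 a_path overload is a simplification, and the version comparison is naively lexicographic:

import java.nio.charset.Charset;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

public class RunnerCallSketch {
  static CallableStatement buildRunCall(Connection conn, String frameworkVersion,
                                        String path) throws SQLException {
    // Naive lexicographic check, good enough for a sketch.
    boolean supportsEncoding = frameworkVersion.compareTo("3.1.2") >= 0;
    String sql = "BEGIN ut_runner.run(a_path => ?"
        + (supportsEncoding ? ", a_encoding => ?" : "") + "); END;";
    CallableStatement call = conn.prepareCall(sql);
    call.setString(1, path);
    if (supportsEncoding) {
      // Pass the client's default charset so the reporter can declare it.
      call.setString(2, Charset.defaultCharset().name());
    }
    return call;
  }
}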

Lovely! So we have a plan :)

@jgebal @pesse 👍

I think utPLSQL/utPLSQL#676 (comment) will solve this issue.

pesse commented

@felipebz Can you verify it's working with develop-branch of utPLSQL framework and latest develop-binary of cli? (https://bintray.com/utplsql/utPLSQL-cli/download_file?file_path=utPLSQL-cli-develop-201806130715.zip)

Hi @pesse. It works. 👍

For the sake of completeness:

ut.version = v3.1.2.2016-develop
cli 3.1.1-SNAPSHOT.local
utPLSQL-java-api 3.1.1-SNAPSHOT.205

  • The file was correctly generated with <?xml version="1.0" encoding="WINDOWS-1252"?>
  • The SonarQube Scanner shows that the file was imported correctly:
(...)
INFO: Generic Test Executions Report
INFO: Parsing C:\source\tests.xml
INFO: Imported test execution data for 1 files
INFO: Sensor Generic Test Executions Report (done) | time=109ms
  • And I can see the test data in the SonarQube UI.

pesse commented

Great! Thanks for confirming, @felipebz!