nHapiNET/nHapi

Add CharSetUtil and PreParser for parsing MSH-18 charset before main parsing

jakubsuchybio opened this issue Β· 17 comments

Is your feature request related to a problem? Please describe.
I have ported our java HL7 service into C# few years back, but few features are still missing.
We will soon try to implement some of them. And one of them is parsing charset from the MSH-18

Describe the solution you'd like
In Hapi, there is CharSetUtil that can get charset from MSH
https://github.com/hapifhir/hapi-hl7v2/blob/master/hapi-base/src/main/java/ca/uhn/hl7v2/llp/CharSetUtil.java
but it uses PreParser to get the charset from MSH-18
https://github.com/hapifhir/hapi-hl7v2/blob/master/hapi-base/src/main/java/ca/uhn/hl7v2/preparser/PreParser.java

Describe alternatives you've considered
I might do it by myself, but I think this is part of Hapi, that should be implemented in nHapi, not in our code.

@jakubsuchybio if you do get time please feel free to create a PR as I agree this should be in nHapi since its port of hapi.

@jakubsuchybio I have started looking at this, but haven't made lots of progress, some of the PreParser dependencies are crazy to understand and port (without taking additional dependencies).

Its doable its just best not to be done in the same way as hapi.

@milkshakeuk
I just looked a bit deeper into this without any coding done yet, but here are my thoughts on this.
We could be the same as hapi (implementing preparser), BUT i don't think this is needed for nHapi.
The only reason for having preparser is to read MSH-18, so that we can read the charset and then parse the message with nHapi's parser in correct encoding.

So, to actually getting the charset we need these steps:

  1. Detect BOM (if we have it, great, if not, use ASCII for parsing MSH)
  2. Manually parse MSH-18 (we already have some implementation for this at our code from the port from java to C#)

Here is the parsing of MSH-18 (it is not polished, it wasn't touched in years πŸ˜“) it tries to get the Encoding from our identifier mapping, but this method could just return the string in MSH-18 without our proprietary mapping and that would just be it.

    public enum CharsetState
    {
        NotFound,
        NotPresent,
        Present
    }

    public static class PipeParserEx
    {
        public static Tuple<CharsetState, Encoding> GetCharsetFromMSH(string message, IList<IdentifierData> charsetIndetifierData)
        {
            var defaultEncoding = Encoding.ASCII;

            var locStartMSH = message.IndexOf("MSH");
            // MSH segment not found
            if (locStartMSH < 0)
                return new Tuple<CharsetState, Encoding>(CharsetState.NotFound, defaultEncoding);

            var endMSH = message.IndexOf('\r', locStartMSH + 1);
            var locEndMSH = endMSH;
            if (locEndMSH < 0)
                locEndMSH = message.Length;

            var mshString = message.Substring(locStartMSH, (locEndMSH) - (locStartMSH));

            // Not enough lenght for a message
            if (mshString.Length < 4)
                return new Tuple<CharsetState, Encoding>(CharsetState.NotFound, defaultEncoding);

            var fieldSep = mshString[3];
            var fields = PipeParser.Split(mshString, Convert.ToString(fieldSep));

            // Not enough length for getting charset
            if (fields.Length < 18)
                return new Tuple<CharsetState, Encoding>(endMSH < 0 ? CharsetState.NotFound : CharsetState.NotPresent, defaultEncoding);

            var encoding = HL7CharsetTools.GetCharsetFromHL7Name(fields[17], defaultEncoding, charsetIndetifierData);
            return new Tuple<CharsetState, Encoding>(CharsetState.Present, encoding);
        }
    }

And the BOM detection is quite easy. Just getting through the bom types, trying the first bytes that are equal with some bom and if so, then use that encoding otherwise use ASCII if nothing found.

I will try to implement this by the end of this week. Looking into this deeper made it more clear and easy.
What do you think?
Maybe there will be some use of PreParser too, but I don't think it is needed for this purpose. If there would be more use cases of it, then yes, but not for now.

According to the hapi docs the preparser is not just for reading MSH-18, that might be one use case but it allows you to pull any field from the message before parsing. This would be useful to have actually.

This is the comment above the class in hapi.

 * <p>Extracts specified fields from unparsed messages.  This class is a 
 * facade for the ER7 and XML classes.  Use it like this: </p>
 * 
 * <code>
 * String message = null; //... your ER7 or XML message string goes here
 * String[] fieldSpecs = {"MSH-9-1", "MSH-9-2", "MSH-12"};
 * String[] fields = PreParser.getFields(message, fieldSpecs);
 * </code>

Yes, I saw that and PreParser is a strong utility for sure.

But nHapi can't operate with xml yet right? So only the ER7 part of PreParser makes sense to be implemented. Or am I missing something?

I didn't look at the prerequisities of PreParser in Hapi, but from what I saw it only needs Tokenizer and some string manipulations like Split, which our Parser already has.

However this is also quite dangerous utility, because I'm not sure if hapi's implementation is correct actually. Because of the fact for which this issue was created. And that is to get charset (encoding) from the message before parsing. But when you are using PreParser for parsing something it should use the same principle. Parse the MSH with BOM (or ASCII if BOM is not present) and then parse the rest of the message with that encoding from MSH-18. AFAIK hapi doesn't even do this. So that's why I think hapi's implementation of it is wrong.

Needless to say, you would rather prefer to have full implementation of PreParser and not just "PreParser of MSH-18"?

I will start with the BOM part in the feature branch, then I will try to do the PreParser, but I'm kinda unmotivated to do the full PreParser when there is only one use case for it, which doesn't need it in full I think. 😭

Actually I think the correct way to implement PreParser is to do 2 different PreParsers. First that PreParses MSH and Second that PreParses anything else. The First one would use BOM detection or input Encoding. The Second one would use only input Encoding for PreParsing. What do you think?

If the PreParser could be ported (which it can, its just a little painful) then it would be able to do both right?

Well yes, the only difference is the encoding used for transforming byte[] into string, which is then PreParsed.
If the MSH is getting PreParsed it needs to use the encoding from BOM or ASCII.
If anything else is getting PreParsed it needs to use the encoding from MSH-18 or some predefined receiving encoding (from some settings)

It can be done in one function, because I can see the use case of getting multiple fields where some are from MSH and some are from other parts of the message.

I didn't see Hapi's implementation to take this into account.

from what I can tell it looks like hapi might do this.

@jakubsuchybio I'm getting somewhere with the XML PreParser now, might have cracked the hardest to port bit since Hapi (java) uses SAXParser to iterate the xml in a very strange way and dotnet doesn't do anything like that.

Given the following XML:

<ADT_A01 xmlns="urn:hl7-org:v2xml">
    <MSH>
        <MSH.1>|</MSH.1>
        <MSH.2>^~\&amp;</MSH.2>
        <MSH.3>
            <HD.1>x</HD.1>
        </MSH.3>
        <MSH.4>
            <HD.1>x</HD.1>
        </MSH.4>
        <MSH.5>
            <HD.1>x</HD.1>
        </MSH.5>
        <MSH.6>
            <HD.1>x</HD.1>
        </MSH.6>
        <MSH.7>
            <TS.1>199904140038</TS.1>
        </MSH.7>
        <MSH.9>
            <MSG.1>ADT</MSG.1>
            <MSG.2>A01</MSG.2>
        </MSH.9>
        <MSH.11>
            <PT.1>P</PT.1>
        </MSH.11>
        <MSH.12>
            <VID.1>2.4</VID.1>
        </MSH.12>
    </MSH>
    <PID>
        <PID.5>
            <XPN.1>
                <FN.1>Smith</FN.1>
                <FN.2>Booth</FN.2>
                <FN.3>Jones</FN.3>
            </XPN.1>
            <XPN.2>X</XPN.2>
            <XPN.3>Y</XPN.3>
        </PID.5>
    </PID>
    <NTE>
        <NTE.1>a</NTE.1>
        <NTE.3>one</NTE.3>
        <NTE.3>two</NTE.3>
        <NTE.3>three</NTE.3>
    </NTE>
    <NTE>
        <NTE.1>b</NTE.1>
        <NTE.3>four</NTE.3>
    </NTE>
</ADT_A01>

and this implementation for using XDocument and DatumPath to find values:

[TestCase("NTE(0)-1", "a")]
[TestCase("NTE(1)-1", "b")]
[TestCase("NTE-3(0)", "one")]
[TestCase("NTE-3(1)", "two")]
[TestCase("PID-5", "Smith")]
[TestCase("PID-5-1-1", "Smith")]
[TestCase("MSH-9-2", "A01")]
[TestCase("PID-5-1-2", "Booth")]
public void testnew2(string pathSpec, string expectedValue)
{
    var testString = "<?xml version=\"1.0\" standalone=\"no\"?><ADT_A01 xmlns=\"urn:hl7-org:v2xml\"><MSH><MSH.1>|</MSH.1><MSH.2>^~\\&amp;</MSH.2><MSH.3><HD.1>x</HD.1></MSH.3><MSH.4><HD.1>x</HD.1></MSH.4><MSH.5><HD.1>x</HD.1></MSH.5><MSH.6><HD.1>x</HD.1></MSH.6><MSH.7><TS.1>199904140038</TS.1></MSH.7><MSH.9><MSG.1>ADT</MSG.1><MSG.2>A01</MSG.2></MSH.9><MSH.11><PT.1>P</PT.1></MSH.11><MSH.12><VID.1>2.4</VID.1></MSH.12></MSH><PID><PID.5><XPN.1><FN.1>Smith</FN.1><FN.2>Booth</FN.2><FN.3>Jones</FN.3></XPN.1><XPN.2>X</XPN.2><XPN.3>Y</XPN.3></PID.5></PID><NTE><NTE.1>a</NTE.1><NTE.3>one</NTE.3><NTE.3>two</NTE.3><NTE.3>three</NTE.3></NTE><NTE><NTE.1>b</NTE.1><NTE.3>four</NTE.3></NTE></ADT_A01>";

    var document = XDocument.Parse(testString);
    var nameSpace = document.Root?.Name.Namespace;

    var datumPath = pathSpec.FromPathSpec();
    var segment = datumPath.Get(0) as string;

    var search = document.Root?.Elements(nameSpace + segment)
        .ElementAt((int)datumPath.Get(1));
    if (search.HasElements)
    {
        search = search.Elements()
            .Where(x => x.Name.LocalName.EndsWith($"{(int)datumPath.Get(2)}"))
            .ElementAt((int)datumPath.Get(3));
    }

    if (search.HasElements)
    {
        search = search.Elements()
            ?.SingleOrDefault(x => x.Name.LocalName.EndsWith($"{(int)datumPath.Get(4)}"));
    }

    if (search.HasElements)
    {
        search = search.Elements()
            ?.SingleOrDefault(x => x.Name.LocalName.EndsWith($"{(int)datumPath.Get(5)}"));
    }

    Assert.AreEqual(expectedValue, search?.Value);
}

image
I know this implementation is just in the unit test but I'll move it into the proper XML class.
Hopefully shouldn't be too hard to do the rest as I can mostly just port the hapi code.

@jakubsuchybio just doing some final prep for the pull request, 🀞this week it will be up.

@jakubsuchybio I just realised the methods CheckCharset in CharSetUtility in hapi are internal, I don't think they are meant to be accessed outside of the library, they are used as part of MLLP support (which nhapi doesn't have at the moment).

So I am not sure how useful they would be, how was your old java propgram doing this? I suppose you could use reflection

Well, we implement our own MLLP support in which we would use it from nHapi.

So maybe until MLLP support is in nHapi, maybe this CheckCharset could be public?

I looked into our old Java code and I found that we had our own reimplementation of LLP folder which was in hapi.base.
We were using hapi v2.2
I don't know the specifics of this reimplementation, but for example that CheckCharset was public there. So maybe that was one of the reasons.

However when I was migrating the old java code into C#. I didn't quite like the OOP from the old java project. So I did it as simple as possible reimplementing the LLP from scratch in C#. So it is quite different from it was. And I would use this CHeckCharset in this implementation.

If you would like, I could share our LLP implementation with you guys.
I don't know if you would like to use it for nHapi or if you would like the implementation based on hapi project.

But I tried to make it as asynchronous as possible, which I think java implementation don't have AFAIK.

@jakubsuchybio, yeah we could make it public for now at least and sure it might be worth having a look at other MLLP implementations, I know HL7Fuse has an implementation also.

@jakubsuchybio this change is in the latest 3.2.0 version.

@jakubsuchybio let me know if it does what you need it to, so you can finally finish the missing bits from your service.

I'm not quite sure when I will get some time slot at work to integrate these changes into our codebase. Hopefully in Q1 πŸ˜…. Thank you for your hard work. Will keep you posted when it is integrated and tested.