afrozenpeach/CSharp_MARC

Field length calculation gives wrong results for some UTF8 encoded data

sdanisch opened this issue · 2 comments

For some MARC files containing multi-byte characters, the import produces errors and results in wrong field content (see 007-tit.mrc.txt, first record, Field 515), even though the data appears to be perfectly fine and can be imported/validated with other tools. The problem seems to occur in FileMARC.cs#L302, where the extra bytes for a field are calculated:

                    //Check if there are multi-byte characters in the string
                    System.Globalization.StringInfo stringInfo = new System.Globalization.StringInfo(tagData);
                    int extraBytes = fieldLength - stringInfo.LengthInTextElements;
                    int extraBytes2 = Encoding.UTF8.GetByteCount(tagData) - fieldLength;
                    int endOfFieldIndex = tagData.IndexOf(END_OF_FIELD);

                    if (tagData.Length - 1 != endOfFieldIndex)
                    {
                        int differenceLength = tagData.Length - 1 - endOfFieldIndex;

                        if (differenceLength != extraBytes && differenceLength != extraBytes2)
                        {
                            fieldLength -= differenceLength;
                            totalExtraBytesRead += differenceLength;
                            tagData = raw.Substring(fieldStart, endOfFieldIndex + 1);
                        }
                        else
                        {
                            if (extraBytes > 0)
                            {
                                fieldLength -= extraBytes;
                                totalExtraBytesRead += extraBytes;
                                tagData = raw.Substring(fieldStart, fieldLength);
                            }
                            else if (extraBytes2 > 0)
                            {
                                fieldLength -= extraBytes2;
                                totalExtraBytesRead += extraBytes2;
                                tagData = raw.Substring(fieldStart, fieldLength);
                            }
                        }
                    }
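For context, the two estimates measure different things: extraBytes compares the directory's field length against the number of text elements (grapheme clusters), while extraBytes2 compares the UTF8 byte count against the directory's field length. A minimal standalone sketch (the strings are illustrative, not taken from the attached test file) shows how the three length measures can diverge for the same visible text:

    using System;
    using System.Globalization;
    using System.Text;

    class LengthDemo
    {
        static void Main()
        {
            //The same visible text in two Unicode forms
            string precomposed = "caf\u00E9";   //'é' as a single code point
            string decomposed = "cafe\u0301";   //'e' followed by a combining acute accent

            foreach (string s in new[] { precomposed, decomposed })
            {
                Console.WriteLine("chars={0}, textElements={1}, utf8Bytes={2}",
                    s.Length,
                    new StringInfo(s).LengthInTextElements,
                    Encoding.UTF8.GetByteCount(s));
            }

            //Output:
            //chars=4, textElements=4, utf8Bytes=5
            //chars=5, textElements=4, utf8Bytes=6
        }
    }

Whenever these measures disagree, the two estimates disagree as well, and compensating with the wrong one truncates or overshoots the field.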

It seems as if extraBytes2 has the correct value to be used, but extraBytes is used instead. Unfortunately I don't really understand what is supposed to happen in this part of the code. For our internal import, changing the logic to use the larger of extraBytes and extraBytes2 seems to work and leads to the field in question being read correctly:

                    //Check if there are multi-byte characters in the string
                    System.Globalization.StringInfo stringInfo = new System.Globalization.StringInfo(tagData);
                    int extraBytes = fieldLength - stringInfo.LengthInTextElements;
                    int extraBytes2 = Encoding.UTF8.GetByteCount(tagData) - fieldLength;
                    extraBytes = extraBytes >= extraBytes2 ? extraBytes : extraBytes2;
                    int endOfFieldIndex = tagData.IndexOf(END_OF_FIELD);

                    if (tagData.Length - 1 != endOfFieldIndex)
                    {
                        int differenceLength = tagData.Length - 1 - endOfFieldIndex;

                        if (differenceLength != extraBytes)
                        {
                            fieldLength -= differenceLength;
                            totalExtraBytesRead += differenceLength;
                            tagData = raw.Substring(fieldStart, endOfFieldIndex + 1);
                        }
                        else
                        {
                            if (extraBytes > 0)
                            {
                                fieldLength -= extraBytes;
                                totalExtraBytesRead += extraBytes;
                                tagData = raw.Substring(fieldStart, fieldLength);
                            }                           
                        }
                    }
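As an aside, the ternary picking the larger value is just a hand-rolled maximum; the same line could be written as

    extraBytes = Math.Max(extraBytes, extraBytes2);

if that reads better.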

As I'm not quite sure what is supposed to happen here, I was hesitant to provide a pull request, but I will of course do so if this is a feasible solution. If you see what is wrong with this approach and know how it should be handled instead, I'm more than willing to implement that solution and provide a matching pull request.

It's been a really long time since I've gone over that parsing code. I suspect your solution is a good one, but I need to make sure I remember why I did it the way I did. I'm sure there was a reason for it.

Okay, I figured out what's going on here, and there are a few failure points.

  1. extraBytes vs. extraBytes2 covers various encoding issues. The library more or less assumes MARC8 and/or generic Windows encoding and tries its best to compensate when it can't detect a UTF8 file.
  2. The library was failing to properly detect a UTF8 file.
  3. Your test file's LEADER doesn't have the '8' flag in space 9 that tells the encoder to explicitly use UTF8 (see the sketch after this list).
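To illustrate point 3, here's a rough sketch of the kind of LEADER check involved (illustrative only, not the library's actual detection code):

    //The MARC LEADER is the first 24 characters of a record; space 9 holds
    //the character coding scheme. A blank there means the default (MARC8)
    //encoding, so only a non-blank flag marks the record as UTF8.
    static bool LeaderFlagsUTF8(string rawRecord)
    {
        return rawRecord.Length >= 24 && rawRecord[9] != ' ';
    }

Your file has a blank in that position, so the library fell back to its non-UTF8 compensation logic.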

I added some code to let you toggle UTF8 encoding, and to properly select extraBytes2 when UTF8 encoding is detected or requested.

FileMARCReader has a forceUTF8 optional parameter on its constructor you can use. Alternatively, if you're using FileMARC without the reader, you can set the ForceUTF8 property to true after calling the FileMARC constructor. This will force reading files as UTF8 even if it sees a ' ' character in space 9 of the LEADER.
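Usage would look something like this (the forceUTF8 parameter and ForceUTF8 property are as described above; the file name, the Add call, and the iteration pattern are guesses for the sake of the example, assuming the library's namespace is imported):

    //Option 1: force UTF8 through the reader's constructor
    using (FileMARCReader reader = new FileMARCReader("007-tit.mrc", true)) //forceUTF8: true
    {
        foreach (Record record in reader)
        {
            //records decode as UTF8 regardless of LEADER space 9
        }
    }

    //Option 2: force UTF8 on FileMARC directly
    FileMARC marc = new FileMARC();
    marc.ForceUTF8 = true;
    marc.Add(System.IO.File.ReadAllText("007-tit.mrc"));

    foreach (Record record in marc)
    {
        //...
    }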

You can see an example of this working in FileMARCReaderTest.cs in the "UTF8Multibytetest()" function.