indexdata/yaz

yaz-marcdump silently creates MARC-records with an invalid directory

toKrause opened this issue · 2 comments

Hi,

I've just stumbled upon this potential issue.

If a MARC-XML record contains a controlfield or a datafield whose tag attribute value has fewer than three characters, yaz-marcdump will successfully create a MARC record, but the corresponding directory entry will also be too short. This causes problems further down the line, because MARC parsers, including yaz-marcdump itself, cannot handle these records.

Interestingly, if the tag attribute value is longer than three characters, yaz-marcdump repairs the record by truncating the value.

Here is an example MARC-XML-collection:

<?xml version="1.0"?>
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:datafield tag="98765" ind1=" " ind2=" ">
      <marc:subfield code="a">Test</marc:subfield>
    </marc:datafield>
  </marc:record>
  <marc:record>
    <marc:datafield tag="987" ind1=" " ind2=" ">
      <marc:subfield code="a">Test</marc:subfield>
    </marc:datafield>
  </marc:record>
  <marc:record>
    <marc:datafield tag="9" ind1=" " ind2=" ">
      <marc:subfield code="a">Test</marc:subfield>
    </marc:datafield>
  </marc:record>
</marc:collection>

The first record has a tag that is too long, 98765, which gets truncated to 987. The third record has a tag that is too short, 9, which is used as is. The resulting directory entry is too short: 10 characters instead of the required 12.
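To make the arithmetic concrete, here is a minimal Python sketch of how a directory entry is laid out: a 3-character tag, a 4-digit field length, and a 5-digit starting position. The function name `directory_entry` is hypothetical and this is not YAZ's actual code; it only illustrates what happens when the tag is concatenated without a length check.

```python
def directory_entry(tag: str, field_length: int, start: int) -> str:
    """Build a directory entry by plain concatenation (no tag validation).

    Hypothetical helper for illustration only, not taken from YAZ.
    """
    return f"{tag}{field_length:04d}{start:05d}"


# The field from the example: two blank indicators, subfield delimiter (0x1f),
# subfield code "a", the value "Test", and the field terminator (0x1e).
field = b"  \x1faTest\x1e"

for tag in ("987", "9"):
    entry = directory_entry(tag, len(field), 0)
    print(tag, entry, len(entry))
```

For the tag 987 this yields the 12-character entry 987000900000, and for the tag 9 it yields the 10-character entry 9000900000. Both match the directory entries visible in the output of the example.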

$ yaz-marcdump -i marcxml -o marc example.xml | sed 's/\x1d/\x1d\n/g'
00047nam a22000370a 4500987000900000�  �aTest��
00047nam a22000370a 4500987000900000�  �aTest��
00045nam a22000350a 45009000900000�  �aTest��

In both cases no warning or error is produced by yaz-marcdump. Personally, I'd prefer that yaz-marcdump reject such records in both cases (similarly to records with elements that are not in the expected order).
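Until such records are rejected at the source, a downstream consumer could detect them itself. The following Python sketch checks whether the directory of a binary MARC record is a whole number of 12-character entries, using the base address of data from leader positions 12 to 16. The function name `directory_is_well_formed` is hypothetical, and the two sample records are rebuilt by hand from the output shown above, assuming the unprintable bytes are the usual 0x1e, 0x1f and 0x1d separators.

```python
def directory_is_well_formed(record: bytes) -> bool:
    """Return True if the directory is a whole number of 12-byte entries."""
    if len(record) < 25:
        return False
    try:
        base = int(record[12:17])  # base address of data, from the leader
    except ValueError:
        return False
    if base <= 24 or record[base - 1:base] != b"\x1e":
        return False  # the directory must end with a field terminator
    directory = record[24:base - 1]
    return len(directory) % 12 == 0


FIELD = b"  \x1faTest\x1e"  # indicators, delimiter, code, value, terminator

# Second record from the example (tag "987"), rebuilt by hand.
good = b"00047nam a22000370a 4500" + b"987000900000" + b"\x1e" + FIELD + b"\x1d"
# Third record from the example (tag "9"), rebuilt by hand.
bad = b"00045nam a22000350a 4500" + b"9000900000" + b"\x1e" + FIELD + b"\x1d"

print(directory_is_well_formed(good))
print(directory_is_well_formed(bad))
```

This only catches directories whose total length is off. Two short tags whose missing characters happen to add up to a multiple of 12 would slip through, so it is a sanity check and not a full validator.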

If, on the other hand, it is desired that yaz-marcdump processes as many records as possible, tags that are too short should also be repaired; perhaps by padding the value with zeros?
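As a stopgap on the input side, the tags could be normalized before the data reaches yaz-marcdump. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `normalize_tags` is hypothetical. Truncating long tags mirrors the behaviour described above, while padding short tags on the left with zeros is only one possible reading of "padding with zeros" and is an assumption here.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

SAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
  <marc:record>
    <marc:datafield tag="98765" ind1=" " ind2=" ">
      <marc:subfield code="a">Test</marc:subfield>
    </marc:datafield>
  </marc:record>
  <marc:record>
    <marc:datafield tag="9" ind1=" " ind2=" ">
      <marc:subfield code="a">Test</marc:subfield>
    </marc:datafield>
  </marc:record>
</marc:collection>"""


def normalize_tags(root):
    """Force every tag attribute to three characters, in place.

    Returns a list of (old_tag, new_tag) pairs so the caller can log or
    reject the affected records instead of repairing them silently.
    """
    changes = []
    for name in ("controlfield", "datafield"):
        for field in root.iter(f"{{{MARC_NS}}}{name}"):
            tag = field.get("tag", "")
            if len(tag) != 3:
                # Assumption: truncate long tags, left-pad short ones with "0".
                fixed = tag[:3].rjust(3, "0")
                field.set("tag", fixed)
                changes.append((tag, fixed))
    return changes


root = ET.fromstring(SAMPLE)
for old, new in normalize_tags(root):
    print(f"tag {old!r} rewritten to {new!r}")
```

Returning the list of changes keeps both options open: a strict pipeline can treat a non-empty list as a reason to reject the record, while a lenient one can log it and continue.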

I'm working with YAZ version: 5.28.0 0037f7c30d59eaea44fcc40c237641784c50b582.

Sadly, I have no control over the MARC-XML data that I'm processing, because it is real-world data collected from real-world data sources.

dltj commented

Interesting report -- thank you for the detective work and the comprehensive report, Torsten. We will take a look at it.

Fixed in master and will be part of the next release.