yaz-marcdump silently creates MARC-records with an invalid directory
toKrause opened this issue · 2 comments
Hi,
I've just stumbled upon this potential issue.
If a MARC-XML record contains a controlfield or a datafield whose tag attribute value has fewer than three characters, yaz-marcdump will still create a MARC record, but the corresponding directory entry will be too short as well. This causes problems further down the line, because MARC parsers, including yaz-marcdump itself, cannot handle these records.
Interestingly, if the tag attribute value is longer than three characters, yaz-marcdump repairs the record by truncating the value.
Here is an example MARC-XML-collection:
<?xml version="1.0"?>
<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:record>
<marc:datafield tag="98765" ind1=" " ind2=" ">
<marc:subfield code="a">Test</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:datafield tag="987" ind1=" " ind2=" ">
<marc:subfield code="a">Test</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:datafield tag="9" ind1=" " ind2=" ">
<marc:subfield code="a">Test</marc:subfield>
</marc:datafield>
</marc:record>
</marc:collection>
The first record has a tag that is too long, 98765, which gets truncated to 987. The third record has a tag that is too short, 9, which is used as is. The resulting directory entry is too short: 10 characters instead of the required 12.
$ yaz-marcdump -i marcxml -o marc example.xml | sed 's/\x1d/\x1d\n/g'
00047nam a22000370a 4500987000900000� �aTest��
00047nam a22000370a 4500987000900000� �aTest��
00045nam a22000350a 45009000900000� �aTest��
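To make the problem concrete, here is a minimal Python sketch of the check a MARC reader effectively performs on the directory. It relies only on the ISO 2709 layout (24-byte leader, base address of data in leader positions 12-16, 12-byte directory entries). The function name `directory_entries` is made up for illustration, and the two byte strings are reconstructions of the second and third output records above, assuming blank indicators as in the XML.

```python
LEADER_LEN = 24          # fixed leader size in ISO 2709 / MARC 21
ENTRY_LEN = 12           # 3-char tag + 4-digit field length + 5-digit start offset
FIELD_TERMINATOR = 0x1E  # closes the directory and every field


def directory_entries(record: bytes) -> list[tuple[str, int, int]]:
    """Return the directory as (tag, field_length, start) tuples.

    Raises ValueError if the directory cannot be split into 12-byte entries.
    """
    base_address = int(record[12:17])  # leader positions 12-16
    if record[base_address - 1] != FIELD_TERMINATOR:
        raise ValueError("directory is not closed by a field terminator")
    directory = record[LEADER_LEN:base_address - 1]
    if len(directory) % ENTRY_LEN != 0:
        raise ValueError(
            f"directory is {len(directory)} bytes long, "
            f"not a multiple of {ENTRY_LEN}"
        )
    entries = []
    for offset in range(0, len(directory), ENTRY_LEN):
        entry = directory[offset:offset + ENTRY_LEN]
        tag = entry[0:3].decode("ascii")
        entries.append((tag, int(entry[3:7]), int(entry[7:12])))
    return entries


# Reconstructed from the output above (assumption: both indicators are blanks).
FIELD = b"  \x1faTest\x1e"
GOOD = b"00047nam a22000370a 4500" + b"987000900000" + b"\x1e" + FIELD + b"\x1d"
BAD = b"00045nam a22000350a 4500" + b"9000900000" + b"\x1e" + FIELD + b"\x1d"

for name, rec in (("tag 987", GOOD), ("tag 9", BAD)):
    try:
        print(name, directory_entries(rec))
    except ValueError as err:
        print(name, "rejected:", err)
```

The record built from the one-character tag fails the length check because its directory holds 10 bytes, so a reader has no way to tell where the tag ends and the field length begins.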
In both cases, yaz-marcdump produces no warning or error. Personally, I'd prefer that yaz-marcdump reject such records in both cases (as it does for records with elements that are not in the expected order).
If, on the other hand, yaz-marcdump is meant to process as many records as possible, tags that are too short should be repaired as well, perhaps by padding the value with zeros?
I'm working with YAZ version 5.28.0 (0037f7c30d59eaea44fcc40c237641784c50b582).
Sadly, I have no control over the MARC-XML data that I'm processing, because I'm working with real-world data collected from real-world data sources.
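When the input cannot be fixed at the source, one option is to screen it before conversion. The following is a minimal sketch of such a pre-check using Python's standard `xml.etree.ElementTree`. It is a consumer-side workaround, not part of YAZ, and the function name `invalid_tags` is hypothetical. It reports every controlfield or datafield whose tag attribute is not exactly three characters long, together with the 1-based position of the record in the collection.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
FIELD_ELEMENTS = {f"{{{MARC_NS}}}controlfield", f"{{{MARC_NS}}}datafield"}


def invalid_tags(xml_text: str) -> list[tuple[int, str]]:
    """Return (record_number, tag) for every field whose tag is not 3 chars."""
    root = ET.fromstring(xml_text)
    problems = []
    records = root.iter(f"{{{MARC_NS}}}record")
    for number, record in enumerate(records, start=1):
        for field in record:
            if field.tag in FIELD_ELEMENTS:
                tag = field.get("tag", "")
                if len(tag) != 3:
                    problems.append((number, tag))
    return problems


# The example collection from this report, shortened to the relevant parts.
EXAMPLE = """<marc:collection xmlns:marc="http://www.loc.gov/MARC21/slim">
<marc:record>
<marc:datafield tag="98765" ind1=" " ind2=" ">
<marc:subfield code="a">Test</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:datafield tag="987" ind1=" " ind2=" ">
<marc:subfield code="a">Test</marc:subfield>
</marc:datafield>
</marc:record>
<marc:record>
<marc:datafield tag="9" ind1=" " ind2=" ">
<marc:subfield code="a">Test</marc:subfield>
</marc:datafield>
</marc:record>
</marc:collection>"""

for number, tag in invalid_tags(EXAMPLE):
    print(f"record {number}: tag {tag!r} is not three characters long")
```

Records flagged this way can then be skipped or logged before the file is handed to yaz-marcdump. Whether to repair them instead (for example by padding) is a policy decision this sketch deliberately leaves open.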
Interesting report -- thank you for the detective work and the comprehensive report, Torsten. We will take a look at it.
Fixed in master; the fix will be part of the next release.