deutsche-nationalbibliothek/pica-rs

Transliteration of matcher expressions

  • Responsible: @niko2342
  • Status: accepted
  • Feature PR: #592

Summary

If the Unicode normalization form (NF) of the PICA+ data differs from that of a matcher expression (filter, selector or path), it is difficult to get a string match: the same visible text is encoded as different code point sequences, so a literal comparison fails. For performance reasons, converting the PICA+ data is undesirable, and therefore the matcher expression must be aligned with the NF of the data. This RFC focuses on the use case in which the user usually knows the NF of the PICA+ data and that NF does not change often. A command-line option can be added later on.
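The mismatch is easy to reproduce outside of pica-rs. The following Python sketch uses only the standard library's unicodedata module; the sample value and the helper `matches` are made up for illustration and are not taken from pica-rs or from real PICA+ data.

```python
import unicodedata

# Hypothetical subfield value as it might be stored in NFD-encoded data.
stored = unicodedata.normalize("NFD", "Goethe-Universit\u00e4t")
# The same text as a user would typically type it in an expression (NFC).
typed = unicodedata.normalize("NFC", "Goethe-Universit\u00e4t")


def matches(expression_value: str, data_value: str) -> bool:
    """Plain string comparison, with no normalization applied."""
    return expression_value == data_value


# Different code point sequences: the comparison fails.
mismatch = matches(typed, stored)
# Aligning the expression with the NF of the data makes it succeed.
aligned = matches(unicodedata.normalize("NFD", typed), stored)
```

Both strings render identically, but the NFD form stores the umlaut as a base letter followed by a combining diaeresis, so it is one code point longer than the NFC form.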

Details

To convert the NF, a new config parameter translit is introduced, in which the NF of the underlying data can be configured. The values nfc, nfd, nfkc and nfkd are valid. Because the transliteration is needed in multiple commands, the config section [global] will be used.

In the following Pica.toml all filter expressions will be transliterated to NFD:

[global]
translit = "nfd"
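What such a setting could do to a matcher expression can be sketched in Python as follows. This is only an illustration under assumptions: the function name `translit_expression`, the dictionary standing in for the parsed [global] section, and the sample filter expression are hypothetical and do not reflect the actual pica-rs implementation.

```python
import unicodedata
from typing import Optional

# Maps the values allowed for `translit` to Unicode normalization form names.
FORMS = {"nfc": "NFC", "nfd": "NFD", "nfkc": "NFKC", "nfkd": "NFKD"}


def translit_expression(expression: str, translit: Optional[str]) -> str:
    """Normalize a matcher expression to the configured NF (hypothetical helper)."""
    if translit is None:
        # Option not set: leave the expression untouched.
        return expression
    if translit not in FORMS:
        raise ValueError(f"invalid translit value: {translit!r}")
    return unicodedata.normalize(FORMS[translit], expression)


# Stands in for the parsed [global] section of Pica.toml.
global_config = {"translit": "nfd"}

# Illustrative filter expression containing a precomposed character.
expr = translit_expression("028A.a == 'M\u00fcller'", global_config.get("translit"))
```

The ASCII parts of the expression are unaffected by any of the four forms; only the non-ASCII literal changes its encoding.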

Unicode normalization is an idempotent operation, so it is not a problem to transliterate to the same NF if both NFs already match.
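The idempotence claim can be checked with a short Python sketch based on the standard library's unicodedata module. The sample string is arbitrary and the helper `is_idempotent` is introduced here only for illustration.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

# Arbitrary sample with precomposed letters and a compatibility ligature (U+FB01).
sample = "\u00c5str\u00f6m & M\u00fcller \ufb01"


def is_idempotent(form: str, text: str) -> bool:
    """Return True if normalizing twice gives the same result as normalizing once."""
    once = unicodedata.normalize(form, text)
    twice = unicodedata.normalize(form, once)
    return once == twice


results = {form: is_idempotent(form, sample) for form in FORMS}
```

Applying the configured form to an expression that is already in that form therefore leaves it unchanged.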

The advantage of this approach is that the transliteration is done only once, on the matcher expression, instead of validating and transliterating every PICA+ record, which is a very expensive operation.
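The difference in work can be illustrated with a small Python sketch that counts normalization calls. The record values, the counter, and both filter functions are hypothetical; they only model the two strategies and say nothing about how pica-rs reads records.

```python
import unicodedata

calls = 0


def normalize(form: str, text: str) -> str:
    """Wrapper around unicodedata.normalize that counts its invocations."""
    global calls
    calls += 1
    return unicodedata.normalize(form, text)


# Hypothetical record values, stored in NFD.
records = [unicodedata.normalize("NFD", v) for v in ("M\u00fcller", "Meier", "M\u00f6ller")]


def filter_by_expression(needle: str, data: list) -> list:
    """Align the expression with the data: one call, independent of record count."""
    needle = normalize("NFD", needle)
    return [r for r in data if r == needle]


def filter_by_records(needle: str, data: list) -> list:
    """Align the data with the expression: one call per record."""
    return [r for r in data if normalize("NFC", r) == needle]


calls = 0
hits_expression = filter_by_expression("M\u00fcller", records)
calls_expression = calls

calls = 0
hits_records = filter_by_records("M\u00fcller", records)
calls_records = calls
```

Both strategies select the same record, but the second one performs a normalization for every record it inspects, which grows with the size of the dump.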

Note: This feature doesn't change the semantics of the --translit option of the so-called exit commands (select, frequency). Whereas the --translit option changes the output of the command, the new config option changes the NF of the matcher expression.

Implementation

This feature can be easily implemented by adding this optional section to the config and performing the transliteration in the commands.

Related Issues