camwebb/taxon-tools

parsenames fails on non-ASCII characters

Closed this issue · 1 comments

This class of failures could probably be added to #5.

If a name contains a non-ASCII character (an accented letter, etc.), parsing fails.

For example, if the file names.txt to be parsed includes:

dc07-1|Asplenium serricula Fée
dc07-2|Asplenium serricula Fee

parsenames names.txt returns

*  Fail: 'Asplenium serricula Fée' does not match:
         Asplenium serricula Fée  <- parsed

dc07-1|
dc07-2||Asplenium||serricula|||Fee

Thanks for this issue. It took me a while to track down, but the reason is that in your example the diacritic on the 'e' was a Unicode combining character rather than a single compound 'LATIN SMALL LETTER E WITH ACUTE'. The latter is picked up in the [:alnum:] regex, but not the former. I added a set of common Unicode combining chars to the regex. Should work now.
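
For anyone hitting the same thing, here is a minimal sketch of the distinction described above. It is written in Python purely for illustration and is not the fix applied in parsenames (which extended the regex). It assumes the combining character in question was U+0301 COMBINING ACUTE ACCENT, and the helper names `combining_marks` and `to_nfc` are hypothetical. Normalizing input to NFC before parsing is one alternative way to sidestep the problem for characters that have a precomposed form.

```python
import re
import unicodedata

# Two spellings of the same author string that usually render identically.
precomposed = "F\u00e9e"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "Fe\u0301e"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT (assumed)


def combining_marks(s):
    """Return the combining characters in s (non-zero canonical combining class)."""
    return [c for c in s if unicodedata.combining(c)]


def to_nfc(s):
    """Compose base letter + combining mark into a single code point where one exists."""
    return unicodedata.normalize("NFC", s)


# Python's \w behaves like the [:alnum:] case here: it accepts the
# precomposed letter but not a bare combining mark.
word = re.compile(r"\w+")

print(len(precomposed), len(decomposed))
print([unicodedata.name(c) for c in combining_marks(decomposed)])
print(bool(word.fullmatch(precomposed)), bool(word.fullmatch(decomposed)))
print(bool(word.fullmatch(to_nfc(decomposed))))
```

Running a names file through NFC normalization first would make both lines of the example above parse the same way, but it only helps where Unicode defines a precomposed character; adding combining characters to the regex, as done in the fix, also covers sequences with no precomposed equivalent.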