parsenames fails on non-ASCII characters
Closed this issue · 1 comments
joelnitta commented
This class of failures could probably be added to #5.
If a name has a non-ASCII character (an accent, etc.), parsing fails.
For example, if the file names.txt to be parsed includes:
dc07-1|Asplenium serricula Fée
dc07-2|Asplenium serricula Fee
parsenames names.txt
returns
* Fail: 'Asplenium serricula Fée' does not match:
Asplenium serricula Fée <- parsed
dc07-1|
dc07-2||Asplenium||serricula|||Fee
camwebb commented
Thanks for this issue. It took me a while to track down, but the reason is that in your example the diacritic on the 'e' was a Unicode combining character (a plain 'e' followed by a combining acute accent) rather than the single precomposed 'LATIN SMALL LETTER E WITH ACUTE'. The latter is matched by the [:alnum:] regex class, but the former is not. I added a set of common Unicode combining characters to the regex, so it should work now.
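
For anyone hitting the same thing, here is a minimal Python sketch of the difference between the two encodings. It is not the parsenames code: Python's `\w` stands in as a rough analog of `[:alnum:]`, and the names `WORD`, `WORD_WITH_MARKS`, `is_word` and `to_nfc` are hypothetical, made up for this illustration. The widened pattern assumes the Combining Diacritical Marks block (U+0300 to U+036F) is an adequate set of "common combining chars"; the set actually added to parsenames may differ.

```python
import re
import unicodedata

# Two spellings of the same author name that look identical when rendered.
precomposed = "F\u00e9e"  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "Fe\u0301e"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Assumption: \w is used as a stand-in for [:alnum:]. It matches letters and
# digits (plus underscore) but not combining marks.
WORD = re.compile(r"\w+")

# Hypothetical widened pattern: also accept the Combining Diacritical Marks
# block, similar in spirit to the fix described above.
WORD_WITH_MARKS = re.compile(r"[\w\u0300-\u036f]+")


def is_word(pattern, text):
    """Return True if the whole string is matched by the pattern."""
    return pattern.fullmatch(text) is not None


def to_nfc(text):
    """Compose base letters and combining marks into precomposed characters."""
    return unicodedata.normalize("NFC", text)


print(is_word(WORD, precomposed))            # precomposed form matches
print(is_word(WORD, decomposed))             # combining mark breaks the match
print(is_word(WORD_WITH_MARKS, decomposed))  # widened class accepts it
print(to_nfc(decomposed) == precomposed)     # NFC makes the two forms equal
```

An alternative to widening the regex is to normalize the input to NFC before parsing, although that only helps for letter and accent pairs that have a precomposed form in Unicode.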