utils: split_page_artid should handle unicode dashes
Closed this issue · 6 comments
From @jacquerie on June 7, 2017 9:33
Expected Behavior
split_page_artid should handle unicode dashes like \u2013 and \u2010.
See: inspirehep/inspire-next#2410 (comment)
Copied from original issue: inspirehep/inspire-next#2412
From @tsgit on June 22, 2017 17:58
So here is an old email:
There are multiple forms of unicode hyphens, e.g.
U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH
U+2014 EM DASH
U+2015 HORIZONTAL BAR
and more obscure things like
U+058A ARMENIAN HYPHEN
U+05BE HEBREW PUNCTUATION MAQAF
However, for many practical concerns, like reference linking and
citation counting, it is important that only the ASCII hyphen-minus is
used. I fixed some 440 records today which had an EN DASH in the
page range in 773__c, e.g.
Changed field 773__c from '489–510' to '489-510'
Congratulations if you can spot the difference on your display with
your choice of font.
I created a bibcheck rule for replacement of the most frequent
offender, the en dash, in page ranges (see inspirehep/inspire#174).
However, this problem goes beyond page ranges: there are other fields
in 773 with en dashes in them
https://inspirehep.net/search?p=773%3A*%E2%80%93*
and many other MARC tags where the same applies.
What's labs doing about either normalizing such fields or defining
character equivalence classes in lookups?
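For illustration, a minimal sketch of the "normalizing" option, using only the standard library. The helper name normalize_dashes is hypothetical, and treating the whole Unicode category Pd (Punctuation, Dash) as equivalent to hyphen-minus is an assumption; Pd is broader than the list above, so this may be more aggressive than what is wanted for some fields.

```python
import unicodedata


def normalize_dashes(text):
    """Hypothetical helper: map every Pd (dash punctuation) character to '-'."""
    return u"".join(
        u"-" if unicodedata.category(ch) == "Pd" else ch
        for ch in text
    )


# The 773__c example from above: EN DASH becomes ASCII hyphen-minus.
print(normalize_dashes(u"489\u2013510"))
```

Characters outside Pd are left untouched, so anything the category misses would still need its own rule.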
From @tsgit on June 22, 2017 17:59
The unicode tables themselves are useful, and so is the link you dug out.
I particularly like the "See Also" feature at fileformat.info, e.g.
http://www.fileformat.info/info/unicode/char/2d/index.htm
similarly for apostrophe
http://www.fileformat.info/info/unicode/char/0027/index.htm
and space
http://www.fileformat.info/info/unicode/char/0020/index.htm
there are categories
http://www.fileformat.info/info/unicode/category/index.htm
e.g.
http://www.fileformat.info/info/unicode/category/Pd/list.htm
http://www.fileformat.info/info/unicode/category/Zs/list.htm
From @tsgit on June 22, 2017 18:12
interestingly the "Hyphen Bullet"
http://www.fileformat.info/info/unicode/char/2043/index.htm
is in category Punctuation Other, not in Punctuation Dash
http://www.fileformat.info/info/unicode/category/Po/list.htm
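The same category information is available from Python's standard unicodedata module, which makes it easy to check where a given character lands. A small sketch (the describe helper is a hypothetical name; the code points are taken from this thread):

```python
import unicodedata


def describe(codepoint):
    """Hypothetical helper: return (general category, name) for a code point."""
    ch = chr(codepoint)
    return unicodedata.category(ch), unicodedata.name(ch)


for cp in (0x002D, 0x2010, 0x2013, 0x058A, 0x05BE, 0x2043):
    category, name = describe(cp)
    print("U+%04X  %s  %s" % (cp, category, name))
```

Here U+2043 reports Po while the others report Pd, so a rule based on Pd alone would not touch the hyphen bullet.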
From @michamos on June 23, 2017 7:50
In French, lists are traditionally done with dashes instead of bullets; I guess that's the proper unicode character for it.
> What's labs doing about either normalizing such fields or defining
> character equivalence classes in lookups?
On labs, no field should contain dashes as a range separator. Instead, fields have been split into start and end of range (e.g. https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/hep.yml#L1145-L1154 for the publication note). So the handling of dashes has to happen when writing into the record.
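As a rough sketch of what "handling dashes when writing into the record" could look like: split the incoming string on any of the common dash characters and store the two halves separately. The function name split_page_range and the tuple it returns are hypothetical and are not the actual split_page_artid implementation; the set of accepted dashes (U+002D and U+2010 through U+2015) is an assumption based on the list earlier in this thread.

```python
import re

# ASCII hyphen-minus plus U+2010..U+2015, with optional surrounding whitespace.
_RANGE_SEPARATOR = re.compile(u"\\s*[-\u2010-\u2015]+\\s*")


def split_page_range(value):
    """Hypothetical helper: return (start, end); end is None for a single page."""
    parts = _RANGE_SEPARATOR.split(value.strip(), maxsplit=1)
    if len(parts) == 2:
        return parts[0], parts[1]
    return parts[0], None


print(split_page_range(u"489\u2013510"))
print(split_page_range(u"123"))
```

Once the range is stored as two values, the dash variant used in the source no longer matters for lookups.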
From @michamos on June 23, 2017 8:23
What about using unidecode here + post-processing for stripping repeated dashes? artid should be ascii AFAIK.
In [1]: from unidecode import unidecode
In [2]: dashes = (u'\u002d', u'\u2010', u'\u2011', u'\u2012', u'\u2013', u'\u2014', u'\u2015', u'\u058a', u'\u05be', u'\u2043')
In [3]: [unidecode(dash) for dash in dashes]
Out[3]: ['-', '-', '-', '-', '-', '--', '--', '-', '', '--']
U+05BE looks like a bug in unidecode. I sent a PR in avian2/unidecode#12.
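The post-processing step could be as small as a regex that collapses runs of hyphens. A sketch, applied to the transliterated strings from Out[3] above so that it does not depend on unidecode being installed (collapse_dashes is a hypothetical name):

```python
import re


def collapse_dashes(text):
    """Hypothetical helper: collapse any run of two or more '-' into one."""
    return re.sub(r"-{2,}", "-", text)


# The unidecode results quoted in Out[3].
transliterated = ['-', '-', '-', '-', '-', '--', '--', '-', '', '--']
collapsed = [collapse_dashes(s) for s in transliterated]
print(collapsed)
```

Note that the empty string produced for U+05BE stays empty, so that case still needs the fix in unidecode itself or a separate rule.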