inspirehep/inspire-schemas

utils: split_page_artid should handle unicode dashes

Closed this issue · 6 comments

From @jacquerie on June 7, 2017 9:33

Expected Behavior

split_page_artid should handle unicode dashes like \u2013 and \u2010.

See: inspirehep/inspire-next#2410 (comment)

Copied from original issue: inspirehep/inspire-next#2412

From @kaplun on June 22, 2017 12:21

@tsgit at some point produced an exhaustive list of unicode dashes, which I guess we should support in general.

From @tsgit on June 22, 2017 17:58

So here is the old email:

There are multiple forms of unicode hyphens, e.g.

U+002D HYPHEN-MINUS
U+2010 HYPHEN
U+2011 NON-BREAKING HYPHEN
U+2012 FIGURE DASH
U+2013 EN DASH
U+2014 EM DASH
U+2015 HORIZONTAL BAR

and more obscure things like

U+058A ARMENIAN HYPHEN
U+05BE HEBREW PUNCTUATION MAQAF

However, for many practical concerns, like reference linking and
citation counting, it is important that only the ASCII hyphen-minus is
used. I fixed some 440 records today which had an en-dash in the
page-range in 773__c, e.g.

Changed field 773__c from '489–510' to '489-510'

Congratulations if you can spot the difference on your display with
your choice of font.

I created a bibcheck rule for replacement of the most frequent
offender -- en-dash -- in page-ranges, see
inspirehep/inspire#174. However, this problem
goes beyond just page-range: there are other fields in 773 with
en-dash in them

https://inspirehep.net/search?p=773%3A*%E2%80%93*

and many other MARC tags where the same applies.

What's labs doing about either normalizing such fields or defining
character equivalence classes in lookups?

From @tsgit on June 22, 2017 18:12

interestingly the "Hyphen Bullet"

http://www.fileformat.info/info/unicode/char/2043/index.htm

is in category Punctuation Other, not in Punctuation Dash

http://www.fileformat.info/info/unicode/category/Po/list.htm
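
A quick way to check this locally is Python's standard `unicodedata` module. The sketch below is only an illustration in Python 3 syntax: the `CANDIDATES` list and the `categories` helper are names made up here, and the code points are the ones mentioned in this thread.

```python
import unicodedata

# Code points mentioned in this thread (illustrative list, not exhaustive).
CANDIDATES = [
    0x002D, 0x2010, 0x2011, 0x2012, 0x2013,
    0x2014, 0x2015, 0x058A, 0x05BE, 0x2043,
]


def categories(code_points):
    """Map each code point to its Unicode general category (e.g. 'Pd', 'Po')."""
    return {cp: unicodedata.category(chr(cp)) for cp in code_points}


for cp, cat in categories(CANDIDATES).items():
    print("U+%04X %s %s" % (cp, cat, unicodedata.name(chr(cp))))
```

The practical consequence is that a filter based only on the `Pd` (Punctuation, Dash) category would miss U+2043, so an explicit list of characters is probably safer than a category test.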

From @michamos on June 23, 2017 7:50

In French, lists are traditionally done with dashes instead of bullets, so I guess that's the proper unicode character for it.

> What's labs doing about either normalizing such fields or defining
> character equivalence classes in lookups?

On labs, no field should contain dashes as a range separator. Instead, fields have been split into start and end of range (e.g. https://github.com/inspirehep/inspire-schemas/blob/master/inspire_schemas/records/hep.yml#L1145-L1154 for the publication note). So the handling of dashes has to happen when writing into the record.
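
As a rough sketch of what "handling dashes when writing into the record" could look like: the function below splits a page range on any of the dashes listed in this thread and returns separate start and end values. It is not the actual `split_page_artid` implementation; `split_page_range`, `DASHES` and the regular expression are hypothetical names and choices made for illustration, and it assumes Python 3.

```python
import re

# Dash-like characters listed in this thread; an assumption, not a complete set.
DASHES = "\u002d\u2010\u2011\u2012\u2013\u2014\u2015\u058a\u05be\u2043"

# start, one or more dash characters, end; surrounding whitespace is tolerated.
_RANGE = re.compile(r"^\s*(\w+)\s*[%s]+\s*(\w+)\s*$" % re.escape(DASHES))


def split_page_range(value):
    """Return (page_start, page_end), or (None, None) if value is not a range."""
    match = _RANGE.match(value)
    if match is None:
        return None, None
    return match.group(1), match.group(2)
```

With this, `split_page_range(u'489\u2013510')` and `split_page_range('489-510')` both give `('489', '510')`, so the stored record never contains the separator at all.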

From @michamos on June 23, 2017 8:23

What about using unidecode here, plus post-processing to strip repeated dashes? artid should be ASCII AFAIK.

In [1]: from unidecode import unidecode

In [2]: dashes = (u'\u002d', u'\u2010', u'\u2011', u'\u2012', u'\u2013', u'\u2014', u'\u2015', u'\u058a', u'\u05be', u'\u2043')

In [3]: [unidecode(dash) for dash in dashes]
Out[3]: ['-', '-', '-', '-', '-', '--', '--', '-', '', '--']

The empty string for U+05BE looks like a bug in unidecode. I sent a PR in avian2/unidecode#12.
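
If depending on unidecode's mapping for these characters turns out to be fragile, an alternative would be an explicit translation table from the standard library. This is a different technique from the unidecode suggestion above, sketched under the assumption that the characters listed in this thread are the ones we care about; `normalize_dashes`, `DASHES` and the helper names are hypothetical, and the code assumes Python 3.

```python
import re

# Dash-like characters listed in this thread; extend as needed.
DASHES = "\u002d\u2010\u2011\u2012\u2013\u2014\u2015\u058a\u05be\u2043"

# Map every listed character to the ASCII hyphen-minus.
_TABLE = {ord(dash): "-" for dash in DASHES}

# Post-processing step: collapse runs of hyphens into a single one.
_REPEATED = re.compile(r"-{2,}")


def normalize_dashes(text):
    """Replace the listed dashes with '-' and strip repeated dashes."""
    return _REPEATED.sub("-", text.translate(_TABLE))
```

Here `normalize_dashes(u'489\u2013510')` returns `'489-510'`, and U+05BE is mapped to `'-'` rather than being dropped. The trade-off is that only the dashes are touched; other non-ASCII characters would still need unidecode or similar.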