kba/hocr-spec

lang tags: using BCP47 instead of ISO639-1 codes

Opened this issue · 2 comments

eroux commented

Hello, and first of all thank you very much for your work on hOCR! I'm part of an organization that gets hOCR from Google Books, and I'm quite new to the specification. Something that caught my eye is the reference to ISO 639-1 for language codes. Since ISO 639-1 doesn't cover all languages, I think referring to BCP47 would be more generic and future-proof. What do you think? It's a backward-compatible change, since ISO 639-1 codes are valid BCP47 tags (at least to a first approximation).
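
To illustrate the compatibility point, here is a rough sketch in Python. It assumes a deliberately simplified pattern covering only the language, script, region, and variant subtags of BCP47 (RFC 5646); extensions, private-use subtags, extlang, and grandfathered tags are left out, and the subtags are not checked against the IANA registry. The names `SIMPLE_LANGTAG` and `looks_like_bcp47` are made up for this example.

```python
import re

# Simplified subset of the BCP47 (RFC 5646) "langtag" syntax:
# language[-script][-region][-variant]*
# This is a syntax-only approximation, not a full validator.
SIMPLE_LANGTAG = re.compile(
    r"[A-Za-z]{2,3}"                                   # primary language subtag
    r"(?:-[A-Za-z]{4})?"                               # optional script, e.g. Tibt
    r"(?:-(?:[A-Za-z]{2}|[0-9]{3}))?"                  # optional region, e.g. TW or 419
    r"(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*"  # optional variants, e.g. 1996
)


def looks_like_bcp47(tag: str) -> bool:
    """Return True if the whole tag matches the simplified pattern above."""
    return SIMPLE_LANGTAG.fullmatch(tag) is not None


# Two-letter ISO 639-1 codes already fit the pattern...
iso639_1_samples = ["en", "de", "fr", "bo", "zh"]
# ...and so do tags that ISO 639-1 alone cannot express.
richer_tags = ["bo-Tibt", "zh-Hant-TW", "sr-Latn", "grc"]

results = {tag: looks_like_bcp47(tag) for tag in iso639_1_samples + richer_tags}
```

Under these assumptions every sample tag is accepted, while something like `en_US` (underscore instead of hyphen) is rejected.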

kba commented

I don't feel strongly either way, but it might be a good opportunity to align with how ALTO and PAGE handle language/script.

In ALTO we decided on using what xsd:language expects, i.e. RFC 1766, which in turn references ISO 639-1. IIUC this might not be expressive enough for your purposes?

eroux commented

Thanks for your answer!

My understanding is that the latest XSD spec requires BCP47 language tags, while the 1.0 spec indeed refers to RFC 1766. I can't think of any reason to recommend RFC 1766 over BCP47, but perhaps there is one?