standard/norm for LanguageSimpleType
bertsky opened this issue · 4 comments
In PAGE-XML there's @language / @primaryLanguage of type pc:LanguageSimpleType to identify the natural language of segments. Its documentation refers to ISO 639.x 2016-07-14, which I cannot make sense of. There's 639-1, 639-2 and 639-3, but AFAICT no standard that allows strings of arbitrary length (as in the PAGE-XML enumeration), and nothing shows up for 2016-07-14. This is problematic because exact 639 mappings are needed for software implementation and interoperability.
Take Norwegian for example:
<enumeration value="Norwegian"/>
<enumeration value="Norwegian Bokmål"/>
<enumeration value="Norwegian Nynorsk"/>
According to 639 these could be named no/nb/nn or nor/nob/nno, but how do we map that automatically? And where do the strings in PAGE-XML derive from?
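For illustration, a minimal sketch of the kind of hand-written lookup this would currently require. It covers only the three Norwegian values quoted above; the names PAGE_LANGUAGE_CODES and page_language_to_iso639 are hypothetical and not part of PAGE-XML or any existing library.

```python
# Hypothetical hand-written lookup:
# PAGE-XML enumeration string -> (two-letter ISO 639-1, three-letter ISO 639-3).
# Only the three Norwegian values quoted in this issue are covered.
PAGE_LANGUAGE_CODES = {
    "Norwegian": ("no", "nor"),
    "Norwegian Bokmål": ("nb", "nob"),
    "Norwegian Nynorsk": ("nn", "nno"),
}


def page_language_to_iso639(name, part=3):
    """Return the ISO 639-1 (part=1) or ISO 639-3 (part=3) code for a PAGE-XML name."""
    if part not in (1, 3):
        raise ValueError("part must be 1 or 3")
    try:
        alpha2, alpha3 = PAGE_LANGUAGE_CODES[name]
    except KeyError:
        # Fail loudly instead of guessing: an unmapped string needs a manual decision.
        raise KeyError(f"no ISO 639 mapping known for {name!r}") from None
    return alpha2 if part == 1 else alpha3


print(page_language_to_iso639("Norwegian Nynorsk"))          # nno
print(page_language_to_iso639("Norwegian Bokmål", part=1))   # nb
```

Every further enumeration value would need the same manual treatment, which is exactly the interoperability problem.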
(Likewise, IIUC, only the first part of the ScriptSimpleType enums is actually ISO 15924, so these would have to be split at -.)
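A rough sketch of that split, assuming the enumeration values look like "Latn - Latin" (code, hyphen, English name). The sample values and the helper name script_code are assumptions for illustration; the actual schema should be checked before relying on this.

```python
import re

# ISO 15924 codes are four letters with an initial capital, e.g. "Latn".
_SCRIPT_CODE = re.compile(r"[A-Z][a-z]{3}")


def script_code(page_value):
    """Return the part before the first "-" if it looks like an ISO 15924 code, else None."""
    head = page_value.partition("-")[0].strip()
    return head if _SCRIPT_CODE.fullmatch(head) else None


# Assumed sample values, not copied from the schema.
print(script_code("Latn - Latin"))  # Latn
print(script_code("other"))         # None
```

The pattern only checks the shape of the code, not whether it is actually registered in ISO 15924.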
So IMO what needs to be done is:
- In the next namespace version of PAGE-XML, change ScriptSimpleType to conform to ISO 15924 and LanguageSimpleType to conform to ISO 639.
- Provide a (manually crafted) transformation stylesheet mapping the existing, non-standardized xs:restriction strings to the new, standard ones. (That stylesheet can then be used by applications/users to update from the 2019 schema, or independently to interoperate with language and script values for PAGE-XML files up to 2019.)
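The second point asks for a stylesheet; purely as an illustration of what such a transformation would do, here is a sketch in Python using the standard library's ElementTree. The function name convert_languages, the one-entry mapping and the simplified, namespace-free sample document are all assumptions; a real conversion would need the complete, manually crafted mapping and would also have to update the schema namespace.

```python
import xml.etree.ElementTree as ET

# Attribute names taken from this issue; others may exist in the schema.
LANGUAGE_ATTRIBUTES = ("language", "primaryLanguage")


def convert_languages(root, mapping):
    """Rewrite language attributes in place; return the set of values without a mapping."""
    unmapped = set()
    for element in root.iter():
        for name in LANGUAGE_ATTRIBUTES:
            value = element.get(name)
            if value is None:
                continue
            if value in mapping:
                element.set(name, mapping[value])
            else:
                # Leave unknown strings untouched and report them to the caller.
                unmapped.add(value)
    return unmapped


# Simplified sample without the PAGE namespace, for illustration only.
sample = (
    '<PcGts><Page primaryLanguage="Norwegian Nynorsk">'
    '<TextRegion primaryLanguage="Klingon"/></Page></PcGts>'
)
root = ET.fromstring(sample)
leftover = convert_languages(root, {"Norwegian Nynorsk": "nno"})
print(leftover)
print(ET.tostring(root, encoding="unicode"))
```

Returning the unmapped values instead of raising makes it easy to see which strings the mapping still lacks.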
This is @kba's workaround for the ISO 639 codes in Python (using https://github.com/LuminosoInsight/langcodes):
https://github.com/kba/page-to-alto/blob/f1b67bdf70b24e6d6904ad4ba4e83ce276923aca/ocrd_page_to_alto/utils.py#L29
Oh, and there's a file here, documentation/Language List (from ISO).xlsx, but it does not contain a complete mapping of all language strings against their 639 codes.