standard/norm for LanguageSimpleType

Question

bertsky opened this issue 4 years ago · 4 comments

In PAGE-XML, the attributes @language and @primaryLanguage, of type pc:LanguageSimpleType, identify the natural language of segments. The type's documentation refers to "ISO 639.x 2016-07-14", which I cannot make sense of. There are ISO 639-1, 639-2 and 639-3, but AFAICT none of them allows strings of arbitrary length (as in the PAGE-XML enumeration), and nothing shows up for 2016-07-14. This is problematic because exact ISO 639 mappings are needed for software implementation and interoperability.

Take Norwegian for example:

    <enumeration value="Norwegian"/>
    <enumeration value="Norwegian Bokmål"/>
    <enumeration value="Norwegian Nynorsk"/>

According to ISO 639, these would be no/nb/nn (639-1) or nor/nob/nno (639-2 and 639-3). But how do we map that automatically, and where do the strings in PAGE-XML derive from?
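To make the mapping problem concrete, here is a minimal sketch of what a hand-maintained lookup could look like in Python. The names `Iso639`, `PAGE_LANGUAGE_TO_ISO639` and `page_language_to_iso639` are hypothetical and not part of PAGE-XML or any library; only the three Norwegian values from the example are filled in.

```python
from typing import NamedTuple, Optional


class Iso639(NamedTuple):
    alpha2: Optional[str]  # ISO 639-1 code; not every language has one
    alpha3: str            # ISO 639-3 code


# Hypothetical table: PAGE-XML LanguageSimpleType string -> ISO 639 codes.
# A real table would need one hand-checked entry per enumeration value.
PAGE_LANGUAGE_TO_ISO639 = {
    "Norwegian": Iso639("no", "nor"),
    "Norwegian Bokmål": Iso639("nb", "nob"),
    "Norwegian Nynorsk": Iso639("nn", "nno"),
}


def page_language_to_iso639(value: str) -> Optional[Iso639]:
    """Return the ISO 639 codes for a PAGE-XML language string, or None if unmapped."""
    return PAGE_LANGUAGE_TO_ISO639.get(value)
```

Returning None for unmapped strings keeps the gaps visible instead of guessing a code.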

Answer 1 · 2021-01-22T15:10:41.000Z

(Likewise, IIUC, only the first part of the ScriptSimpleType enums is actually ISO 15924, so these would have to be split at -.)
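A rough sketch of that split in Python, assuming the script values have the shape "Latn - Latin" (code, then " - ", then a descriptive name). The function name `page_script_to_iso15924` is hypothetical, and the check only tests the four-letter shape of the code, not whether it is actually registered in ISO 15924.

```python
import re
from typing import Optional

# ISO 15924 alphabetic codes are four letters, the first one capitalised (e.g. "Latn").
_ISO15924_SHAPE = re.compile(r"[A-Z][a-z]{3}")


def page_script_to_iso15924(value: str) -> Optional[str]:
    """Return the leading four-letter code of a PAGE-XML script string, or None."""
    # Split only at the first " - " so that hyphens inside the name stay intact.
    code = value.split(" - ", 1)[0].strip()
    return code if _ISO15924_SHAPE.fullmatch(code) else None
```

Values that do not start with a code of that shape (for example a catch-all like "other") come back as None and would have to be handled separately.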

Answer 2 · 2021-02-15T11:40:28.000Z

So IMO what needs to be done is:

1. In the next namespace version of PAGE-XML, change ScriptSimpleType to conform to ISO 15924 and LanguageSimpleType to conform to ISO 639.
2. Provide a (manually crafted) transformation stylesheet mapping the existing, non-standardized xs:restriction strings to the new, standard ones. (That stylesheet can then be used by applications/users to update from the 2019 schema, or independently to interoperate with language and script values for PAGE-XML files up to 2019.)
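The proposal is for a stylesheet, but the same idea can be sketched in Python with the standard library to show what such a transformation would do. This is only an illustration under assumptions: `LANGUAGE_MAP` and `migrate_languages` are hypothetical names, the target values are assumed to be ISO 639-3 codes, and only the @language and @primaryLanguage attributes mentioned above are touched.

```python
import xml.etree.ElementTree as ET

# Illustrative subset only; a real mapping would cover every enumeration value.
LANGUAGE_MAP = {
    "Norwegian": "nor",
    "Norwegian Bokmål": "nob",
    "Norwegian Nynorsk": "nno",
}

# Attribute names taken from the issue text; they carry no namespace prefix.
LANGUAGE_ATTRIBUTES = ("language", "primaryLanguage")


def migrate_languages(root: ET.Element) -> int:
    """Rewrite mapped language attributes in place and return the number of changes."""
    changed = 0
    for element in root.iter():
        for attribute in LANGUAGE_ATTRIBUTES:
            old = element.get(attribute)
            if old in LANGUAGE_MAP:
                element.set(attribute, LANGUAGE_MAP[old])
                changed += 1
    return changed
```

Values without an entry in the map are left untouched, so the returned count can be compared against the number of language attributes to spot unmapped strings. Changing the namespace URI of the document itself is deliberately left out of this sketch.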

Answer 3 · 2021-04-12T10:06:23.000Z

This is @kba's workaround for the ISO 639 codes in Python (using https://github.com/LuminosoInsight/langcodes):
https://github.com/kba/page-to-alto/blob/f1b67bdf70b24e6d6904ad4ba4e83ce276923aca/ocrd_page_to_alto/utils.py#L29

Answer 4 · 2021-04-12T16:06:28.000Z

Oh, and there's a file here: documentation/Language List (from ISO).xlsx – but it does not contain a complete mapping of all language strings to their ISO 639 codes.