PRImA-Research-Lab/PAGE-XML

standard/norm for LanguageSimpleType

bertsky opened this issue · 4 comments

In PAGE-XML, the attributes @language and @primaryLanguage have the type pc:LanguageSimpleType and identify the natural language of segments. The type's documentation refers to "ISO 639.x 2016-07-14", which I cannot make sense of. There are ISO 639-1, 639-2 and 639-3, but AFAICT none of them allows strings of arbitrary length (as in the PAGE-XML enumeration), and I can find nothing for 2016-07-14. This is problematic because software implementations and interoperability need an exact mapping to ISO 639 codes.

Take Norwegian for example:

    <enumeration value="Norwegian"/>
    <enumeration value="Norwegian Bokmål"/>
    <enumeration value="Norwegian Nynorsk"/>

In ISO 639 terms these would be no/nb/nn (639-1) or nor/nob/nno (639-2 and 639-3). But how do we map that automatically, and where do the PAGE-XML strings derive from?
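To make the mapping problem concrete, here is a minimal Python sketch of the kind of lookup table an implementation currently has to maintain by hand. The table name `PAGE_LANGUAGE_TO_ISO639` and the function `page_language_to_iso639` are hypothetical, and only the three Norwegian strings from the example above are filled in; a real table would need an entry for every value of the enumeration.

```python
# Hypothetical, hand-written table: PAGE-XML enumeration string
# -> (two-letter ISO 639-1 code, three-letter ISO 639-2/639-3 code).
# Only the Norwegian entries from the example are included.
PAGE_LANGUAGE_TO_ISO639 = {
    "Norwegian": ("no", "nor"),
    "Norwegian Bokmål": ("nb", "nob"),
    "Norwegian Nynorsk": ("nn", "nno"),
}


def page_language_to_iso639(name, letters=3):
    """Return the ISO 639 code for a PAGE-XML language string.

    `letters` selects the two-letter or the three-letter code.
    Returns None when the string has no entry in the table.
    """
    if letters not in (2, 3):
        raise ValueError("letters must be 2 or 3")
    codes = PAGE_LANGUAGE_TO_ISO639.get(name)
    if codes is None:
        return None
    return codes[0] if letters == 2 else codes[1]
```

Returning None for unknown strings, rather than guessing, keeps the gaps in the table visible to the caller.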

(Likewise, IIUC, only the first part of each ScriptSimpleType enumeration value is actually an ISO 15924 code, so these would have to be split at the -.)
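The split could look roughly like the following Python sketch. It assumes the values have the shape "code - name" (for example "Latn - Latin", which is an assumption about the enumeration and not checked against the schema) and that anything without a four-letter prefix, such as a catch-all value, should be passed through without a code. The function name `split_script_value` is hypothetical.

```python
import re

# ISO 15924 alphabetic codes are four letters, the first one upper-case.
_ISO15924_CODE = re.compile(r"[A-Z][a-z]{3}")


def split_script_value(value):
    """Split a ScriptSimpleType string into (ISO 15924 code, script name).

    Values that do not start with a four-letter code are returned
    as (None, value) so the caller can decide how to handle them.
    """
    code, _, name = value.partition(" - ")
    code = code.strip()
    if _ISO15924_CODE.fullmatch(code):
        return code, name.strip()
    return None, value.strip()
```

Splitting only at the first " - " keeps any further hyphens inside the script name intact.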

So IMO what needs to be done is:

  1. In the next namespace version of PAGE-XML, change ScriptSimpleType to conform to ISO 15924 and LanguageSimpleType to conform to ISO 639.
  2. Provide a (manually crafted) transformation stylesheet mapping the existing, non-standardized xs:restriction strings to the new, standard ones. (That stylesheet can then be used by applications/users to update from the 2019 schema, or independently to interoperate with language and script values for PAGE-XML files up to 2019.)
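As a rough illustration of step 2, here is a Python sketch of the same idea. It is not the proposed stylesheet, just the equivalent rewrite done with the standard library's `xml.etree.ElementTree`. The names `LANGUAGE_MAP`, `LANGUAGE_ATTRIBUTES` and `convert_languages` are hypothetical, the map holds only the Norwegian example, and the choice of three-letter codes as the target is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written map from pre-standard PAGE-XML strings
# to three-letter ISO 639 codes (assumed target; Norwegian example only).
LANGUAGE_MAP = {
    "Norwegian": "nor",
    "Norwegian Bokmål": "nob",
    "Norwegian Nynorsk": "nno",
}

# The attributes named in this issue; they are unqualified, so the
# document's namespace does not matter for the lookup.
LANGUAGE_ATTRIBUTES = ("language", "primaryLanguage")


def convert_languages(root):
    """Rewrite language attributes in place on an ElementTree element.

    Returns the set of attribute values that had no entry in
    LANGUAGE_MAP and were therefore left unchanged.
    """
    unmapped = set()
    for element in root.iter():
        for attribute in LANGUAGE_ATTRIBUTES:
            old = element.get(attribute)
            if old is None:
                continue
            if old in LANGUAGE_MAP:
                element.set(attribute, LANGUAGE_MAP[old])
            else:
                unmapped.add(old)
    return unmapped
```

Typical use would be `root = ET.parse(path).getroot()` followed by `convert_languages(root)`. Collecting the unmapped values, instead of failing on the first one, gives a list of strings that still need a manual decision.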

Oh, and there's a file here, documentation/Language List (from ISO).xlsx, but it does not contain a complete mapping of all language strings to their 639 codes.