clarin-eric/ParlaMint

Displaying USAS tags in concordancers

Closed this issue · 5 comments

The conversion of TEI to vertical files should also implement USAS semantics for tokens and MWEs. Ideally:

  • the original pymusas tags are present and multivalued (probably MULTISEP ",")
  • the categories are given as their glosses
  • categories are also multivalued; the separator must never occur in the glosses, so the safest (if clumsy) choice is probably MULTISEP "|"

e.g.

<phr sem_tag="Z1mf,Z3c" sem="Z1: Personal names">
Mr.             Mr.           Mr.     PROPN   Number=Sing     Z1mf,Z3c        Z1: Personal names      t1
President       President       President       PROPN   Number=Sing     Z1mf,Z3c        Z1: Personal names      t2
<g/>
</phr>

Currently the code for converting TEI to vertical is just a stub, for <phr> elements:

<!-- MWEs with semantic information -->
<xsl:template match="tei:phr[@type = 'sem']">
  <xsl:copy>
    <xsl:attribute name="sem_all" select="replace(@function, ',', '|')"/>
    <xsl:variable name="sem">
      <xsl:for-each select="tokenize(@ana, ' ')">
        <!-- Here we a) assume that the catDesc is only in English and b) that the extended pointer resolves to a local reference -->
        <xsl:value-of select="key('id', substring-after(., ':'), $rootHeader)/tei:catDesc/tei:term"/>
        <xsl:text>|</xsl:text>
      </xsl:for-each>
    </xsl:variable>
    <xsl:attribute name="sem" select="replace($sem, '\|$', '')"/>
    <xsl:text>&#10;</xsl:text>
    <xsl:apply-templates/>
  </xsl:copy>
  <xsl:text>&#10;</xsl:text>
</xsl:template>

and for positional attributes:
<!-- Part 2 are semantic attributes, but they appear in MTed vert only, so check if they exist and insert only if they do -->
<xsl:variable name="part2">
  <xsl:if test="$token/@function and $token/@ana">
    <xsl:variable name="sem-all" select="replace($token/@function, ',', '|')"/>
    <xsl:variable name="sem">
      <xsl:for-each select="tokenize($token/@ana, ' ')">
        <!-- Here we a) assume that the catDesc is only in English and b) that the extended pointer resolves to a local reference -->
        <xsl:value-of select="key('id', substring-after(., ':'), $rootHeader)/tei:catDesc/tei:term"/>
        <xsl:text>|</xsl:text>
      </xsl:for-each>
    </xsl:variable>
    <xsl:value-of select="concat($sem-all, '&#9;', replace($sem, '\|$', ''), '&#9;')"/>
  </xsl:if>
</xsl:variable>
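
For anyone who wants to reason about the two snippets above without running the XSLT, here is a rough Python sketch of the same logic. It is only an illustration, not the ParlaMint code: the `taxonomy` dictionary stands in for the `key('id', ...)` lookup in the root header, and the function name, the `sem:` pointer prefix and the sample glosses are assumptions. Joining the glosses with the separator, instead of appending it after every item, means no trailing `|` has to be stripped afterwards.

```python
def usas_columns(function, ana, taxonomy, sep="|"):
    """Return (tags, glosses) for one token or <phr>.

    function: value of @function, comma-separated USAS tags
    ana:      value of @ana, space-separated prefixed pointers
    taxonomy: hypothetical dict mapping category id -> English gloss
    """
    tags = function.replace(",", sep)
    glosses = []
    for pointer in ana.split():
        # Mirror substring-after(., ':'): keep what follows the first colon.
        _, _, cat_id = pointer.partition(":")
        gloss = taxonomy.get(cat_id, "")
        # The multi-value separator must never occur inside a gloss.
        if sep in gloss:
            raise ValueError(f"separator {sep!r} occurs in gloss {gloss!r}")
        glosses.append(gloss)
    return tags, sep.join(glosses)


# Sample data for illustration only (pointer prefix and glosses are assumed).
taxonomy = {"Z1": "Z1: Personal names", "Z3": "Z3: Other proper names"}
tags, glosses = usas_columns("Z1mf,Z3c", "sem:Z1 sem:Z3", taxonomy)
print(tags + "\t" + glosses)
```

The explicit check on the separator is the point of the sketch: if a gloss ever contained `|`, the conversion would fail loudly rather than silently produce an extra value.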

If anybody, esp. @matyaskopp or @perayson, has an opinion on this, I'd be glad to hear it.

I am in favour of keeping separators the same because it is starting to be quite complicated:

  • use , for separating tags
  • use / inside tags for searching
  • if you want to use glosses, then you can introduce a new separator, because it is not present in any data format

So I am suggesting three columns (word candidate):

  • sem_all_tags: G1.2/S2mf,I3.1/S2mf,P1/S2mf,A7+,A1.2+
  • sem_tag: G1.2/S2
  • sem_gloss: Politics|People (not sure about separator and column name)
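
To make the relation between the three columns concrete, here is a small Python sketch of how `sem_tag` and `sem_gloss` might be derived from `sem_all_tags`. It rests on assumptions that are not confirmed anywhere in this thread: that `sem_tag` is the first tag with its markers (`m`, `f`, `+`, ...) removed, that a category looks like a capital letter followed by a number and optional dotted sub-levels, and that glosses come from a lookup table (the `GLOSSES` dictionary here is hypothetical and only covers the example).

```python
import re

# Assumption: a category is e.g. "G1.2" or "S2"; whatever follows it in a
# tag (m, f, c, +, -, ...) is treated as a marker and dropped.
CATEGORY = re.compile(r"[A-Z]\d+(?:\.\d+)*")

# Hypothetical gloss table, limited to the example in this comment.
GLOSSES = {"G1.2": "Politics", "S2": "People"}


def first_categories(sem_all_tags):
    """Return the categories of the first tag, e.g. ['G1.2', 'S2']."""
    first_tag = sem_all_tags.split(",")[0]
    cats = []
    for part in first_tag.split("/"):
        match = CATEGORY.match(part)
        if match:
            cats.append(match.group(0))
    return cats


sem_all_tags = "G1.2/S2mf,I3.1/S2mf,P1/S2mf,A7+,A1.2+"
cats = first_categories(sem_all_tags)
sem_tag = "/".join(cats)
sem_gloss = "|".join(GLOSSES[c] for c in cats)
print(sem_tag, sem_gloss)
```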

In the meantime I have improved the code for converting USAS to vertical a bit, and this is the current snippet from the registry that gives the names of the attributes and the multi-value separators:

ATTRIBUTE usas_tags {
  TYPE "MD_MGD"
  LABEL "USAS tags"
  MULTIVALUE yes
  MULTISEP ","
}
ATTRIBUTE usas_cats {
  TYPE "MD_MGD"
  LABEL "USAS categories"
  MULTIVALUE yes
  MULTISEP " "
}
ATTRIBUTE usas_full {
  TYPE "MD_MGD"
  LABEL "USAS glosses"
  MULTIVALUE yes
  MULTISEP "|"
}
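
As a rough mental model of what these three settings mean for a single token, here is a Python sketch that splits each column on its MULTISEP. It is not how Manatee implements multi-value attributes, just an illustration; the sample row is made up.

```python
# Separators copied from the registry snippet above.
MULTISEP = {"usas_tags": ",", "usas_cats": " ", "usas_full": "|"}


def split_values(attribute, value):
    """Split one column value into its individual values."""
    return value.split(MULTISEP[attribute])


# Made-up row, for illustration only.
row = {
    "usas_tags": "G1.2/S2mf,I3.1/S2mf",
    "usas_cats": "G1.2 S2",
    "usas_full": "Politics|People",
}
for name, value in row.items():
    print(name, split_values(name, value))
```

Note that a space works as separator for usas_cats only because category codes contain no spaces; the glosses do, which is why usas_full needs a different separator.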

Maybe this should be changed (but I'd change it only once, because all the vertical files need to be recompiled), perhaps like this:

  • use / as multisep for usas_cats as you suggested
  • rename usas_full to usas_glosses

Anyway, I am open to suggestions, we can still change this. Btw. to me "usas" seemed better than "sem", because it is more specific. And if people don't know usas is semantics, then they probably won't be able to use these tags anyway. Still, not sure here either, what do you think?

A test corpus with only 3 corpora is available for testing at https://www.clarin.si/ske-beta/#dashboard?corpname=parlamint40_xx_en

One thing that doesn't work, and it is a big shame, is keywords over usas_glosses. I made a covid subcorpus and computed keywords against the complete corpus over usas_cats, which works fine, and over usas_glosses, which returns no results. But the two attributes are isomorphic, i.e. 1 usas_cat corresponds to 1 usas_full. I have no idea why it doesn't work; maybe I need to write to Lexical Computing...

  • use / as multisep for usas_cats as you suggested

I have now discovered that using / inside values is not good, because it is the default noSketch separator.
usas_tags/usas_cats/usas_glosses

  • rename usas_full to usas_glosses

I like usas_glosses; it is more understandable to me.

I have now discovered that using / inside values is not good, because it is the default noSketch separator.

Good point. Won't change it.

I like usas_glosses; it is more understandable to me.

OK, changed in ab73fda.
And sorry, I was working directly on the main branch; I will merge it into devel and switch.

This is now finished. The "/" still conflicts with the noSkE delimiter, as it appears in USAS tags, but I think changing it could cause even more confusion: it is the conjunctive delimiter in USAS, and a different character would confuse people looking at the USAS specs.

Closing.