Displaying USAS tags in concordancers
Closed this issue · 5 comments
The conversion of TEI to vertical files should also implement USAS semantics for tokens and MWEs. Ideally:
- the original pymusas tags are present and multivalued (probably MULTISEP ",")
- the categories are given as their glosses
- categories are also multivalued; the separator must never be in glosses, safest (but clumsy) is probably MULTISEP "|"
e.g.
<phr sem_tag="Z1mf,Z3c" sem="Z1: Personal names">
Mr. Mr. Mr. PROPN Number=Sing Z1mf,Z3c Z1: Personal names t1
President President President PROPN Number=Sing Z1mf,Z3c Z1: Personal names t2
<g/>
</phr>
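A minimal Python sketch of the separator constraint above: values are joined into one multivalued field, and any value that already contains the separator is rejected. The function name join_multivalue is hypothetical and is not part of the ParlaMint scripts.

```python
def join_multivalue(values, sep="|"):
    """Join values into one multivalued field.

    Raises ValueError if a value contains the separator, because the
    field could then no longer be split back unambiguously.
    """
    for value in values:
        if sep in value:
            raise ValueError(f"separator {sep!r} occurs in value {value!r}")
    return sep.join(values)


# Values taken from the <phr> example above.
tags = join_multivalue(["Z1mf", "Z3c"], sep=",")
glosses = join_multivalue(["Z1: Personal names"], sep="|")
```

Failing loudly at conversion time seems safer than silently producing a field that splits into the wrong number of values.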
Currently the code for converting TEI to vertical is just a stub, for <phr>
elements:
ParlaMint/Scripts/parlamint2xmlvert.xsl
Lines 146 to 162 in 301752b
and for positional attributes:
ParlaMint/Scripts/parlamint-lib.xsl
Lines 882 to 895 in 301752b
If anybody, esp. @matyaskopp or @perayson, has an opinion on this, I'd be glad to hear it.
I am in favour of keeping separators the same because it is starting to be quite complicated:
- use "," for separating tags
- use "/" inside tags for searching
- if you want to use glosses, then you can introduce a new separator, because it is not present in any data format
So I am suggesting three columns (word candidate):
- sem_all_tags: G1.2/S2mf,I3.1/S2mf,P1/S2mf,A7+,A1.2+
- sem_tag: G1.2/S2
- sem_gloss: Politics|People (not sure about separator and column name)
In the meantime I improved the code for converting USAS to vertical a bit, and this is the current snippet from the registry that gives the names of the attributes and the multi-value separators:
ATTRIBUTE usas_tags {
TYPE "MD_MGD"
LABEL "USAS tags"
MULTIVALUE yes
MULTISEP ","
}
ATTRIBUTE usas_cats {
TYPE "MD_MGD"
LABEL "USAS categories"
MULTIVALUE yes
MULTISEP " "
}
ATTRIBUTE usas_full {
TYPE "MD_MGD"
LABEL "USAS glosses"
MULTIVALUE yes
MULTISEP "|"
}
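To make the effect of these separators concrete, here is a small Python sketch that splits a field value the way the registry above declares it. The MULTISEP table simply mirrors the snippet; the function name split_values is hypothetical, and the sample values are reused from earlier in this thread for illustration only.

```python
# Separators copied from the registry snippet above.
MULTISEP = {
    "usas_tags": ",",
    "usas_cats": " ",
    "usas_full": "|",
}


def split_values(attribute, field):
    """Split one multivalued field into its individual values."""
    return field.split(MULTISEP[attribute])


tag_values = split_values("usas_tags", "G1.2/S2mf,I3.1/S2mf,P1/S2mf,A7+,A1.2+")
gloss_values = split_values("usas_full", "Politics|People")
```

With these inputs, tag_values holds five tags and gloss_values holds the two glosses "Politics" and "People".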
Maybe this should be changed (but I'd change it only once, because all the vertical files need to be recompiled), maybe like this:
- use "/" as multisep for usas_cats, as you suggested
- rename usas_full to usas_glosses
Anyway, I am open to suggestions, we can still change this. Btw. to me "usas" seemed better than "sem", because it is more specific. And if people don't know usas is semantics, then they probably won't be able to use these tags anyway. Still, not sure here either, what do you think?
A test corpus with only 3 corpora is available for testing at https://www.clarin.si/ske-beta/#dashboard?corpname=parlamint40_xx_en
One thing that doesn't work, and it is a big shame, is keywords over usas_glosses. I made a covid subcorpus and computed keywords against the complete corpus over usas_cats, which works fine, and over usas_glosses, which returns no results. But the two attributes are isomorphic, i.e. 1 usas_cat corresponds to 1 usas_full. I have no idea why it doesn't work, maybe I need to write to Lexical Computing...
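One way to rule out a data problem before contacting the vendor would be to check the isomorphism directly on the vertical files. The Python sketch below assumes the two fields have already been extracted per token as (cats, glosses) string pairs and uses the separators from the registry snippet; the function name find_mismatches and the sample rows are hypothetical.

```python
def find_mismatches(rows, cats_sep=" ", glosses_sep="|"):
    """Return the (cats, glosses) pairs whose value counts differ.

    An empty result means every token has exactly as many categories
    as glosses, which is what the isomorphism claim requires.
    """
    return [
        (cats, glosses)
        for cats, glosses in rows
        if len(cats.split(cats_sep)) != len(glosses.split(glosses_sep))
    ]


# Hypothetical sample rows; the second one is deliberately inconsistent.
sample = [("G1.2 S2", "Politics|People"), ("G1.2 S2", "Politics")]
mismatches = find_mismatches(sample)
```

This only compares counts, so it would not catch a category paired with the wrong gloss.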
Now I have discovered that using "/" inside values is not good, because it is the default noSketch separator.
Good point. Won't change it.
I like usas_glosses; it is more understandable to me.
OK, changed in ab73fda.
And, sorry, I was working directly on the main branch; I will merge it into devel and switch.