UPOS "X"
nschneid opened this issue · 4 comments
The guidelines at https://universaldependencies.org/u/pos/X.html say it should be used very restrictively.
Setting aside the usage with goeswith dependents, we have:
- In GUM, X corresponds to XPOS FW or LS: https://universal.grew.fr/?custom=65184da9dbe01. There are also a few FW lexemes that are not X, mainly borrowed Latin abbreviations: https://universal.grew.fr/?custom=65184e2eded21
  - regarding LS, https://universaldependencies.org/u/dep/list.html says that list item numbers should be NUM
- In EWT, X corresponds to a variety of things: mainly FW, ADD (URLs and email addresses; would PROPN work for these?), GW (mainly space-separated parts of filenames), NN and NNP within filenames, and AFX affixes like "ex". Some GW parts of filenames have substantive UPOS, as do some FW and AFX words: https://universal.grew.fr/?custom=651850b455da1
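For anyone who wants to reproduce this kind of breakdown locally rather than through Grew queries, here is a minimal sketch that counts the XPOS tags of tokens with UPOS X in CoNLL-U text. The function name `xpos_counts_for_upos` and the sample sentence are made up for illustration (the sample is toy data, not taken from EWT or GUM), and skipping tokens attached as goeswith is only my reading of "setting aside the usage with goeswith dependents".

```python
from collections import Counter


def xpos_counts_for_upos(conllu_text, upos="X", skip_deprels=("goeswith",)):
    """Count XPOS values of tokens with the given UPOS in CoNLL-U text.

    Hypothetical helper: tokens whose DEPREL is in skip_deprels are ignored.
    """
    counts = Counter()
    for line in conllu_text.splitlines():
        if not line or line.startswith("#"):
            continue  # blank sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed CoNLL-U token line
        tok_id, tok_upos, tok_xpos, deprel = cols[0], cols[3], cols[4], cols[7]
        if "-" in tok_id or "." in tok_id:
            continue  # multiword token ranges and empty nodes
        if tok_upos == upos and deprel not in skip_deprels:
            counts[tok_xpos] += 1
    return counts


# Toy input invented for this sketch; the annotation is not meant to be correct.
sample = "\n".join([
    "# text = see http://example.com ex wife fi le",
    "1\tsee\tsee\tVERB\tVB\t_\t0\troot\t_\t_",
    "2\thttp://example.com\thttp://example.com\tX\tADD\t_\t1\tobj\t_\t_",
    "3\tex\tex\tX\tAFX\t_\t4\tamod\t_\t_",
    "4\twife\twife\tNOUN\tNN\t_\t1\tobj\t_\t_",
    "5\tfi\tfile\tNOUN\tNN\t_\t1\tobj\t_\t_",
    "6\tle\tle\tX\tGW\t_\t5\tgoeswith\t_\t_",
])

print(xpos_counts_for_upos(sample))
```

Passing `skip_deprels=()` would include the goeswith parts in the count as well.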
GUM XPOS doesn't use ADD or AFX (these are more recent additions to the PTB tagset). But I see internet addresses under PROPN in GUM, which makes sense linguistically.
I think steps here are:
- Harmonize treatment of LS list markers
- Map EWT ADD to PROPN instead of X, and move guidelines examples from SYM (UniversalDependencies/docs#973)
- Review separated affixes and assign a POS based on the kind of modification, typically ADJ or ADV (#152)
- Come up with a coherent EWT policy for filenames (e.g. flat or goeswith, and what to do about transparent syntax within parts of filenames) (UniversalDependencies/docs#666)
- Clarify UPOS policy for flat:foreign structures (maybe individual words should be X and there should be an ExtPos)
  - UniversalDependencies/docs#1001 clarifies the policy; decided not to use the subtype for English. Need to check whether the policy is implemented consistently.
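The ADD step in the list above is mechanical enough to sketch. The snippet below rewrites UPOS X to PROPN for tokens whose XPOS is ADD and leaves everything else untouched. `remap_upos` is a hypothetical helper, not an existing EWT script, and the input is invented toy data. It deliberately changes only the UPOS column; whether any deprels or features need to change along with it is not addressed here.

```python
def remap_upos(conllu_text, xpos="ADD", old_upos="X", new_upos="PROPN"):
    """Return CoNLL-U text with UPOS rewritten for tokens matching xpos/old_upos.

    Hypothetical helper: only column 4 (UPOS) is modified.
    """
    out = []
    for line in conllu_text.split("\n"):
        cols = line.split("\t")
        is_token = len(cols) == 10 and not line.startswith("#")
        if is_token and cols[3] == old_upos and cols[4] == xpos:
            cols[3] = new_upos
            line = "\t".join(cols)
        out.append(line)  # comments, blank lines, and other tokens pass through
    return "\n".join(out)


# Toy input invented for this sketch.
before = "\n".join([
    "# text = mail me@example.com gracias",
    "1\tmail\tmail\tVERB\tVB\t_\t0\troot\t_\t_",
    "2\tme@example.com\tme@example.com\tX\tADD\t_\t1\tobj\t_\t_",
    "3\tgracias\tgracias\tX\tFW\t_\t1\tdiscourse\t_\t_",
])

after = remap_upos(before)
print(after)
```

The FW token keeps its X tag, since only the ADD mapping is proposed above.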
> list item numbers should be NUM
This is definitely not right, because LS is also the tag for graphical bullets, which are in no way numbers. I'm also not sure that "A1.iii)" is a number; I'd say it's much more of an X. I see some mention of using either PUNCT/punct or SYM/dep for these. In GUM xpos=LS is always attached as dep, and nummod is only used for counting things.
> This is definitely not right, because LS is also the tag for graphical bullets, which are in no way numbers.
https://universaldependencies.org/u/pos/SYM.html says bullets are PUNCT. It seems to be distinguishing them from list item markers with a (quasi)numerical component (i.e., they reflect a position in a sequential ordering of some kind).
I could also imagine thinking of lists as a type of coordination, and these as helping to mark how a list item relates to other items in the list, so CCONJ. But that may be unpopular. :)
I'm not so convinced. I think syntactically there is no difference between numerical, graphical, alphabetical and mixed list item markers. It's all the same kind of orthographic device, and I would like them to have the same analysis. I wouldn't feel too bad about punct, but then we are not allowed to treat them as kinds of numbers morphologically, and in any case it would create an uncomfortable situation where punctuation becomes open-ended.
Tagging them all as SYM, or even splitting them into SYM for non-numerical and NUM for numerical would be OK for me too, but I think they should have the same deprel regardless of what kind of list item marker they are.
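To see how far a treebank currently is from "same deprel regardless of marker type", one could tabulate the (UPOS, DEPREL) pairs assigned to XPOS=LS tokens. This is a rough sketch under the assumption that LS reliably identifies list item markers; `ls_analyses` is a hypothetical name and the two toy sentences are invented, with a deliberately inconsistent annotation.

```python
from collections import Counter


def ls_analyses(conllu_text, xpos="LS"):
    """Count (UPOS, DEPREL) pairs for tokens carrying the given XPOS."""
    pairs = Counter()
    for line in conllu_text.splitlines():
        if line.startswith("#"):
            continue  # comment lines
        cols = line.split("\t")
        if len(cols) == 10 and cols[4] == xpos:
            pairs[(cols[3], cols[7])] += 1
    return pairs


# Toy input invented for this sketch: one numbered marker, one graphical bullet.
ls_sample = "\n".join([
    "# text = 1. Introduction",
    "1\t1.\t1.\tX\tLS\t_\t2\tdep\t_\t_",
    "2\tIntroduction\tintroduction\tNOUN\tNN\t_\t0\troot\t_\t_",
    "",
    "# text = • Methods",
    "1\t•\t•\tPUNCT\tLS\t_\t2\tpunct\t_\t_",
    "2\tMethods\tmethod\tNOUN\tNNS\t_\t0\troot\t_\t_",
])

pairs = ls_analyses(ls_sample)
deprels = {deprel for _upos, deprel in pairs}
print(pairs)
print("single deprel for all LS tokens:", len(deprels) == 1)
```

On real data, more than one entry in `deprels` would point to the markers that still need harmonizing.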