hdaSprachtechnologie/odenet

Wrong partOfSpeech

Closed this issue · 6 comments

In German, nouns begin with a capital letter. In OdeNet, more than 1800 entries have partOfSpeech "n" and are spelled in lowercase.

POS_errors_n.txt
This is a list of these items.
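A list like this could be regenerated with a short script. The following is only a sketch: it assumes the lexicon is stored as WN-LMF XML, where each `LexicalEntry` holds a `Lemma` with `writtenForm` and `partOfSpeech` attributes. The sample data, the entry ids (`w1`, `w2`, `w3`) and the function name `lowercase_nouns` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample. The WN-LMF layout is an assumption,
# and the entry ids are hypothetical.
SAMPLE = """
<LexicalResource>
  <Lexicon id="odenet">
    <LexicalEntry id="w1">
      <Lemma writtenForm="Reibwert" partOfSpeech="n"/>
      <Sense id="w1_1" synset="odenet-19902-n"/>
    </LexicalEntry>
    <LexicalEntry id="w2">
      <Lemma writtenForm="füßeln" partOfSpeech="n"/>
      <Sense id="w2_1" synset="odenet-28760-n"/>
    </LexicalEntry>
    <LexicalEntry id="w3">
      <Lemma writtenForm="steigen" partOfSpeech="v"/>
      <Sense id="w3_1" synset="odenet-33424-n"/>
    </LexicalEntry>
  </Lexicon>
</LexicalResource>
"""


def lowercase_nouns(root):
    """Return (entry id, written form) for noun lemmas not starting with an uppercase letter."""
    hits = []
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        if lemma is None:
            continue
        form = lemma.get("writtenForm", "")
        # Purely orthographic test: tagged as noun, but first character is not uppercase.
        if lemma.get("partOfSpeech") == "n" and form and not form[0].isupper():
            hits.append((entry.get("id"), form))
    return hits


suspects = lowercase_nouns(ET.fromstring(SAMPLE))
print(suspects)
```

The check is purely orthographic, so every hit still needs a manual look before anything is changed.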

If I get a list of synsets and correct POS, I can automatically correct all of them.
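Such an automatic correction might look roughly like the sketch below. It assumes WN-LMF XML in which both `Synset` elements and the `Lemma` of each `LexicalEntry` carry a `partOfSpeech` attribute, and it assumes the correction list arrives as a mapping from synset id to corrected POS. The mapping `CORRECTIONS`, the entry ids and the function `apply_corrections` are hypothetical. Synset ids are left untouched, so the `-n` suffix stays as it is.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; layout and entry ids are assumptions.
SAMPLE = """
<Lexicon id="odenet">
  <LexicalEntry id="w1">
    <Lemma writtenForm="füßeln" partOfSpeech="n"/>
    <Sense id="w1_1" synset="odenet-28760-n"/>
  </LexicalEntry>
  <LexicalEntry id="w2">
    <Lemma writtenForm="Reibwert" partOfSpeech="n"/>
    <Sense id="w2_1" synset="odenet-19902-n"/>
  </LexicalEntry>
  <Synset id="odenet-28760-n" partOfSpeech="n"/>
  <Synset id="odenet-19902-n" partOfSpeech="n"/>
</Lexicon>
"""

# Hypothetical correction list: synset id -> corrected part of speech.
CORRECTIONS = {"odenet-28760-n": "v"}


def apply_corrections(root, corrections):
    """Set partOfSpeech on listed synsets and on lemmas of entries linked to them.

    Returns the number of attributes that were changed.
    """
    changed = 0
    for synset in root.iter("Synset"):
        new_pos = corrections.get(synset.get("id"))
        if new_pos and synset.get("partOfSpeech") != new_pos:
            synset.set("partOfSpeech", new_pos)
            changed += 1
    for entry in root.iter("LexicalEntry"):
        lemma = entry.find("Lemma")
        if lemma is None:
            continue
        for sense in entry.iter("Sense"):
            new_pos = corrections.get(sense.get("synset"))
            if new_pos and lemma.get("partOfSpeech") != new_pos:
                lemma.set("partOfSpeech", new_pos)
                changed += 1
                break  # one update per entry is enough
    return changed


root = ET.fromstring(SAMPLE)
n_changed = apply_corrections(root, CORRECTIONS)
print(n_changed)
```

The sketch treats an entry as having a single POS. An entry whose senses belong to synsets with different corrected POS values would need to be split by hand.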

I am working on it.

Wrong partOfSpeech Proposal 1.xlsx

I have manually checked the entries and synsets. Since I am not sure how to handle multiword expressions containing nouns, my proposal only covers entries and synsets that are written entirely in lowercase, i.e. that contain no capitalized words.

The column "Merge with" marks entries that should be merged into an existing entry with this PoS.

Recently emptied synsets.xlsx

The file contains 52 synsets that one of the recent commits left without any links to entries; their former entries are orphaned as well.

I have to correct myself. The entries still point to the synsets in question (and I am glad they do).
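Whether a synset is still referenced can be checked mechanically. The sketch below assumes WN-LMF XML in which each `Sense` points to its synset through a `synset` attribute. The sample ids (`example-1-n`, `example-2-n`, `w1`) and the function name `unreferenced_synsets` are invented for illustration and do not come from OdeNet.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one synset is referenced by a sense, the other is not.
SAMPLE = """
<Lexicon id="example">
  <LexicalEntry id="w1">
    <Lemma writtenForm="Beispiel" partOfSpeech="n"/>
    <Sense id="w1_1" synset="example-1-n"/>
  </LexicalEntry>
  <Synset id="example-1-n" partOfSpeech="n"/>
  <Synset id="example-2-n" partOfSpeech="n"/>
</Lexicon>
"""


def unreferenced_synsets(root):
    """Return ids of synsets that no Sense element points to."""
    referenced = {sense.get("synset") for sense in root.iter("Sense")}
    return [s.get("id") for s in root.iter("Synset") if s.get("id") not in referenced]


empty = unreferenced_synsets(ET.fromstring(SAMPLE))
print(empty)
```

Running the same check in the other direction (senses whose `synset` id has no matching `Synset` element) would reveal orphaned entries.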

Some of them are OK as nouns (proper nouns), even though they are spelled in lowercase:

  • odenet-14664-n die tageszeitung, taz
  • odenet-4541-n dpa, Deutsche Presse-Agentur

Nouns and their abbreviations in lowercase:

  • odenet-19902-n Reibwert, μ
  • odenet-28457-n Kubikdezimeter, dm³

Synsets like these should be coded as verbs IMO:

  • odenet-28760-n füßeln, sich mit den Füßen berühren
  • odenet-33424-n steigen, teurer werden, im Preis steigen, sich verteuern
  • odenet-36014-n zu weit schlagen, ins Aus schlagen, verschlagen

odenet-28760-n is an example of a pattern that is quite frequent in OpenThesaurus: an entry (“füßeln”) accompanied by one or more synonymous phrases (paraphrases, multiword expressions).

Here is another example:

  • odenet-32109-n Tesafilm, durchsichtiges Klebeband, tesa

Compare this with the corresponding entry in Open English Wordnet (https://en-word.net/lemma/Sellotape):

oewn-02996250-n (Interlingual Index: i51697)
(n) cellulose tape, Scotch tape, Sellotape
Definition: transparent or semitransparent adhesive tape

In Open English Wordnet, a multiword expression cannot be an entry unless its meaning differs from its literal meaning. If we adopt this rule, phrases like "durchsichtiges Klebeband" and "sich mit den Füßen berühren" would have to be deleted.

Any opinions on that?
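To gauge how many entries such a rule would touch, the multiword lemmas could be listed first. The following is a minimal sketch, assuming the synsets are available as a mapping from synset id to lemma list (here filled with examples quoted in this thread). The function name `multiword_candidates` is hypothetical. Whether a phrase is literal cannot be decided by the script, so it only produces candidates for review.

```python
# Synsets quoted in this thread, as synset id -> list of lemmas.
SYNSETS = {
    "odenet-28760-n": ["füßeln", "sich mit den Füßen berühren"],
    "odenet-32109-n": ["Tesafilm", "durchsichtiges Klebeband", "tesa"],
    "odenet-14664-n": ["die tageszeitung", "taz"],
    "odenet-19902-n": ["Reibwert", "μ"],
}


def multiword_candidates(synsets):
    """Return, per synset, the lemmas that consist of more than one token."""
    candidates = {}
    for synset_id, lemmas in synsets.items():
        multiword = [lemma for lemma in lemmas if len(lemma.split()) > 1]
        if multiword:
            candidates[synset_id] = multiword
    return candidates


candidates = multiword_candidates(SYNSETS)
for synset_id, lemmas in candidates.items():
    print(synset_id, lemmas)
```

Note that "die tageszeitung" is flagged as well although it is a proper name, which shows why the result can only serve as input for a manual decision.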