hdaSprachtechnologie/odenet

Excessive numbers of hypernyms, ILI reuse

Closed this issue · 17 comments

I'm investigating an issue (goodmami/wn#88) where the Wn library is very slow to enumerate the hypernym paths of an OdeNet synset. Here's an example:

>>> import wn
>>> de = wn.Wordnet(lexicon='odenet')
>>> de.synsets('Fotografie')  # this looks fine
[Synset('odenet-4634-n'), Synset('odenet-23014-n')]
>>> foto = de.synsets('Fotografie')[0]
>>> foto.hypernym_paths()  # this takes a long time; I killed it before it finished

While there are things I should do to mitigate the issue through coding improvements, the data also looks suspect:

>>> len(foto.hypernyms())
20
>>> for hyp in foto.hypernyms():
...   print(hyp, hyp.lemmas())
... 
Synset('odenet-10049-n') ['Darstellung', 'Abbildung', 'Illustration', 'Spiegelbild', 'Visualisierung', 'Abbild', 'Wiedergabe']
Synset('odenet-10060-n') ['Abzug', 'Aufbruch', 'Start', 'Abreise', 'Fahrtbeginn', 'Abfahrt']
Synset('odenet-4405-n') ['Erkenntnis', 'Verständnis', 'Klarsicht', 'Vergegenwärtigung', 'Intellekt', 'Erleuchtung', 'Bewusstsein']
Synset('odenet-27909-n') ['Anbeginn', 'Geburtsstunde', 'früheste Anfänge', 'erste Anfänge', 'erster Anfang']
Synset('odenet-33753-v') ['auslösen', 'anregen', 'nachfragen', 'einen Anstoß geben', 'antriggern', 'triggern', 'anfragen', 'einen Impuls geben']
Synset('odenet-18486-n') ['Abzug', 'Entlüfter', 'Klimaanlage', 'Dunstabzug']
Synset('odenet-8673-n') ['Abzug', 'Subtraktion']
Synset('odenet-5225-n') ['Schaufensterauslage', 'Auslage']
Synset('odenet-657-n') ['Abzug', 'Abgaskanal', 'Esse', 'Schornstein', 'Rauchfang', 'Schlot', 'Abzugsrohr']
Synset('odenet-19436-n') ['Auslöser', 'Signalreiz', 'Schlüsselreiz', 'Wahrnehmungssignal']
Synset('odenet-7443-n') ['Auslöser', 'Steuerelement', 'Stellglied']
Synset('odenet-7230-n') ['Auslöser', 'Triggerereignis', 'Auslösemechanismus', 'Trigger']
Synset('odenet-27908-n') ['Beginn', 'Anbruch', 'Start', 'Anfang']
Synset('odenet-4634-n') ['Bild', 'Abzug', 'Bildnis', 'Vergrößerung', 'Abbildung', 'Aufnahme', 'Bildwerk', 'Positiv', 'Ablichtung', 'Ausbelichtung', 'Fotografie', 'Lichtbild', 'Foto', 'Photo']
Synset('odenet-10372-n') ['Auslöser', 'Zündvorrichtung', 'Zünder']
Synset('odenet-27000-n') ['Visualisierung', 'Veranschaulichung']
Synset('odenet-20193-n') ['Hebebaum', 'Wuchtbaum']
Synset('odenet-4422-n') ['Geburt', 'Eröffnung', 'Kick-off-Veranstaltung', 'An...', 'Einstiegs...', 'Auftakt...', 'Anbeginn', 'Anfangs...', 'Auftakt', 'erste Schritt', 'Einsteiger...', 'Startschuss', 'Einstieg', 'Erst...', 'Antritts...', 'Anspiel', 'Take-off', 'Takeoff', 'Aufgalopp', 'Start...', 'Aufbruch', 'Eröffnungs...']
Synset('odenet-26999-n') ['Datenvisualisierung', 'Informationsvisualisierung']
Synset('odenet-11520-n') ['Gerät', 'Anlage', 'Apparatur', 'Maschine', 'Apparat', 'Aggregat', 'Automat']

While we might expect 20 or more hyponyms, it's strange to see this many hypernyms. In contrast, the highest number of hypernyms for a synset in the EWN is much lower:

>>> ewn = wn.Wordnet(lexicon='ewn:2020')
>>> max(len(ss.hypernyms()) for ss in ewn.synsets())
6

Also, there should generally be just one synset with a given ILI in a lexicon. If there are more, they should be merged into the same synset. We don't see that in OdeNet:

>>> foto.ili
'i54514'
>>> len(de.synsets(ili=foto.ili))
10
>>> len(ewn.synsets(ili=foto.ili))
1

Except for foto (odenet-4634-n), the other 9 are a subset of the 20 hypernyms listed above. A synset should not share an ILI with its own hypernyms. Also, the definitions of the synset and the corresponding EWN synset are different:

>>> ewn.synsets(ili=foto.ili)[0].definition()
'lever that activates the firing mechanism of a gun'
>>> foto.definition()
'Mit einem Fotoapparat gemachte Abbildung.'

I suspect some of these differences stem from the automatic way the data was created, but some things (like the high number of hypernyms and ILI sharing) should be validated against, I would think.

This might be informative. It is the distribution of shared ILIs. In each row, the second column is a number of synsets sharing one ILI, and the first column is the number of unique ILIs shared by exactly that many synsets. The outlier (1 ILI shared by 16,376 synsets) is ili="". For example, the second line means that there are 6 ILIs that are each associated with 10 different synsets (i54514 mentioned above is one of these).

$ grep -o 'ili="[^"]*"' deWordNet.xml | sort | uniq -c | sed 's/ *\([0-9]\+\).*/\1/' | sort | uniq -c | sort -hk1
      1 16376
      6 10
      8 9
     10 8
     27 7
     45 6
    134 5
    320 4
    741 3
   2412 2
  10115 1
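For readers who prefer Python to the shell pipeline, here is a minimal sketch of the same count-of-counts computation. It mirrors the grep by matching every ili="..." attribute in the text; the function name and the sample lines are made up for illustration and are not taken from deWordNet.xml.

```python
import re
from collections import Counter

def ili_sharing_distribution(xml_text):
    """Map 'number of synsets sharing an ILI' to 'number of ILIs with that count'."""
    # Like the grep above, this matches every ili="..." attribute, including ili=""
    ilis = re.findall(r'ili="([^"]*)"', xml_text)
    synsets_per_ili = Counter(ilis)           # ILI -> number of synsets using it
    return Counter(synsets_per_ili.values())  # group size -> number of ILIs

# Hypothetical sample, not real OdeNet data
sample = '''
<Synset id="demo-1-n" ili="i1" partOfSpeech="n"/>
<Synset id="demo-2-n" ili="i1" partOfSpeech="n"/>
<Synset id="demo-3-n" ili="i2" partOfSpeech="n"/>
<Synset id="demo-4-n" ili="" partOfSpeech="n"/>
<Synset id="demo-5-n" ili="" partOfSpeech="n"/>
<Synset id="demo-6-n" ili="" partOfSpeech="n"/>
'''

dist = ili_sharing_distribution(sample)
for group_size, n_ilis in sorted(dist.items(), reverse=True):
    print(n_ilis, group_size)
```

As in the shell output, the empty ILI shows up as a single "ILI" with a large group size and should be read separately from the genuinely shared ones.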

Thanks @goodmami. I agree that these are undesirable, and we should check for them in the odenet build process. OMW does check them, but we did not upload the latest version, ...

Because there are too many to check by hand quickly, maybe it is worth picking one semi-automatically (maybe the hypernym if there is a hyponym-hypernym pair) and setting ili='' for the others (with maybe a comment giving the ILI so it can be checked later). @hdaSprachtechnologie, what do you think?

Thank you @goodmami !
I'd need some more information about the hypernyms:
The entry for (odenet-4634-n) mentions only one hypernym:

Mit einem Fotoapparat gemachte Abbildung.

This is the hypernym to ['Darstellung', 'Abbildung', 'Illustration', 'Spiegelbild', 'Visualisierung', 'Abbild', 'Wiedergabe'], which is fine.
There is one more synset that has a hyponym relation to (odenet-4634-n): odenet-20193-n. This is a wrong relation and should be deleted.
Where do all the other relations come from that you mention?

The duplicated ILI links are a different problem, I think. I have implemented a test for duplicated ILIs and also found that there are many. The cause is the automatic translation of synset lemmas, but I have no good idea how to solve this problem.

For automatically setting multiple ili links to "", I would like to suggest not to do this, when the entry has a confidenceScore="1.0", because these entries are manually checked and expanded already.

Where do all the other relations come from that you mention?

As I dig further, the hypernym issue is due to a bug in Wn where the ILI-based search for the interlingual setting was being overapplied to the monolingual setting. The underlying problem in OdeNet is the shared ILI issue. That is, if you consider all the synsets that share the same ILI as odenet-4634-n, they collectively have 20 hypernyms.
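To make the effect concrete, here is a toy illustration of how looking up relations through the ILI, rather than through the synset itself, inflates the hypernym count once several synsets share one ILI. This is not Wn's actual code; the synset IDs, the two dictionaries, and both function names are hypothetical.

```python
# Toy data (hypothetical IDs): three synsets share ILI 'i1'
ili_of = {'s1': 'i1', 's2': 'i1', 's3': 'i1'}
hypernyms_of = {
    's1': ['h1'],
    's2': ['h2', 'h3'],
    's3': ['h1', 'h4'],
}

def monolingual_hypernyms(ssid):
    """Hypernyms recorded directly on the synset itself."""
    return list(hypernyms_of.get(ssid, []))

def ili_expanded_hypernyms(ssid):
    """Hypernyms of every synset sharing this synset's ILI (the over-applied lookup)."""
    ili = ili_of[ssid]
    result = []
    for other, other_ili in ili_of.items():
        if other_ili == ili:
            for hyp in hypernyms_of.get(other, []):
                if hyp not in result:  # keep first occurrence only
                    result.append(hyp)
    return result

print(monolingual_hypernyms('s1'))
print(ili_expanded_hypernyms('s1'))
```

With one synset per ILI the two lookups return the same list, which is why the problem only surfaces on data with shared ILIs.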

For automatically setting multiple ili links to "", I would like to suggest not to do this, when the entry has a confidenceScore="1.0", because these entries are manually checked and expanded already.

I agree. There are 682 synsets with a non-empty ILI and confidenceScore="1.0", and there don't appear to be any shared ILIs within them.

I agree. There are 682 synsets with a non-empty ILI and confidenceScore="1.0", and there don't appear to be any shared ILIs within them.

Actually I was incorrect; I didn't sort the results properly and missed some. Here are the numbers:

$ grep 'confidenceScore="1.0"' deWordNet.xml | grep -o 'ili="[^"]*"' | sort | uniq -c | sed 's/ *\([0-9]\+\).*/\1/' | sort -n | uniq -c
    659 1
     20 2
      2 3
      1 172

That is, among those with confidenceScore="1.0", 659 unique ILIs are linked 1-to-1 to synsets, 20 are linked to 2 synsets (i.e., 40 synsets total), 2 ILIs are linked to 3 synsets, and the empty ILI (ili="") appears 172 times (concepts not yet in ILI, rather than simply unlinked, I suppose?).

Those 22 shared ILIs deserve extra scrutiny but I'm happy to not change the 659 others. Here are the synsets, definitions, and lemmas for the 2 ILIs with 3 synsets:

>>> import wn
>>> de = wn.Wordnet(lang='de', expand='')
>>> for ss in de.synsets(ili='i80424'):
...   print(f'{ss}\n  {ss.definition()}\n  {ss.lemmas()}')
... 
Synset('odenet-26852-n')
  inoffizielle Vereinigung von Personen
  ['Gruppe', 'Schar', 'Kreis', 'Grüppchen', 'Trupp', 'Runde', 'Pulk', 'Versammlung', 'Kränzchen']
Synset('odenet-7995-n')
  inoffizielle Vereinigung von Menschen oder Gruppen
  ['Gruppe', 'Gesellschaft', 'Körperschaft']
Synset('odenet-9865-n')
  Zusammenschluss von mehreren Personen zur Lösung einer bestimmten Aufgabe oder zur Erreichung eines bestimmten Zieles
  ['Gruppe', 'Kollektiv', 'Team']
>>> 
>>> for ss in de.synsets(ili='i89349'):
...   print(f'{ss}\n  {ss.definition()}\n  {ss.lemmas()}')
... 
Synset('odenet-2272-n')
  jemand, der für Waren oder Services bezahlt
  ['Kunde', 'Auftraggeber', 'Besteller', 'Kundin']
Synset('odenet-2688-n')
  Jemand, der für Waren oder Dienstleistungen bezahlt.
  ['Abnehmer', 'Verbraucher', 'Konsument']
Synset('odenet-8875-n')
  Person, Gruppe oder Institution, die das von einem anderen Gelieferte oder Produzierte gegen eine Gegenleistung annimmt
  ['Kunde', 'Abnehmer', 'Bezieher', 'Käufer', 'Kundin']

Do these look like they could be combined?

I am planning to implement a semi-manual process for changing these. But I will only be able to work on it after the 12th February.
The idea is that the user gets a duplicated ILI, all the synsets that use it, and the English synset that is connected with it. Then s/he can choose which synset keeps the ILI link. All the other ILIs will then automatically be set to "".
We might also want to look at the POS information. I assume that the link is better, when the synset has the same POS as the English synset.
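A rough sketch of what such a semi-manual step might look like, assuming each candidate is represented by its synset ID, ILI, and POS. The Candidate class, both function names, and the demo IDs are hypothetical; the POS ordering only suggests which candidate to show first and does not decide anything.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    id: str
    ili: str
    pos: str

def rank_candidates(group, english_pos):
    """Order candidates so that those matching the English synset's POS come first."""
    # sorted() is stable, so candidates with equal keys keep their original order
    return sorted(group, key=lambda c: c.pos != english_pos)

def keep_only(group, keep_id):
    """Clear the ILI on every candidate except the chosen one; return the cleared IDs."""
    cleared = []
    for cand in group:
        if cand.id != keep_id:
            cand.ili = ''
            cleared.append(cand.id)
    return cleared

# Hypothetical group of synsets sharing one ILI
group = [
    Candidate('demo-1-a', 'i100', 'a'),
    Candidate('demo-2-v', 'i100', 'v'),
    Candidate('demo-3-v', 'i100', 'v'),
]
ranked = rank_candidates(group, english_pos='v')
print([c.id for c in ranked])           # shown to the user in this order
cleared = keep_only(group, 'demo-2-v')  # the user's choice
```

Returning the cleared IDs makes it easy to log what was changed, so the removed links can be reviewed later.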

I am planning to implement a semi-manual process for changing these. But I will only be able to work on it after the 12th February.

No worries. I'm not in any particular rush.

We might also want to look at the POS information. I assume that the link is better, when the synset has the same POS as the English synset.

Like Francis said, this is most likely a good thing, but we still need to take care, as word categories like POS don't always work the same way across languages. For example, some English adjectives (like dry) are best translated as verbs in Japanese (乾く "to become dry", etc.). I imagine there'd be similar category shifts between English and German.

Furthermore, it's not a big problem in the hand-checked subset of OdeNet. Among the (non-empty) shared ILIs where confidenceScore="1.0", there is only one where the synsets differ in POS:

>>> import wn
>>> de = wn.Wordnet(lang='de', expand='')
>>> ilimap = {}
>>> for ss in de.synsets():
...     if ss.ili and ss.metadata().get('confidence') == 1.0:
...         if ss.ili in ilimap and ss.pos != ilimap[ss.ili].pos:
...             print(ss.ili, ss, ilimap[ss.ili])
...         else:
...             ilimap[ss.ili] = ss
... 
i35062 Synset('odenet-16451-a') Synset('odenet-10050-v')

Here are the words, but I don't speak German so I cannot judge if they are the same concept or not:

>>> de.synset('odenet-16451-a').lemmas()
['genug', 'Das reicht', 'Ende der Durchsage', 'Aus die Maus.', 'Halt ein', '(und damit) basta', 'Aufhören', '(dann ist bei mir) Feierabend', 'Ende, aus, Nikolaus.', 'Jetzt ist Sense', 'Schluss mit lustig', 'Ende und aus', 'es reicht', 'Klappe zu, Affe tot.', 'Kein Kommentar', 'Rien ne va plus.', 'genug davon', 'Punktum', 'genug jetzt', 'Mehr habe ich dem nicht hinzuzufügen.', 'genug damit', 'genug ist genug', 'Ende, aus, Mickymaus.', 'Ende im Gelände', 'Thema durch.', "und damit hat sich's", 'Es langt', 'Schluss, aus, Ende', "Jetzt reicht's", 'es reicht', 'aus']
>>> de.synset('odenet-10050-v').lemmas()
['aufgeben', 'beenden', 'ad acta legen', 'vergessen', 'stoppen', 'zu Grabe tragen', 'beerdigen', 'einstampfen', 'über Bord werfen', '(davon) Abschied nehmen', 'sich verabschieden von', 'fallenlassen', 'sausen lassen', 'sich abwenden von', 'Schlussstrich ziehen', 'fallen lassen', 'sterben lassen', 'den Rücken kehren', 'ablassen von', 'hinter sich lassen', 'an den Nagel hängen']

To compare, English has a verbal synset for this ILI:

>>> ewn.synsets(ili='i35062')
[Synset('ewn-02686624-v')]
>>> ewn.synsets(ili='i35062')[0].lemmas()
['quit', 'lay off', 'give up', 'stop', 'cease', 'discontinue']

I hope this helps!

Could you possibly give me a list of the 22?

Here are the synset IDs, ILIs, and parts of speech:

$ grep 'confidenceScore="1.0"' deWordNet.xml \
  | grep -o 'id="[^"]\+" ili="[^"]\+" partOfSpeech="."' \
  | sort -k2 -V \
  | uniq -f1 -w15 --all-repeated=separate
id="odenet-3713-a" ili="i4151" partOfSpeech="a"
id="odenet-8844-a" ili="i4151" partOfSpeech="a"

id="odenet-10399-a" ili="i5669" partOfSpeech="a"
id="odenet-142-a" ili="i5669" partOfSpeech="a"

id="odenet-14736-a" ili="i8857" partOfSpeech="a"
id="odenet-24456-a" ili="i8857" partOfSpeech="a"

id="odenet-36192-a" ili="i12152" partOfSpeech="a"
id="odenet-362280-a" ili="i12152" partOfSpeech="a"

id="odenet-362257-a" ili="i12201" partOfSpeech="a"
id="odenet-362258-a" ili="i12201" partOfSpeech="a"

id="odenet-3843-a" ili="i13844" partOfSpeech="a"
id="odenet-6352-a" ili="i13844" partOfSpeech="a"

id="odenet-1470-a" ili="i15257" partOfSpeech="a"
id="odenet-9242-a" ili="i15257" partOfSpeech="a"

id="odenet-354-v" ili="i26035" partOfSpeech="v"
id="odenet-4312-v" ili="i26035" partOfSpeech="v"

id="odenet-16451-a" ili="i35062" partOfSpeech="a"
id="odenet-10050-v" ili="i35062" partOfSpeech="v"

id="odenet-4330-n" ili="i35594" partOfSpeech="n"
id="odenet-688-n" ili="i35594" partOfSpeech="n"

id="odenet-12841-n" ili="i39275" partOfSpeech="n"
id="odenet-8132-n" ili="i39275" partOfSpeech="n"

id="odenet-15009-n" ili="i69097" partOfSpeech="n"
id="odenet-2806-n" ili="i69097" partOfSpeech="n"

id="odenet-4777-n" ili="i71948" partOfSpeech="n"
id="odenet-4850-n" ili="i71948" partOfSpeech="n"

id="odenet-16895-n" ili="i74480" partOfSpeech="n"
id="odenet-26646-n" ili="i74480" partOfSpeech="n"

id="odenet-20796-n" ili="i80120" partOfSpeech="n"
id="odenet-4250-n" ili="i80120" partOfSpeech="n"

id="odenet-26852-n" ili="i80424" partOfSpeech="n"
id="odenet-7995-n" ili="i80424" partOfSpeech="n"
id="odenet-9865-n" ili="i80424" partOfSpeech="n"

id="odenet-10454-n" ili="i81138" partOfSpeech="n"
id="odenet-12371-n" ili="i81138" partOfSpeech="n"

id="odenet-1768-n" ili="i89184" partOfSpeech="n"
id="odenet-18-n" ili="i89184" partOfSpeech="n"

id="odenet-2272-n" ili="i89349" partOfSpeech="n"
id="odenet-2688-n" ili="i89349" partOfSpeech="n"
id="odenet-8875-n" ili="i89349" partOfSpeech="n"

id="odenet-10537-n" ili="i93450" partOfSpeech="n"
id="odenet-3678-n" ili="i93450" partOfSpeech="n"

id="odenet-15939-n" ili="i110170" partOfSpeech="n"
id="odenet-31257-n" ili="i110170" partOfSpeech="n"

id="odenet-313-n" ili="i112780" partOfSpeech="n"
id="odenet-6987-n" ili="i112780" partOfSpeech="n"

Breaking that down:

  • grep 'confidenceScore="1.0"' deWordNet.xml -- find hand-checked lines
  • grep -o 'id="[^"]\+" ili="[^"]\+" partOfSpeech="."' -- only print synset ID, ILI (non-empty only), and POS
  • sort -k2 -V -- sort by ILIs (2nd column) numerically
  • uniq -f1 -w15 --all-repeated=separate -- print duplicate lines (ignore the first field (ID), only compare up to 15 characters, separate duplicate groups with a blank line)
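The same grouping can be sketched in Python, which may be easier to extend than the pipeline. Like the grep, this assumes that the id, ili, and partOfSpeech attributes appear in that order and that confidenceScore sits on the same line as the synset's opening tag; the function names and sample lines are hypothetical.

```python
import re
from collections import defaultdict

# Same pattern as the grep: non-empty id and ILI, one-character POS
SYNSET_RE = re.compile(r'id="([^"]+)" ili="([^"]+)" partOfSpeech="(.)"')

def ili_number(ili):
    """Numeric part of an ILI like 'i4151', used for sorting (-1 if there is none)."""
    digits = re.sub(r'\D', '', ili)
    return int(digits) if digits else -1

def shared_ili_groups(lines, hand_checked_only=True):
    """Return {ili: [(synset_id, pos), ...]} for ILIs used by more than one synset."""
    groups = defaultdict(list)
    for line in lines:
        if hand_checked_only and 'confidenceScore="1.0"' not in line:
            continue
        match = SYNSET_RE.search(line)
        if match:
            ssid, ili, pos = match.groups()
            groups[ili].append((ssid, pos))
    return {
        ili: members
        for ili, members in sorted(groups.items(), key=lambda kv: ili_number(kv[0]))
        if len(members) > 1
    }

# Hypothetical sample lines, not real OdeNet entries
lines = [
    '<Synset id="demo-1-a" ili="i4151" partOfSpeech="a" confidenceScore="1.0">',
    '<Synset id="demo-2-a" ili="i4151" partOfSpeech="a" confidenceScore="1.0">',
    '<Synset id="demo-3-n" ili="i64" partOfSpeech="n" confidenceScore="1.0">',
    '<Synset id="demo-4-n" ili="i64" partOfSpeech="n" confidenceScore="0.8">',
    '<Synset id="demo-5-n" ili="" partOfSpeech="n" confidenceScore="1.0">',
]
checked = shared_ili_groups(lines)
everything = shared_ili_groups(lines, hand_checked_only=False)
```

Passing hand_checked_only=False corresponds to dropping the first grep from the pipeline.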

I have solved these now, at least. There doesn't seem to be a good automatic solution, as there were different reasons for the duplication of ili.

Great, once you commit and push the changes I can test it out. Also, these are just the hand-checked tip of the iceberg. You can get rid of the first grep command to see the full list of ~3700 synset groups sharing an ILI:

$ grep -o 'id="[^"]\+" ili="[^"]\+" partOfSpeech="."' deWordNet.xml \
  | sort -k2 -V \
  | uniq -f1 -w15 --all-repeated=separate

You can use this to create a worklist. To illustrate, here are the first 4 groups:

id="odenet-390-a" ili="i29" partOfSpeech="a"
id="odenet-12054-n" ili="i29" partOfSpeech="n"

id="odenet-22144-a" ili="i64" partOfSpeech="a"
id="odenet-70-a" ili="i64" partOfSpeech="a"
id="odenet-7271-a" ili="i64" partOfSpeech="a"
id="odenet-915-a" ili="i64" partOfSpeech="a"

id="odenet-10242-a" ili="i74" partOfSpeech="a"
id="odenet-25583-a" ili="i74" partOfSpeech="a"

id="odenet-18462-a" ili="i147" partOfSpeech="a"
id="odenet-5769-a" ili="i147" partOfSpeech="a"

It took me two hours to work on the 22 (already committed and pushed). There is no chance that I can resolve the other 3700 manually in the near future.

already committed and pushed

Oh, my mistake. I was pulling from my fork. I see the commit now. Thanks!

I tried adding the latest version and I get integrity errors with the synset relations. That is, some synset relations are targeting synsets that no longer exist. Here's how I found them using Wn:

>>> import wn.lmf
>>> odenet = wn.lmf.load('../odenet/odenet/wordnet/deWordNet.xml')[0]
>>> ssids = {ss.id for ss in odenet.synsets}
>>> reltgts = {rel.target for ss in odenet.synsets for rel in ss.relations}
>>> reltgts - ssids  # set difference of synset relation targets and synset IDs
{'odenet-4626-n', 'odenet-25670-n'}

And here they are:

$ grep -C1 'target=.odenet-4626-n.' deWordNet.xml
<Synset id="odenet-362423-n" ili="" partOfSpeech="n">    
	<SynsetRelation target='odenet-4626-n' relType='hypernym'/>
</Synset>
$ grep -C1 'target=.odenet-25670-n.' deWordNet.xml
<Synset id="odenet-362432-n" ili="" partOfSpeech="n">    
	<SynsetRelation target='odenet-25670-n' relType='hypernym'/>
</Synset>
$ grep 'id=.odenet-4626-n.' deWordNet.xml  # confirm there's no synset for these
$ grep 'id=.odenet-25670-n.' deWordNet.xml

Maybe those targets need to point to the other synset that was sharing the ILI before?

There is no chance that I can resolve the other 3700 manually in the near future.

No, of course not. I didn't expect you to manually check them. I provided the command to generate the list that can be used for some future effort, perhaps one that is partially automated or crowd-sourced. Maybe the kinds of errors you noticed while fixing those 22 will help inform the next stage.

I have now deleted the relations to non-existing synsets and added the test into the validation process before submission.
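For reference, a check of this kind can be written with only the standard library. This is a minimal sketch, not the test that was added to the OdeNet validation process; the function name and the sample document are hypothetical, and the sample omits the DOCTYPE and namespaced attributes that a real WN-LMF file has.

```python
import xml.etree.ElementTree as ET

def dangling_relation_targets(xml_text):
    """Return the set of SynsetRelation targets that match no Synset id."""
    root = ET.fromstring(xml_text)
    synset_ids = {ss.get('id') for ss in root.iter('Synset')}
    targets = {rel.get('target') for rel in root.iter('SynsetRelation')}
    return targets - synset_ids

# Hypothetical miniature lexicon; demo-9-n is referenced but never defined
sample = '''
<LexicalResource>
  <Lexicon id="demo">
    <Synset id="demo-1-n" ili="" partOfSpeech="n">
      <SynsetRelation target="demo-2-n" relType="hypernym"/>
    </Synset>
    <Synset id="demo-2-n" ili="" partOfSpeech="n"/>
    <Synset id="demo-3-n" ili="" partOfSpeech="n">
      <SynsetRelation target="demo-9-n" relType="hypernym"/>
    </Synset>
  </Lexicon>
</LexicalResource>
'''

missing = dangling_relation_targets(sample)
print(missing)
```

An empty result means every relation target resolves, so the check can be used as a pass/fail gate before a release.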

We now have a new version that no longer contains any duplicated ILIs.