ulf1/treesimi

Can't run demo "Jaccard Similarity between Dependency Trees"

Closed this issue · 2 comments

The cell where the minhashes are created throws a ValueError.

Screenshot from 2021-11-12 09-55-34

The error seems to occur for sentences that contain contractions of a preposition and an article (e.g. 'im', 'zur'). The CoNLL data has an extra row for the contracted form in addition to the rows for the underlying isolated forms, so the index of a token becomes, for example, (13, '-', 14) (see traceback).

Maybe you can just skip the row with the contracted form?

ulf1 commented

I see

...
((13, '-', 14), None, '_', 'im'),
 (13, 16, 'case', 'in'),
 (14, 16, 'det', 'dem'),
...

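(13, '-', 14) is actually the ID range 13-14, i.e. the row for the multiword token 'im' that spans the words 13 ('in') and 14 ('dem').
That row is additional information that we don't need at this stage.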

https://universaldependencies.org/format.html#words-tokens-and-empty-nodes
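
A minimal sketch of the "skip the row" idea, assuming the rows are plain (id, head, deprel, form) tuples like in the excerpt above. The helper name `drop_range_rows` is hypothetical and not part of treesimi. It keeps only rows whose id is a plain integer, so it would also drop any other row with a non-integer id.

```python
# Sketch only: drop multiword-token range rows before building the tree.
# Assumption: each row is an (id, head, deprel, form) tuple, mirroring the
# excerpt from the traceback. `drop_range_rows` is a hypothetical helper,
# not a treesimi function.

def drop_range_rows(rows):
    """Keep only the rows whose id is a plain integer."""
    return [row for row in rows if isinstance(row[0], int)]


rows = [
    ((13, '-', 14), None, '_', 'im'),  # range row for the contraction
    (13, 16, 'case', 'in'),
    (14, 16, 'det', 'dem'),
]

filtered = drop_range_rows(rows)
print(filtered)
```

The word rows 13 and 14 already carry the head and relation for 'in' and 'dem', so the range row for 'im' should not be needed to build the dependency tree.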