kuhumcst/DanNet

Compatibility with goodmami/wn

hallundbaek opened this issue · 12 comments

I'd like to be able to query DanNet through github.com/goodmami/wn.

As far as I can see, all it requires is a WN-LMF 1.1 XML file available at a stable URL.
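For context, setup on the consumer side would then be roughly the following (a sketch; the URL is hypothetical, and wn.add covers the local-file case):

```python
import wn

# Hypothetical URL; the real one would point at an official DanNet release.
wn.download("https://example.org/dannet-wn-lmf.xml.gz")

# ...or, with a local copy of the file:
wn.add("dannet-wn-lmf.xml.gz")
```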

You might have luck converting the DanNet RDF dataset to WN-LMF using this tool by John McCrae: https://github.com/jmccrae/gwn-scala-api

DanNet is RDF in the Turtle serialisation.

I just tried to run it myself and unfortunately didn't have much luck: jmccrae/gwn-scala-api#23

For the record, I tried to run:

```sh
./gwn -i dannet.ttl -o dannet-lmf.xml -f RDF -t WNLMF --input-rdf-lang TURTLE
```

I gave that a go, and by changing some of the naming in the Lexicon to match this example, I got it to at least convert without errors.

But at that point the output contained only the Lexicon element and none of the data within it.

I also tried running the conversion on the English WordNet .ttl, which could not be converted either.

Since it is by the same author, this led me to conclude that gwn is probably deprecated, given that dogfooding the English WordNet isn't supported.

That led me to open this issue, hoping this would be a simple artifact for you to produce.

I started trying to implement a WN-LMF export this morning: https://github.com/kuhumcst/DanNet/tree/feature/136-wn-lmf

Just so you don't duplicate my efforts: I suspect it'll be done by the weekend. I'll let you know so that you can beta-test the WN-LMF file, and then it'll be part of the official dataset releases from then on.

@hallundbaek Please let me know if this works: dannet-wn-lmf.zip

Great! I did not expect such a short turnaround on this, it is very much appreciated! Thanks!

I tried importing it using wn.add('path/to/file') but it did not initially parse, apparently because goodmami/wn expects the first two lines to be exactly:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE LexicalResource SYSTEM "http://globalwordnet.github.io/schemas/WN-LMF-1.1.dtd">
```

or

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE LexicalResource SYSTEM "http://globalwordnet.github.io/schemas/WN-LMF-1.0.dtd">
```

The first line was missing due to improper XML parsing on their part, and the second due to a missing DOCTYPE declaration.

In any case, with those lines added it started importing the file.
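Concretely, the workaround amounts to something like this (a sketch; file names are placeholders):

```python
import wn

# The two header lines goodmami/wn expects verbatim (WN-LMF 1.1 in this case).
HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE LexicalResource SYSTEM '
    '"http://globalwordnet.github.io/schemas/WN-LMF-1.1.dtd">\n'
)

with open("dannet-wn-lmf.xml", encoding="utf-8") as f:
    body = f.read()

with open("dannet-wn-lmf-fixed.xml", "w", encoding="utf-8") as f:
    f.write(HEADER + body)

wn.add("dannet-wn-lmf-fixed.xml")
```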

Unfortunately it failed when importing the <Synset> elements, specifically when they had a child <SynsetRelation> with a target attribute referring to a synset identifier that did not have its own <Synset> element.

To fix that issue, I compiled all ids referenced in target attributes and a second list of all <Synset> id values. I then took the set difference of the two, which highlights exactly those targets without a corresponding <Synset> element. Using that difference, I removed all of the <SynsetRelation>s that referred to non-existent <Synset>s.
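In sketch form, using ElementTree (file names are placeholders; the real script may have differed):

```python
import xml.etree.ElementTree as ET

tree = ET.parse("dannet-wn-lmf-fixed.xml")  # the file with the header fix above
root = tree.getroot()

# Ids that actually have a <Synset> element vs. every id used as a target.
synset_ids = {ss.get("id") for ss in root.iter("Synset")}
targets = {rel.get("target") for rel in root.iter("SynsetRelation")}
dangling = targets - synset_ids

# Drop every <SynsetRelation> whose target has no corresponding <Synset>.
for synset in root.iter("Synset"):
    for rel in list(synset.findall("SynsetRelation")):
        if rel.get("target") in dangling:
            synset.remove(rel)

# NB: ElementTree drops the DOCTYPE on write, so it has to be re-added
# afterwards for goodmami/wn to accept the file.
tree.write("dannet-wn-lmf-clean.xml", encoding="UTF-8", xml_declaration=True)
```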

At that point I got the XML file to import! And judging from a few queries through the Python interface, it seems to work!
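The queries were along these lines (a sketch; "hund" is just an illustrative lemma, and I'm assuming "dannet" is the lexicon id in the file):

```python
import wn

dn = wn.Wordnet(lexicon="dannet")
for ss in dn.synsets("hund"):
    print(ss.id, ss.definition())
    print("  hypernyms:", [hyp.lemmas() for hyp in ss.hypernyms()])
```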

I've uploaded the XML file that I got to import, alongside the list of synset ids that were only referenced in <SynsetRelation> targets.

dannet-goodmami-wn-compat.xml.gz
unrefd-syns.txt

Ideally, though, the missing referenced <Synset>s would be nice to have! Testing a few of them on https://wordnet.dk/dannet/data/<id> showed that they do exist, indicating that they somehow get lost on export to WN-LMF.
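For reference, the spot-check can be scripted along these lines (a sketch using the requests package; it assumes the ids map directly onto the /dannet/data/ path, which may require stripping a prefix first):

```python
import requests

# Spot-check a handful of the ids from unrefd-syns.txt against the web service.
with open("unrefd-syns.txt") as f:
    ids = [line.strip() for line in f if line.strip()][:5]

for synset_id in ids:
    resp = requests.get(f"https://wordnet.dk/dannet/data/{synset_id}")
    print(synset_id, resp.status_code)
```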

On another note: goodmami/wn does not support loading .zip files, only .gz files or raw .xml files, so it would be preferable if the official release were either .gz or .xml.

Once it is available for release, it would be good to make a PR for goodmami/wn such that DanNet can be listed as an officially supported wordnet there, making it much easier to import while creating awareness of its compatibility. I'd be happy to offer my assistance if needed.
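For context, being in the index is what enables the one-line setup (a sketch; the project id is hypothetical until such a PR lands):

```python
import wn

# Hypothetical project id; the actual id would be settled in the index PR.
wn.download("dannet")        # latest indexed version
wn.download("dannet:2024")   # or pinned to a specific version
```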

Thank you for that excellent feedback. I'll take a look at it next time I'm at work.

> Great! I did not expect such a short turnaround on this, it is very much appreciated! Thanks!

No problem! Let's say that it's the data transformation magic of Clojure/LISP combined with the fact that most of my colleagues are away at various conferences.

> The first line was missing due to improper XML parsing on their part, and the second due to a missing DOCTYPE declaration.
>
> In any case, with those lines added it started importing the file.

Thanks, that's good to know.

> Ideally, though, the missing referenced <Synset>s would be nice to have! Testing a few of them on https://wordnet.dk/dannet/data/<id> showed that they do exist, indicating that they somehow get lost on export to WN-LMF.

I see.

Yeah, I'm not sure what's going on here. Let me investigate this further. It's probably some logic error in my SPARQL query.

> On another note: goodmami/wn does not support loading .zip files, only .gz files or raw .xml files, so it would be preferable if the official release were either .gz or .xml.

Sure... though it's a single file so decompressing it before use is surely not a huge obstacle? All of the datasets we have available for download are zipped as they would be much larger downloads otherwise.

Maybe it makes sense to make that file .gz. I'll have to think about it.

> Once it is available for release, it would be good to make a PR for goodmami/wn such that DanNet can be listed as an officially supported wordnet there, making it much easier to import while creating awareness of its compatibility. I'd be happy to offer my assistance if needed.

Yes, definitely!

Try this one: dannet-wn-lmf.zip

Also, can you please share how you're loading these files in Python using the wn library? It would help me to debug on my end.

I've tested the following file and made sure that the XML file contained in it can be opened in goodmami/wn:

dannet-wn-lmf.zip

> Also, can you please share how you're loading these files in Python using the wn library? It would help me to debug on my end.

Apologies for not getting back to you on this, but I'm happy you got it working.

I've also tried it and I can confirm it works! Thanks a bunch!

> Sure... though it's a single file so decompressing it before use is surely not a huge obstacle? All of the datasets we have available for download are zipped as they would be much larger downloads otherwise.
>
> Maybe it makes sense to make that file .gz. I'll have to think about it.

Yeah, it wouldn't be a huge obstacle, but it would make it impossible to add DanNet to the index at goodmami/wn, since the index does not support zip files. That would make it less discoverable through their documentation: in the worst case, users would conclude that only the OMW DanNet is supported, and in the best case they would have to dig further into the documentation, figure out that you can import files, look up DanNet, download and unzip the zip file, and then load the XML.

If zip is preferred to keep the available download formats homogeneous, I would suggest maintaining both a zip and a gz, just for compatibility with goodmami/wn.
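Producing the extra .gz alongside the zip should be cheap; for illustration, in Python (a sketch; file names are placeholders):

```python
import gzip
import shutil

# Write a gzip copy of the WN-LMF XML next to the existing zip artifact.
with open("dannet-wn-lmf.xml", "rb") as src:
    with gzip.open("dannet-wn-lmf.xml.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
```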

@hallundbaek The gzip WN-LMF dataset is included in the latest release: https://github.com/kuhumcst/DanNet/releases/tag/v2024-06-12