FREME NER: Poor entity spotting in German
Closed this issue · 14 comments
I'm using Postman, but this is the equivalent using curl:
curl -X POST --header "Content-Type: text/plain;charset=UTF-8" --header "Accept: application/ld+json" -d @e-entity-test.txt "http://api.freme-project.eu/current/e-entity/freme-ner/documents?language=de&dataset=dbpedia&mode=spot" > out-e-entity.txt
I get back only 2 entities
{
  "@id" : "http://freme-project.eu/#char=1950,1960",
  "@type" : [ "nif:RFC5147String", "nif:String", "nif:Phrase", "nif:Word" ],
  "nif:anchorOf" : "Passiflora",
  "beginIndex" : "1950",
  "endIndex" : "1960",
  "referenceContext" : "http://freme-project.eu/#char=0,1999",
  "itsrdf:taConfidence" : 0.5918798367900452
}, {
  "@id" : "http://freme-project.eu/#char=783,793",
  "@type" : [ "nif:RFC5147String", "nif:String", "nif:Phrase", "nif:Word" ],
  "nif:anchorOf" : "Passiflora",
  "beginIndex" : "783",
  "endIndex" : "793",
  "referenceContext" : "http://freme-project.eu/#char=0,1999",
  "itsrdf:taConfidence" : 0.9796871663494048
}
The strange part: if the language is set to "en", I get many more entities.
I don't see any problem here. FREME NER successfully spotted three named entities - that is the name of the flower - Passiflora. In this text no other named entities occur except that named entity. I recommend that you represent textual documents as a set of named entities and terms. In this text, there is only one entity (Passiflora) but many other terms.
I have this text now: e-entity-test.txt, which is the content of the article from this blog: http://www.gartenjournal.net/passionsblume-duengen.
When this content was indexed, I got ([{"tag":"Passiflora","score":"0.94"}]) as the only entity and ([{"term":"Passionsblume","score":"1.00"}]) as the only term. One entity and one term.
There may have been more, but I remove duplicates.
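For readers following along, here is a minimal sketch of what the duplicate removal could look like for entity nodes shaped like the ones in the response above. The function name `dedupe_entities` and the policy of keeping the highest confidence per surface form are assumptions for illustration; the actual indexing code is not shown in this thread and may aggregate scores differently.

```python
import json

# Two entity nodes in the shape shown in the response above (fields trimmed).
RESPONSE_FRAGMENT = """
[
  {"nif:anchorOf": "Passiflora", "beginIndex": "1950", "endIndex": "1960",
   "itsrdf:taConfidence": 0.5918798367900452},
  {"nif:anchorOf": "Passiflora", "beginIndex": "783", "endIndex": "793",
   "itsrdf:taConfidence": 0.9796871663494048}
]
"""


def dedupe_entities(nodes):
    """Collapse mentions with the same surface form into one entry.

    Assumption: the highest confidence wins. This is one possible policy,
    not necessarily the one used by the application in this thread.
    """
    best = {}
    for node in nodes:
        anchor = node.get("nif:anchorOf")
        confidence = node.get("itsrdf:taConfidence")
        if anchor is None or confidence is None:
            continue  # skip nodes that are not entity mentions
        if anchor not in best or confidence > best[anchor]:
            best[anchor] = confidence
    return [{"tag": tag, "score": f"{score:.2f}"}
            for tag, score in sorted(best.items())]


print(dedupe_entities(json.loads(RESPONSE_FRAGMENT)))
```

With this policy the two "Passiflora" mentions collapse into a single entry, which is why a text can appear to have "one entity" even though the service returned several mentions.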
Obviously I don't know German, but given the number of words in the text I knew that something was not right. So, using Postman, I performed the calls manually with exactly the same headers and parameters that I use in my code.
curl for e-Entity
curl -X POST --header "Content-Type: text/plain;charset=UTF-8" --header "Accept: application/ld+json" -d @e-entity-test.txt "http://api.freme-project.eu/current/e-entity/freme-ner/documents?language=de&dataset=dbpedia&mode=spot&domain=TaaS-1100" | python -mjson.tool
curl for e-Terminology
curl -X POST --header "Content-Type: text/plain;charset=UTF-8" --header "Accept: application/ld+json" -d @e-entity-test.txt "http://api.freme-project.eu/current/e-terminology/tilde?target-lang=de&source-lang=de&mode=annotation&domain=TaaS-1100" | python -mjson.tool
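Since the calls are also made from code, a rough Python equivalent of the e-Entity request may help when comparing parameters. This is only a sketch: `build_entity_request` is a hypothetical helper, the sample sentence is a made-up placeholder rather than the article text, and the request is built but not sent, so nothing here depends on the endpoint being reachable.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the curl commands above.
BASE = "http://api.freme-project.eu/current/e-entity/freme-ner/documents"


def build_entity_request(text, language, dataset="dbpedia", mode="spot", domain=None):
    """Build (but do not send) the POST request from the curl call above."""
    params = {"language": language, "dataset": dataset, "mode": mode}
    if domain is not None:
        params["domain"] = domain  # e.g. "TaaS-1100", as in the curl call
    return Request(
        url=f"{BASE}?{urlencode(params)}",
        data=text.encode("utf-8"),
        headers={
            "Content-Type": "text/plain;charset=UTF-8",
            "Accept": "application/ld+json",
        },
        method="POST",
    )


# Placeholder sentence for illustration only.
req = build_entity_request("Die Passiflora braucht viel Dünger.", language="de")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it. One difference worth knowing: curl's `-d @file` strips newlines and carriage returns from the file before sending, while the bytes passed as `data` here go out unchanged, so character offsets in the response can differ between the two.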
I don't know about the quality, but I now get more entities and more terms back.
Any hints why? My first thought is that the servers failed during indexing. If that is the case, I have a Python script that fetches the content from the user's website, so I don't have to bother him by asking him to resend it.
Actually, if I use language=en I get more entities, even though the content is in German, and with language=de I get fewer: only one. Well, two, but both are the same entity, "Passiflora".
I just checked again.
This must be a bug.
@Xfran this is more of an enhancement/feature request than a bug. e-Terminology is supposed to give back terms and phrases, while e-Entity is supposed to give named entities only (usually proper nouns: names of persons/orgs/locations/books/things), and that's how the freme-ner NER models are trained. The German example text doesn't have many of those. The English spotter run on German text obviously gives you more results because of the capitalised words (which it incorrectly deems to be names).
Google translate example:
I love to play soccer --> Ich liebe Fußball zu spielen
Ideally there would be a single tool with configurable granularity of entities. But as of now, why not use both e-Entity and e-Terminology and combine the results?
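A minimal sketch of what combining the two result sets could look like, assuming the simplified shapes quoted earlier in this thread (`{"tag": ..., "score": ...}` for entities and `{"term": ..., "score": ...}` for terms). The function name and the `kind` field are hypothetical; keeping `kind` lets a consumer still treat terms and entities differently after merging.

```python
def combine_annotations(entities, terms):
    """Merge e-Entity and e-Terminology results, keyed by surface form.

    Input shapes are assumptions modelled on the indexed output quoted
    earlier in this thread, not an official FREME format.
    """
    combined = {}
    for item in entities:
        combined[item["tag"]] = {
            "label": item["tag"], "kind": "entity", "score": float(item["score"]),
        }
    for item in terms:
        # If both services return the same surface form, the entity entry wins.
        combined.setdefault(item["term"], {
            "label": item["term"], "kind": "term", "score": float(item["score"]),
        })
    # Highest score first, label as a tie-breaker for stable output.
    return sorted(combined.values(), key=lambda a: (-a["score"], a["label"]))


entities = [{"tag": "Passiflora", "score": "0.94"}]
terms = [{"term": "Passionsblume", "score": "1.00"}]
for annotation in combine_annotations(entities, terms):
    print(annotation)
```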
"Why not use both e-Entity and e-Terminology and combine the results?"
@nilesh-c because we use terms for different things.
I tried Spanish, which is one of the languages I can handle.
Approximately 99% of the entities are the same with language=en and language=es as the parameter.
The only thing that differs is the confidence for each entity. Surprisingly, some of the entities have higher confidence in English than in Spanish.
Check out these files:
resp_es.txt
resp_en.txt
Use vimdiff to see the differences.
If this is not a bug, maybe I should just use "en". I have to talk about this with @koidl.
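As an alternative to eyeballing a vimdiff, the comparison can be sketched in a few lines of Python. This assumes each response has already been loaded into a list of entity nodes carrying `nif:anchorOf` and `itsrdf:taConfidence`, as in the response shown at the top of this issue; where exactly those nodes sit inside the JSON-LD document is not covered here. The sample nodes are made up and are not taken from resp_es.txt or resp_en.txt.

```python
def confidences(nodes):
    """Map each surface form to its highest confidence in one response."""
    result = {}
    for node in nodes:
        anchor = node.get("nif:anchorOf")
        score = node.get("itsrdf:taConfidence")
        if anchor is not None and score is not None:
            result[anchor] = max(score, result.get(anchor, 0.0))
    return result


def compare(nodes_a, nodes_b):
    """Report entities unique to each response and confidence differences."""
    a, b = confidences(nodes_a), confidences(nodes_b)
    shared = sorted(set(a) & set(b))
    return {
        "only_a": sorted(set(a) - set(b)),
        "only_b": sorted(set(b) - set(a)),
        # Positive delta: response A is more confident than response B.
        "confidence_delta": {k: round(a[k] - b[k], 4) for k in shared},
    }


# Made-up nodes for illustration only.
es_nodes = [
    {"nif:anchorOf": "Madrid", "itsrdf:taConfidence": 0.80},
    {"nif:anchorOf": "Sevilla", "itsrdf:taConfidence": 0.70},
]
en_nodes = [{"nif:anchorOf": "Madrid", "itsrdf:taConfidence": 0.90}]
print(compare(es_nodes, en_nodes))
```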
"Actually, if I use language=en I get more entities, even though the content is in German, and with language=de I get fewer: only one. Well, two, but both are the same entity, "Passiflora". I just checked again. This must be a bug."
In the document only Passiflora is an entity, and FREME NER with language=de successfully spots it. Selecting language=en might spot other entities, in your case more entities, but those are incorrectly spotted. In German (I don't speak German), there can be nouns that start with a capital letter, e.g. Ich liebe Fußball zu spielen. Setting language=en and processing German text will probably spot Fußball as an entity, since it is a word that starts with a capital letter. Note that capitalization is not the only feature considered when spotting entities; the overall sentence structure and the words before and after the entity candidate are considered as well.
So, if you process English text, set language=en; for German text, set language=de. If you try to trick FREME NER and set a different language, this will probably lead to incorrectly spotted entities.
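To make the capitalization point concrete, here is a deliberately naive toy spotter. It is not how FREME NER works (as noted above, the real models also use sentence structure and context); it isolates the single cue "capitalized word that is not sentence-initial" to show why that cue misfires on German text. The function name is hypothetical.

```python
import re


def naive_capitalized_spotter(sentence):
    """Return capitalized words other than the first word of the sentence.

    Toy heuristic for illustration only; it uses capitalization as the
    sole feature, which a real NER model does not.
    """
    words = re.findall(r"\w+", sentence)
    return [word for word in words[1:] if word[0].isupper()]


print(naive_capitalized_spotter("I love to play soccer"))         # []
print(naive_capitalized_spotter("Ich liebe Fußball zu spielen"))  # ['Fußball']
```

In English the cue finds nothing in this sentence, while in German the common noun Fußball is flagged, which matches the kind of false positive described above.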
OK. And what about Spanish then? I speak Spanish. The entities are not bad.
Did you have a chance to take a look at those files?
I didn't check the types, though.
"OK. And what about Spanish then? I speak Spanish. The entities are not bad."
I would not recommend tricking FREME NER. If you think it makes sense to process English texts with the Spanish NER model (or the other way around), just go ahead, but I would not recommend doing that.
The English entity spotting model was trained on English texts, the Spanish one on Spanish texts.
Maybe the English and Spanish languages have a similar written structure, so the Spanish entity spotting model might work OK on English texts.
"If you try to trick FREME NER and set a different language, this will probably lead to incorrectly spotted entities."
It would be an interesting feature for FREME to automatically detect the content language and whether it is supported by FREME NER.
"Maybe the English and Spanish languages have a similar written structure."
Believe me, Latin languages such as Italian, Spanish, French and Romanian have nothing to do with the structure of English or of Germanic languages in general.
OK. If it is not considered a bug, I will close the issue.
"It would be an interesting feature for FREME to automatically detect the content language and whether it is supported by FREME NER."
OK, let me research if having auto detection of language can be integrated and what resources are available.
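As a starting point for that discussion, the idea can be sketched with a toy stopword-based guesser. Everything here is an assumption for illustration: the tiny stopword lists, the `supported` tuple (not FREME's actual list of supported languages) and the function name. Real language detectors typically rely on character n-gram statistics and are far more robust, especially on short texts.

```python
import re

# Tiny hand-picked stopword lists; purely illustrative, not exhaustive.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "zu", "mit", "ich"},
    "es": {"el", "la", "los", "y", "es", "de", "que", "en", "me"},
}


def guess_language(text, supported=("en", "de", "es")):
    """Return the supported language with the most stopword hits, or None.

    Ties are resolved in the order of `supported`; None means no stopword
    matched, i.e. the language is unknown or unsupported.
    """
    words = re.findall(r"\w+", text.lower())
    scores = {lang: sum(word in STOPWORDS[lang] for word in words)
              for lang in supported}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(guess_language("Ich liebe Fußball zu spielen"))
print(guess_language("I love to play soccer"))
```

A caller could use the result to pick the `language` parameter automatically, or to reject the request when the function returns `None`.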
@Xfran the thing is, the German text you submitted does not have more than 1-2 named entities (assuming we agree on the narrow traditional definition of named entities here) - a German colleague confirmed for me.
@nilesh-c Just click on the Aylien link as it is, then hit the blue "Analyse" button.
They spotted "Passiflora" three times, plus "subtropischen" and "August", as entities. FREME NER spotted "Passiflora" only twice. I use them to compare spotted entities whenever I have doubts.
Sorry, I don't fully understand how entity spotting works, but I still don't get the Spanish example then. I don't want to make a PhD out of this issue. I just want to know, more or less, why there are differences, if any, and how this affects my application and our business case.
If this is not a bug, then I must accept it. That's why I closed the issue.
Thank you for your interest @nilesh-c. I appreciate this.
Just a side note not related to this issue:
For the FREME documentation, and as a user-friendly way to demonstrate how to use FREME services, Aylien is a good inspiration. @fsasaki