freme-project/e-Entity

fremener/documents HTML entity not detected

Closed · 7 comments

When I run NER on an example HTML document, the entity is not recognized. The example is:

<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <div>
       <p>This is Berlin</p>
    </div>
  </body>
</html>

m1ci commented

FREME NER was designed and trained on data without markup. We assume that a client submits cleaned text without markup. This has been stated since the beginning of FREME NER development, and wripl takes it into account as well.

But the API documentation says that text/html is allowed as input, and Felix also said in an email that e-Entity now supports HTML as input and output ...

m1ci commented

Oh, yes, that's true, sorry for the noise. It now supports HTML via the e-Internationalization service.

No entity is recognized because the text is very short. Note that FREME NER was trained on sentences of "normal" length and on correctly formatted text. Your text is short and not grammatically complete: there is no period at the end.

If you add a period at the end or extend the text, e.g. "This is Berlin." or "This is Berlin, a city in Germany.", then the entities will be recognized.
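
A minimal sketch of the suggestion above, assuming the same markup as the original example with only a period added to the sentence. The file name test.html is an assumption chosen to match the cURL call in this thread; the contents of the file actually linked there are not shown here.

```shell
# Write a test document whose sentence ends with a period,
# as suggested above. The markup mirrors the original example.
cat > test.html <<'EOF'
<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <div>
       <p>This is Berlin.</p>
    </div>
  </body>
</html>
EOF
```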

Test cURL

curl -X POST --header "Content-Type: text/html" --header "Accept: text/turtle" -d @test.html "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=all" -v

Here, test.html is the linked file test.html.

Please check and close if this solves the problem.

OK, your example works. However, when I set the Accept header to text/html, I get:

<html>
  <body>
    <h1>Whitelabel Error Page</h1>
    <p>This application has no explicit mapping for /error, so you are seeing this as a fallback</p>
    <div id='created'>Thu Oct 29 09:21:14 CET 2015</div>
    <div>There was an unexpected error (type=Internal Server Error, status=500).</div>
    <div>String index out of range: 701</div>
  </body>
</html>

m1ci commented

Can you share the concrete cURL request? Here is a cURL example with Accept: text/turtle, and it works for me.

curl -X POST --header "Content-Type: text/html" --header "Accept: text/html" -d @test.html "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=all" -v

It's the same call as yours, except that I use this file.
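
To compare the two cases side by side, the call can be wrapped in a small shell function that varies only the Accept header and prints the HTTP status code. This is a sketch, not part of FREME: the function name freme_status is hypothetical, the URL and query parameters are copied from the calls above, and it assumes a test.html in the current directory.

```shell
# Hypothetical helper (not part of FREME): POST test.html to the
# e-Entity endpoint used in this thread and print only the HTTP status.
# $1 is the value of the Accept header, e.g. text/turtle or text/html.
freme_status() {
  curl -s -o /dev/null -w '%{http_code}\n' -X POST \
    --header "Content-Type: text/html" \
    --header "Accept: $1" \
    -d @test.html \
    "http://api-dev.freme-project.eu/current/e-entity/freme-ner/documents?language=en&dataset=dbpedia&mode=all"
}
```

Running freme_status text/turtle and then freme_status text/html against the same file shows whether only the HTML output path fails; in the report above, the text/html case returned status 500.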

m1ci commented

Oh, then this is a bug in the e-Internationalization service. Moving the issue to the e-Internationalization repo: freme-project/e-Internationalization#27