MartinoMensio/spacy-dbpedia-spotlight

400 Client Error for local model

Kittyuzu1207 opened this issue · 9 comments

Hi! Thanks for your great contributions!
When I deployed the model locally and used it in this way:
db_nlp = spacy.blank('en')
db_nlp.add_pipe('dbpedia_spotlight', config={'dbpedia_rest_endpoint': 'http://localhost:2222/rest', 'confidence': 0.35})
I still got the error shown below when sending requests frequently:
2022-10-20 07:32:23.194 | WARNING | spacy_dbpedia_spotlight.entity_linker:get_remote_response:239 - Bad response from server, probably too many requests. Consider using your own endpoint. Document not updated. 400 Client Error: Bad Request for url: http://localhost:2222/rest/annotate

Is this normal, or did I do something wrong? I look forward to your help.

acxcv commented

Hi, I was facing the same problem. I don't know how to fix it and didn't have the capacity to dig deeper, but on a medium-sized dataset (36k texts) I was able to work around it with a wrapper function that falls back to the web API:

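# nlp is the spaCy pipeline from the examples (configured with the local endpoint);
# confidence is the threshold defined elsewhere in the script
def nlp_wrapper(text):
    try:
        return nlp(text)
    except Exception as e:
        print(e)
        try:
            # fall back to the default configuration, which calls the public web API
            except_nlp = spacy.blank('en')
            except_nlp.add_pipe('dbpedia_spotlight', config={'confidence': confidence})
            return except_nlp(text)
        except Exception:
            # error token, so the text can be reprocessed in a second pass
            return '[[ERROR]]'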

With this snippet, my script processed as many texts as possible on the local server and fell back to the standard configuration using the web API.
In case the call to the web API returned an error, too, it returned an error token that let me process the texts that produced an error later on in a second iteration.

On a side note:
I noticed that the behavior of the nlp object changes depending on the value for 'process' in the config (nlp._pipe_configs['dbpedia_spotlight']). In my case, with 'annotate' it failed after processing ca. 9500 texts, whereas with 'candidates' it failed after ca. 1500 texts.

Hi @acxcv and @Kittyuzu1207 ,
Thank you for reporting this behaviour!
Could you capture one document that makes the local server fail but still works with the remote API?
I would like to reproduce it and analyse what goes wrong, so if you could provide this example, I can look into it.

Given that you already have the looping and the way to handle the error, you could add some lines to capture these specific documents.

I suspect this may be an encoding error or some character that makes something break, while on the remote API maybe they have some way to clean/escape the text.
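
Capturing the failing documents could be sketched roughly as follows. This is only an illustration: `collect_failures`, `process`, and `log_path` are hypothetical names that are not part of spacy-dbpedia-spotlight, and `process` stands for whatever callable runs the pipeline (for example the `nlp` object). Each failing text is recorded together with its index and the error, so that unusual characters can be inspected afterwards.

```python
import json


def collect_failures(texts, process, log_path=None):
    """Run `process` on every text and record the ones that raise.

    `process` is any callable taking a string (hypothetical stand-in
    for the spaCy pipeline). Returns (results, failures).
    """
    results, failures = [], []
    for index, text in enumerate(texts):
        try:
            results.append(process(text))
        except Exception as exc:
            # Keep a placeholder so results stay aligned with the input order.
            results.append(None)
            failures.append({"index": index, "error": repr(exc), "text": text})
    if log_path is not None:
        # One JSON object per line; ensure_ascii=False keeps characters readable.
        with open(log_path, "w", encoding="utf-8") as handle:
            for record in failures:
                handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return results, failures
```

Calling `collect_failures(texts, nlp, "failures.jsonl")` would then leave one line per failing document in the file, which can be shared or replayed against both endpoints.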

Thank you again!

Martino

acxcv commented

Hi Martino, this is where it gets weird.

The texts in question produce an error inside the loop but not when I process them outside the loop. Also, some texts that don't produce an error with 'process': 'annotate' do so with 'process': 'candidates'.

The dataset I'm working with is a private one. Let me verify if I can share it and I'll get back to you.

Hi @acxcv ,
Oh, then maybe the issue is caused by firing too many requests in a short time. Could you check the logs from the local server? What happens if, when the exception is triggered, you wait for 1 second and retry with the same document?
This is just to better understand the type of problem.
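
A minimal sketch of the wait-and-retry idea, assuming the pipeline raises an exception on a bad response as in the report above. The function name `call_with_retry` and its parameters are hypothetical and not part of the library; `process` stands for the `nlp` object or any callable taking a text.

```python
import time


def call_with_retry(process, text, retries=1, wait_seconds=1.0, sleep=time.sleep):
    """Call `process(text)`; on an exception, wait and retry the same text.

    `sleep` is injectable so the waiting can be replaced in tests.
    The last exception is re-raised if every attempt fails.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return process(text)
        except Exception as exc:
            last_error = exc
            if attempt < retries:
                # Give the server a moment before sending the same document again.
                sleep(wait_seconds)
    raise last_error
```

If the retry succeeds after the pause, that would point towards a load or timing problem on the server; if the same document keeps failing, the content of the document is the more likely cause.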

Martino

acxcv commented

For transparency: I shared the dataset in the meantime.

Additionally, I was able to come up with an isolated test case where the same error occurs. This does not explain why certain texts could not be processed within a loop but worked fine outside a loop in the situation I described above. However, it looks like a related problem since it triggers the same error.

In a Python shell, I set up the spotlight pipeline as usual:

import spacy
nlp = spacy.blank('en')
nlp.add_pipe('dbpedia_spotlight', config={
    'dbpedia_rest_endpoint': 'http://localhost:2222/rest'})

A sample text from the Wikipedia page about Barcelona:

text = 'Barcelona is a city on the coast of northeastern Spain. It is the capital and largest city of the autonomous community of Catalonia, as well as the second most populous municipality of Spain. With a population of 1.6 million within city limits, its urban area extends to numerous neighbouring municipalities within the Province of Barcelona and is home to around 4.8 million people, making it the fifth most populous urban area in the European Union after Paris, the Ruhr area, Madrid, and Milan. It is one of the largest metropolises on the Mediterranean Sea, located on the coast between the mouths of the rivers Llobregat and Besòs, and bounded to the west by the Serra de Collserola mountain range, the tallest peak of which is 512 metres (1,680 feet) high. Founded as a Roman city, in the Middle Ages Barcelona became the capital of the County of Barcelona. After joining with the Kingdom of Aragon to form the confederation of the Crown of Aragon, Barcelona, which continued to be the capital of the Principality of Catalonia, became the most important city in the Crown of Aragon and the main economic and administrative centre of the Crown, only to be overtaken by Valencia, wrested from Arab domination by the Catalans, shortly before the dynastic union between the Crown of Castile and the Crown of Aragon in 1492. Barcelona has a rich cultural heritage and is today an important cultural centre and a major tourist destination. Particularly renowned are the architectural works of Antoni Gaudí and Lluís Domènech i Montaner, which have been designated UNESCO World Heritage Sites. The city is home to two of the most prestigious universities in Spain: the University of Barcelona and Pompeu Fabra University. The headquarters of the Union for the Mediterranean are located in Barcelona. The city is known for hosting the 1992 Summer Olympics as well as world-class conferences and expositions and also many international sport tournaments.'

Now, when I call nlp(text) I receive the error described above:

>>> nlp(text)
2023-02-02 14:56:01.000 | WARNING  | spacy_dbpedia_spotlight.entity_linker:get_remote_response:239 - Bad response from server, probably too many requests. Consider using your own endpoint. Document not updated.
	400 Client Error: Bad Request for url: http://localhost:2222/rest/annotate
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "[...]/tenv/lib/python3.7/site-packages/spacy/language.py", line 1031, in __call__
    error_handler(name, proc, [doc], e)
  File "[...]/tenv/lib/python3.7/site-packages/spacy/util.py", line 1670, in raise_error
    raise e
  File "[...]/tenv/lib/python3.7/site-packages/spacy/language.py", line 1026, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
  File "[...]/tenv/lib/python3.7/site-packages/spacy_dbpedia_spotlight/entity_linker.py", line 266, in __call__
    data = self.get_remote_response(doc)
  File "[...]/tenv/lib/python3.7/site-packages/spacy_dbpedia_spotlight/entity_linker.py", line 242, in get_remote_response
    raise e
  File "[...]/tenv/lib/python3.7/site-packages/spacy_dbpedia_spotlight/entity_linker.py", line 234, in get_remote_response
    response.raise_for_status()
  File "[...]/tenv/lib/python3.7/site-packages/requests/models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 400 Client Error: Bad Request for url: http://localhost:2222/rest/annotate

I'm running the Spotlight server via Java without Docker:

java -Xms32g -jar dbpedia-spotlight-1.0.0.jar en http://localhost:2222/rest

This is the corresponding log:

124 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Loading MemoryQuantizedCountStore...
228 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Done (104 ms)
230 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Loading MemoryTokenTypeStore...
1839 [main] INFO org.dbpedia.spotlight.db.memory.MemoryTokenTypeStore - Creating reverse-lookup for Tokens.
2320 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Done (2089 ms)
2321 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Loading MemorySurfaceFormStore...
15662 [main] INFO org.dbpedia.spotlight.db.memory.MemorySurfaceFormStore - Summing total SF counts.
16160 [main] INFO org.dbpedia.spotlight.db.memory.MemorySurfaceFormStore - Creating reverse-lookup for surface forms, adding normalized surface forms.
17305 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Done (14983 ms)
17305 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Loading MemoryResourceStore...
24681 [main] INFO org.dbpedia.spotlight.db.memory.MemoryResourceStore - Creating reverse-lookup for DBpedia resources.
25405 [main] INFO org.dbpedia.spotlight.db.memory.MemoryResourceStore - Counting total support...
25599 [main] INFO org.dbpedia.spotlight.db.memory.MemoryResourceStore - Done.
25599 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Done (8294 ms)
25600 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Loading MemoryCandidateMapStore...
38812 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Done (13211 ms)
38813 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Loading MemoryContextStore...
65526 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Done (26712 ms)
70511 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Loading FSADictionary...
70884 [main] INFO org.dbpedia.spotlight.db.memory.MemoryStore$ - Done (373 ms)
71014 [main] INFO org.dbpedia.spotlight.web.rest.Server - Initiated 1 disambiguators.
71014 [main] INFO org.dbpedia.spotlight.web.rest.Server - Initiated 2 spotters.
Feb 02, 2023 2:55:32 PM com.sun.grizzly.Controller logVersion
INFORMATION: GRIZZLY0001: Starting Grizzly Framework 1.9.48 - 02.02.23 14:55
Server started in [...]/spotlight-java listening on http://localhost:2222/rest
Feb 02, 2023 2:55:52 PM com.sun.jersey.api.core.PackagesResourceConfig init
INFORMATION: Scanning for root resource and provider classes in the packages:
  org.dbpedia.spotlight.web.rest.resources
Feb 02, 2023 2:55:55 PM com.sun.jersey.api.core.ScanningResourceConfig logClasses
INFORMATION: Root resource classes found:
  class org.dbpedia.spotlight.web.rest.resources.Annotate
  class org.dbpedia.spotlight.web.rest.resources.Disambiguate
  class org.dbpedia.spotlight.web.rest.resources.Spot
  class org.dbpedia.spotlight.web.rest.resources.Feedback
  class org.dbpedia.spotlight.web.rest.resources.Candidates
Feb 02, 2023 2:55:55 PM com.sun.jersey.api.core.ScanningResourceConfig init
INFORMATION: No provider classes found.
Feb 02, 2023 2:56:00 PM com.sun.jersey.server.impl.application.WebApplicationImpl _initiate
INFORMATION: Initiating Jersey application, version 'Jersey: 1.19.3 10/24/2016 03:58 PM'
Feb 02, 2023 2:56:00 PM com.sun.jersey.api.wadl.config.WadlGeneratorLoader loadWadlGenerator
INFORMATION: Loading wadlGenerator org.dbpedia.spotlight.web.rest.wadl.ExternalUriWadlGenerator
99322 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - ******************************** Parameters ********************************
99322 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - API: /annotate
99322 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - client ip: 127.0.0.1
99322 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - text: Barcelona is a city on the coast of northeastern Spain. It is the capital and largest city of the autonomous community of Catalonia, as well as the second most populous municipality of Spain. With a population of 1.6 million within city limits, its urban area extends to numerous neighbouring municipalities within the Province of Barcelona and is home to around 4.8 million people, making it the fifth most populous urban area in the European Union after Paris, the Ruhr area, Madrid, and Milan. It is one of the largest metropolises on the Mediterranean Sea, located on the coast between the mouths of the rivers Llobregat and Besòs, and bounded to the west by the Serra de Collserola mountain range, the tallest peak of which is 512 metres (1,680 feet) high. Founded as a Roman city, in the Middle Ages Barcelona became the capital of the County of Barcelona. After joining with the Kingdom of Aragon to form the confederation of the Crown of Aragon, Barcelona, which continued to be the capital of the Principality of Catalonia, became the most important city in the Crown of Aragon and the main economic and administrative centre of the Crown, only to be overtaken by Valencia, wrested from Arab domination by the Catalans, shortly before the dynastic union between the Crown of Castile and the Crown of Aragon in 1492. Barcelona has a rich cultural heritage and is today an important cultural centre and a major tourist destination. Particularly renowned are the architectural works of Antoni Gaudí and Lluís Domènech i Montaner, which have been designated UNESCO World Heritage Sites. The city is home to two of the most prestigious universities in Spain: the University of Barcelona and Pompeu Fabra University. The headquarters of the Union for the Mediterranean are located in Barcelona. 
The city is known for hosting the 1992 Summer Olympics as well as world-class conferences and expositions and also many international sport tournaments.
99322 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - text length in chars: 1950
99323 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - confidence: 0.5
99323 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - support: 0
99323 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - types: 
99323 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - sparqlQuery: 
99323 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - policy: false
99323 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - coreferenceResolution: true
99323 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - spotter: Default
99323 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.web.rest.SpotlightInterface - disambiguator: Default
Feb 02, 2023 2:56:00 PM com.github.fommil.netlib.BLAS <clinit>
WARNUNG: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Feb 02, 2023 2:56:00 PM com.github.fommil.netlib.BLAS <clinit>
WARNUNG: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
99783 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.filter.annotations.TypeFilter - types are empty: showing all types
99787 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.filter.annotations.PercentageOfSecondFilter - (c=0.5) filtered out by threshold of second ranked percentage (0,826>0,750): SurfaceForm[Aragon] -0,506-> DBpediaResource[Crown_of_Aragon(Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Country)] - at position *897* in - Text[... y of Barcelona. After joining with the Kingdom of Aragon to form the confederation of the Crown of A ...]
99787 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.filter.annotations.PercentageOfSecondFilter - (c=0.5) filtered out by threshold of second ranked percentage (0,826>0,750): SurfaceForm[Aragon] -0,506-> DBpediaResource[Crown_of_Aragon(Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Country)] - at position *946* in - Text[...  Aragon to form the confederation of the Crown of Aragon, Barcelona, which continued to be the capit ...]
99787 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.filter.annotations.PercentageOfSecondFilter - (c=0.5) filtered out by threshold of second ranked percentage (0,826>0,750): SurfaceForm[Aragon] -0,506-> DBpediaResource[Crown_of_Aragon(Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Country)] - at position *1080* in - Text[... a, became the most important city in the Crown of Aragon and the main economic and administrative ce ...]
99788 [Grizzly-2222(1)] INFO org.dbpedia.spotlight.filter.annotations.PercentageOfSecondFilter - (c=0.5) filtered out by threshold of second ranked percentage (0,826>0,750): SurfaceForm[Aragon] -0,506-> DBpediaResource[Crown_of_Aragon(Wikidata:Q6256,Schema:Place,Schema:Country,DBpedia:PopulatedPlace,DBpedia:Place,DBpedia:Location,DBpedia:Country)] - at position *1309* in - Text[... ion between the Crown of Castile and the Crown of Aragon in 1492. Barcelona has a rich cultural heri ...]

In the same shell, I've tested substrings of the same text or other texts, and they work fine.
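
One way to narrow this down further would be to grow the text sentence by sentence and stop at the first prefix that fails. The helper below is only a sketch under that assumption: `shortest_failing_prefix` is a hypothetical name, the split on ". " is a crude sentence boundary, and `process` stands for the `nlp` object or any callable that raises on failure.

```python
def shortest_failing_prefix(text, process, separator=". "):
    """Return the shortest sentence-wise prefix of `text` for which
    `process` raises, or None if no prefix fails.
    """
    parts = text.split(separator)
    for end in range(1, len(parts) + 1):
        # Rebuild the text from the first `end` pieces.
        candidate = separator.join(parts[:end])
        try:
            process(candidate)
        except Exception:
            return candidate
    return None
```

If the function returns the full text only, the failure depends on the text as a whole (for example its length) rather than on a single sentence.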

Processing the same text using the standard pipe configuration where the API is called works fine, too.

Hope this helps understand the problem a bit better.

EDIT: Added the log from the spotlight endpoint

Hi @acxcv ,
Yesterday I tried the collection you shared, and unfortunately I could not reproduce the issue.

I have now tried your example as well, and no exception occurs. I am wondering what the cause may be.

I see that you are running the DBpedia Spotlight jar 1.0.0, while I am currently running version 1.1. Could you try the new version and let me know whether the situation is the same?

(From README.md)

# download main jar
wget https://repo1.maven.org/maven2/org/dbpedia/spotlight/rest/1.1/rest-1.1-jar-with-dependencies.jar
# download latest model (last checked on 10/10/2022) (assuming en model)
wget -O en.tar.gz http://downloads.dbpedia.org/repo/dbpedia/spotlight/spotlight-model/2022.03.01/spotlight-model_lang=en.tar.gz
# extract model
tar xzf en.tar.gz
# run server
java -Xmx8G -jar rest-1.1-jar-with-dependencies.jar en http://localhost:2222/rest

Martino

acxcv commented

Worked flawlessly with dbpedia jar 1.1. Thank you!

That's very good news!

@Kittyuzu1207 could you check whether the upgrade also solves the issue for you?

Martino

Yes! The problem has been solved. Thank you for your efforts!