Error by Voikko analyzer: `ValueError: character U+fb50067 is not in range [U+0000; U+10ffff]`
juhoinkinen opened this issue · 13 comments
Word normalization by Voikko has failed on at least two occasions when using Annif v1.0 at Finto AI, resulting in a 500 error response to the suggest request. On both occasions the error was reproducible by sending a suggest request containing just the triggering word. Once noticed, the situation was fixed by restarting Annif, after which the word no longer produced the error.
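For the record, a failing request can be reproduced with a single-word suggest call. Here is a minimal sketch using the public REST API (endpoint URL as used by Finto AI; adjust for your instance, and note that the word shown is a placeholder, not one of the actual triggering words):

```python
import requests

# Hypothetical reproduction: POST a single word to the suggest endpoint.
# "esimerkki" is a placeholder; substitute the word that triggers the error.
resp = requests.post(
    "https://ai.finto.fi/v1/projects/yso-fi/suggest",
    data={"text": "esimerkki", "limit": 10},
)
print(resp.status_code)  # 500 when the Voikko analysis fails, 200 otherwise
```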
The logs show the following error (the character code has differed between occasions):
```
ERROR:annif:Exception on /v1/projects/yso-fi/suggest [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/decorator.py", line 68, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/validation.py", line 196, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/validation.py", line 399, in wrapper
    return function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/parameter.py", line 120, in wrapper
    return function(**kwargs)
  File "/Annif/annif/rest.py", line 122, in suggest
    result = _suggest(project_id, documents, parameters)
  File "/Annif/annif/rest.py", line 174, in _suggest
    return _hit_sets_to_list(hit_sets, project.subjects, lang)
  File "/Annif/annif/rest.py", line 99, in _hit_sets_to_list
    return [
  File "/Annif/annif/rest.py", line 99, in <listcomp>
    return [
  File "/Annif/annif/suggestion.py", line 159, in <genexpr>
    (batch.filter(limit, threshold) for batch in self.batches)
  File "/Annif/annif/project.py", line 234, in <genexpr>
    self.suggest([doc.text for doc in doc_batch], backend_params)
  File "/Annif/annif/project.py", line 253, in suggest
    return self._suggest_with_backend(texts, backend_params)
  File "/Annif/annif/project.py", line 143, in _suggest_with_backend
    return self.backend.suggest(texts, beparams)
  File "/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(texts, params=beparams)
  File "/Annif/annif/backend/ensemble.py", line 71, in _suggest_batch
    batch_by_source = self._suggest_with_sources(texts, sources)
  File "/Annif/annif/backend/ensemble.py", line 45, in _suggest_with_sources
    return {
  File "/Annif/annif/backend/ensemble.py", line 46, in <dictcomp>
    project_id: self.project.registry.get_project(project_id).suggest(texts)
  File "/Annif/annif/project.py", line 253, in suggest
    return self._suggest_with_backend(texts, backend_params)
  File "/Annif/annif/project.py", line 143, in _suggest_with_backend
    return self.backend.suggest(texts, beparams)
  File "/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(texts, params=beparams)
  File "/Annif/annif/backend/backend.py", line 129, in _suggest_batch
    [self._suggest(text, params) for text in texts],
  File "/Annif/annif/backend/backend.py", line 129, in <listcomp>
    [self._suggest(text, params) for text in texts],
  File "/Annif/annif/backend/mllm.py", line 166, in _suggest
    candidates = self._generate_candidates(text)
  File "/Annif/annif/backend/mllm.py", line 153, in _generate_candidates
    return self._model.generate_candidates(text, self.project.analyzer)
  File "/Annif/annif/lexical/mllm.py", line 156, in generate_candidates
    return generate_candidates(text, analyzer, self._vectorizer, self._index)
  File "/Annif/annif/lexical/mllm.py", line 87, in generate_candidates
    sent_tokens = vectorizer.transform(sentences)
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1434, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1276, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 112, in _analyze
    doc = tokenizer(doc)
  File "/Annif/annif/analyzer/analyzer.py", line 47, in tokenize_words
    return [
  File "/Annif/annif/analyzer/analyzer.py", line 48, in <listcomp>
    self._normalize_word(word)
  File "/Annif/annif/analyzer/voikko.py", line 30, in _normalize_word
    result = self.voikko.analyze(word)
  File "/usr/local/lib/python3.10/site-packages/voikko/libvoikko.py", line 682, in analyze
    value = self.__lib.voikko_mor_analysis_value_ucs4(cAnalysis, key)
ValueError: character U+fb50067 is not in range [U+0000; U+10ffff]
```
I'll make a PR to help analyze the bug in more detail.

What other character codes besides U+fb50067 have we seen? Could these be used to reverse-engineer a possible explanation? For example, if a UTF-8 sequence somehow got cut in the middle, could that cause this?
The other character code was U+fb5005d (the case reported by Anis), which arose from this document.

Both U+fb50067 and U+fb5005d are way beyond the currently valid range of Unicode characters (which ends at U+10FFFF), and it shouldn't even be possible to represent them in current UTF-8. However, the obsolete RFC 2279 does define UTF-8 sequences for a much larger range (up to U+7FFFFFFF!), so it is at least theoretically possible that the input contains some byte sequence that, when decoded by an RFC 2279 based UTF-8 decoder, produces these characters in the UCS-4 (four-byte) representation that Voikko appears to use internally (judging by the method name seen in the traceback, voikko_mor_analysis_value_ucs4).
A different but possibly related error, which made Annif restart by itself, occurred at Finto AI (the production instance, not the test instance):

```
python: morphology/FinnishVfstAnalyzer.cpp:166: void libvoikko::morphology::decreaseCharsMissing(size_t&, size_t, size_t): Assertion `false' failed.
[ERROR] Worker (pid:85) was sent code 134!
```
Now the ValueError was triggered on the test instance ai.dev.finto.fi, after about one day of continuous suggest requests using fuzzed text contents of Jyväskylä theses.

At 10:57 the logs have these lines, after which Annif restarts (oldest line first):

```
corrupted size vs. prev_size
[2023-11-07 10:56:59 +0200] [1] [ERROR] Worker (pid:7) was sent code 134!
[2023-11-07 10:56:59 +0200] [47] [INFO] Booting worker with pid: 47
```

I remember seeing corrupted size vs. prev_size before, and I think it did not cause further errors then.
Then just after Annif restarted:
```
INFO:annif:finished initializing projects
ERROR:annif:Exception on /v1/projects/yso-fi/suggest [POST]
Traceback (most recent call last):
  File "/Annif/annif/analyzer/voikko.py", line 33, in _normalize_word
    result = self.voikko.analyze(word)
  File "/usr/local/lib/python3.10/site-packages/voikko/libvoikko.py", line 682, in analyze
    value = self.__lib.voikko_mor_analysis_value_ucs4(cAnalysis, key)
ValueError: character U+3e3ffe4e is not in range [U+0000; U+10ffff]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Annif/annif/analyzer/analyzer.py", line 50, in tokenize_words
    return [
  File "/Annif/annif/analyzer/analyzer.py", line 51, in <listcomp>
    self._normalize_word(word)
  File "/Annif/annif/analyzer/voikko.py", line 35, in _normalize_word
    raise OperationFailedException(
annif.exception.OperationFailedException: Voikko error in analysis of word 'eroa'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/decorator.py", line 68, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/validation.py", line 196, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/validation.py", line 399, in wrapper
    return function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/parameter.py", line 120, in wrapper
    return function(**kwargs)
  File "/Annif/annif/rest.py", line 122, in suggest
    result = _suggest(project_id, documents, parameters)
  File "/Annif/annif/rest.py", line 174, in _suggest
    return _hit_sets_to_list(hit_sets, project.subjects, lang)
  File "/Annif/annif/rest.py", line 99, in _hit_sets_to_list
    return [
  File "/Annif/annif/rest.py", line 99, in <listcomp>
    return [
  File "/Annif/annif/suggestion.py", line 159, in <genexpr>
    (batch.filter(limit, threshold) for batch in self.batches)
  File "/Annif/annif/project.py", line 234, in <genexpr>
    self.suggest([doc.text for doc in doc_batch], backend_params)
  File "/Annif/annif/project.py", line 253, in suggest
    return self._suggest_with_backend(texts, backend_params)
  File "/Annif/annif/project.py", line 143, in _suggest_with_backend
    return self.backend.suggest(texts, beparams)
  File "/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(texts, params=beparams)
  File "/Annif/annif/backend/ensemble.py", line 71, in _suggest_batch
    batch_by_source = self._suggest_with_sources(texts, sources)
  File "/Annif/annif/backend/ensemble.py", line 45, in _suggest_with_sources
    return {
  File "/Annif/annif/backend/ensemble.py", line 46, in <dictcomp>
    project_id: self.project.registry.get_project(project_id).suggest(texts)
  File "/Annif/annif/project.py", line 253, in suggest
    return self._suggest_with_backend(texts, backend_params)
  File "/Annif/annif/project.py", line 143, in _suggest_with_backend
    return self.backend.suggest(texts, beparams)
  File "/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(texts, params=beparams)
  File "/Annif/annif/backend/backend.py", line 129, in _suggest_batch
    [self._suggest(text, params) for text in texts],
  File "/Annif/annif/backend/backend.py", line 129, in <listcomp>
    [self._suggest(text, params) for text in texts],
  File "/Annif/annif/backend/mllm.py", line 166, in _suggest
    candidates = self._generate_candidates(text)
  File "/Annif/annif/backend/mllm.py", line 153, in _generate_candidates
    return self._model.generate_candidates(text, self.project.analyzer)
  File "/Annif/annif/lexical/mllm.py", line 156, in generate_candidates
    return generate_candidates(text, analyzer, self._vectorizer, self._index)
  File "/Annif/annif/lexical/mllm.py", line 87, in generate_candidates
    sent_tokens = vectorizer.transform(sentences)
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1434, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1276, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 112, in _analyze
    doc = tokenizer(doc)
  File "/Annif/annif/analyzer/analyzer.py", line 56, in tokenize_words
    raise OperationFailedException(
annif.exception.OperationFailedException: Error in tokenization of text 'sivuaineopintojen suhteen eroa havaittiin vain kahden sivuaineryhmän välilä.'
```
So "eroa" was the word for which the error occurred.
After this the error occurred for many more words (following are "niitä", "tukea"), could be for any word that is not in LRU cache.
Edit: More logs Annif-logs-2023-11-07 12 13 40.txt
After manually restarting Annif the words that previously triggered the error no longer do so.
From the above error log, apparently this was the text to tokenize:
'sivuaineopintojen suhteen eroa havaittiin vain kahden sivuaineryhmän välilä.'
What's the Unicode character near the end?
> From the above error log, apparently this was the text to tokenize:
> 'sivuaineopintojen suhteen eroa havaittiin vain kahden sivuaineryhmän välilä.'
> What's the Unicode character near the end?
That seems to be U+10A6BF (a code point in the supplementary private-use plane, so not any normal printable character), but it is not the one the error message mentions (U+3E3FFE4E). Inputting the word or the full text above via the web UI does not re-trigger the error.
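Such characters are easy to spot with a small scan. A sketch, assuming the offending character was U+10A6BF appended to the end of the sentence as found above:

```python
import unicodedata

# The \U0010a6bf at the end stands in for the invisible character
# observed in the problem document.
text = ("sivuaineopintojen suhteen eroa havaittiin vain kahden "
        "sivuaineryhmän välilä.\U0010a6bf")

# Print every code point outside printable ASCII, with its Unicode name
# (unassigned code points like U+10A6BF have no name).
for i, ch in enumerate(text):
    if not 0x20 <= ord(ch) <= 0x7E:
        print(f"offset {i}: U+{ord(ch):04X} ({unicodedata.name(ch, 'unnamed')})")
```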
I'm starting to suspect the error has something to do with how Annif is started. Anyway, I switched to sending suggest requests for the Jyväskylä theses without fuzzing them.
My hunch is that by the time this error happens, something has already gone wrong somewhere, likely inside Voikko. Maybe some internal cache or other data structure has been corrupted? The word that triggers the error (here "eroa") is pretty much irrelevant; any word that bypasses the LRU cache and goes to Voikko could trigger it. The problematic characters are in the response from Voikko; the ValueError is raised when that response crosses the boundary from the C extension into Python.
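This would also explain why a restart makes a previously failing word succeed: normalization results are memoized per process, so after a restart the word is re-analyzed by a fresh, uncorrupted Voikko instance. A simplified sketch of the caching pattern involved (names follow the traceback; the cache size is illustrative, and this is not the exact Annif code):

```python
import functools
from voikko import libvoikko

voikko = libvoikko.Voikko("fi")

@functools.lru_cache(maxsize=100000)
def normalize_word(word: str) -> str:
    # Only cache misses reach libvoikko; a corrupted Voikko instance will
    # therefore fail exactly on words that are not yet in the cache.
    analyses = voikko.analyze(word)  # may raise ValueError if Voikko
                                     # returns out-of-range UCS-4 values
    if analyses and "BASEFORM" in analyses[0]:
        return analyses[0]["BASEFORM"]
    return word
```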
Here is a somewhat similar situation involving locale-aware collation. In that case it's the BSD / OS X standard library that produces a string with illegal Unicode characters, and Python cannot handle them.
The restart of the Annif container that just occurred (the pod remained running) is possibly related to this issue. The logs show (oldest first):

```
2023-12-05 11:45:52 corrupted size vs. prev_size
2023-12-05 11:45:54 [1] [ERROR] Worker (pid:7) was sent code 134!
```
komodor.com says this about the exit code:

> Exit Code 134: Abnormal Termination (SIGABRT)
> Exit Code 134 means that the container abnormally terminated itself, closed the process and flushed open streams. This operation is irreversible, like SIGKILL (see Exit Code 137 below). A process can trigger SIGABRT by doing one of the following:
> - Calling the abort() function in the libc library
> - Calling the assert() macro, used for debugging. The process is then aborted if the assertion is false.
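This matches the earlier FinnishVfstAnalyzer.cpp assertion failure: a failed C assert() calls abort(), the process dies with SIGABRT (signal 6), and the supervisor reports exit code 128 + 6 = 134. The exit-code arithmetic is easy to verify with any process that aborts itself, for example:

```python
import subprocess, sys

# Run a child process that kills itself with SIGABRT, as a failed
# C assert() would, and inspect what the parent observes.
proc = subprocess.run([sys.executable, "-c", "import os; os.abort()"])
print(proc.returncode)        # -6: the child was killed by signal 6 (SIGABRT)
print(128 - proc.returncode)  # 134: the code gunicorn/Kubernetes reports
```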
Once again there was a sudden restart of the Annif container at Finto AI, this time with a slightly different error message:

```
2023-12-14 16:41:05 free(): invalid next size (fast)
2023-12-14 16:41:05 [ERROR] Worker (pid:7) was sent code 134!
```

Edit: The restart occurred about 4 hours after the previous boot (which was for the update to v1.0.1).
Many months passed without problems, but now (about 5 hours after a cluster reboot) the Annif container crashed with a new message:

```
2024-02-27 12:00:47 INFO:annif:finished initializing projects
2024-02-27 17:14:40 double free or corruption (!prev)
2024-02-27 17:17:46 [2024-02-27 17:17:43 +0200] [1] [ERROR] Worker (pid:7) was sent code 134!
```
Again, about 12 hours after a cluster reboot (oldest line first):

```
2024-03-25 16:43:10 INFO:annif:finished initializing projects
2024-03-26 07:00:13 corrupted size vs. prev_size
2024-03-26 07:03:07 [2024-03-26 07:02:56 +0200] [1] [ERROR] Worker (pid:7) was sent code
2024-03-26 07:03:07 [2024-03-26 07:02:56 +0200] [49] [INFO] Booting worker with pid: 49
```