NatLibFi/Annif

Error by Voikko analyzer: `ValueError: character U+fb50067 is not in range [U+0000; U+10ffff]`

juhoinkinen opened this issue · 13 comments

Word normalization by Voikko has failed on at least two occasions when using Annif v1.0 at Finto AI, resulting in a 500 error response to the suggest request. On both occasions the error was reproducible by sending a suggest request containing just the triggering word. Once noticed, the situation was fixed by restarting Annif, after which the word no longer produced the error.
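For reference, reproducing amounts to a single suggest request for the triggering word. A minimal sketch using the requests library (the host and the placeholder word are illustrative; the endpoint path is the one seen in the logs below):

import requests

# Minimal sketch: send a suggest request containing only the triggering word.
# "WORD" is a placeholder for whatever word triggered the failure.
resp = requests.post(
    "https://ai.finto.fi/v1/projects/yso-fi/suggest",
    data={"text": "WORD", "limit": 10},
)
print(resp.status_code)  # 500 while the bug is active, 200 again after a restart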

The logs show the following error (the character code has differed between occasions):

ERROR:annif:Exception on /v1/projects/yso-fi/suggest [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/decorator.py", line 68, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/validation.py", line 196, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/validation.py", line 399, in wrapper
    return function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/parameter.py", line 120, in wrapper
    return function(**kwargs)
  File "/Annif/annif/rest.py", line 122, in suggest
    result = _suggest(project_id, documents, parameters)
  File "/Annif/annif/rest.py", line 174, in _suggest
    return _hit_sets_to_list(hit_sets, project.subjects, lang)
  File "/Annif/annif/rest.py", line 99, in _hit_sets_to_list
    return [
  File "/Annif/annif/rest.py", line 99, in <listcomp>
    return [
  File "/Annif/annif/suggestion.py", line 159, in <genexpr>
    (batch.filter(limit, threshold) for batch in self.batches)
  File "/Annif/annif/project.py", line 234, in <genexpr>
    self.suggest([doc.text for doc in doc_batch], backend_params)
  File "/Annif/annif/project.py", line 253, in suggest
    return self._suggest_with_backend(texts, backend_params)
  File "/Annif/annif/project.py", line 143, in _suggest_with_backend
    return self.backend.suggest(texts, beparams)
  File "/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(texts, params=beparams)
  File "/Annif/annif/backend/ensemble.py", line 71, in _suggest_batch
    batch_by_source = self._suggest_with_sources(texts, sources)
  File "/Annif/annif/backend/ensemble.py", line 45, in _suggest_with_sources
    return {
  File "/Annif/annif/backend/ensemble.py", line 46, in <dictcomp>
    project_id: self.project.registry.get_project(project_id).suggest(texts)
  File "/Annif/annif/project.py", line 253, in suggest
    return self._suggest_with_backend(texts, backend_params)
  File "/Annif/annif/project.py", line 143, in _suggest_with_backend
    return self.backend.suggest(texts, beparams)
  File "/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(texts, params=beparams)
  File "/Annif/annif/backend/backend.py", line 129, in _suggest_batch
    [self._suggest(text, params) for text in texts],
  File "/Annif/annif/backend/backend.py", line 129, in <listcomp>
    [self._suggest(text, params) for text in texts],
  File "/Annif/annif/backend/mllm.py", line 166, in _suggest
    candidates = self._generate_candidates(text)
  File "/Annif/annif/backend/mllm.py", line 153, in _generate_candidates
    return self._model.generate_candidates(text, self.project.analyzer)
  File "/Annif/annif/lexical/mllm.py", line 156, in generate_candidates
    return generate_candidates(text, analyzer, self._vectorizer, self._index)
  File "/Annif/annif/lexical/mllm.py", line 87, in generate_candidates
    sent_tokens = vectorizer.transform(sentences)
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1434, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1276, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 112, in _analyze
    doc = tokenizer(doc)
  File "/Annif/annif/analyzer/analyzer.py", line 47, in tokenize_words
    return [
  File "/Annif/annif/analyzer/analyzer.py", line 48, in <listcomp>
    self._normalize_word(word)
  File "/Annif/annif/analyzer/voikko.py", line 30, in _normalize_word
    result = self.voikko.analyze(word)
  File "/usr/local/lib/python3.10/site-packages/voikko/libvoikko.py", line 682, in analyze
    value = self.__lib.voikko_mor_analysis_value_ucs4(cAnalysis, key)
ValueError: character U+fb50067 is not in range [U+0000; U+10ffff]

I'll make a PR to help analyze the bug in more detail.

osma commented

What other character codes besides U+fb50067 have we seen? Could these be used to reverse engineer some possible explanation? For example, if a UTF-8 sequence somehow got cut in the middle, could that cause this?

The other character code was U+fb5005d (the case reported by Anis), which arose from this document.

osma commented

Both U+fb50067 and U+fb5005d are way beyond the currently valid range of Unicode characters (which ends at U+10FFFF), and it shouldn't even be possible to represent them in the current UTF-8 encoding. However, the obsolete RFC 2279 does define UTF-8 sequences for a much higher range (up to U+7FFFFFFF!), so it is at least theoretically possible that the input contains some byte sequence that, when decoded by an RFC 2279 based UTF-8 decoder, produces these characters in the UCS-4 (four bytes per character) representation that Voikko appears to use internally (I'm basing this guess on the method name seen in the traceback, voikko_mor_analysis_value_ucs4).
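To illustrate the theory, here is a hypothetical decoder following the obsolete RFC 2279 rules (not Voikko's actual code; modern UTF-8 stops at four bytes and U+10FFFF):

def decode_rfc2279(seq: bytes) -> int:
    """Decode one multi-byte sequence under the obsolete RFC 2279 rules."""
    first = seq[0]
    length = 8 - (first ^ 0xFF).bit_length()  # number of leading 1-bits
    assert 2 <= length == len(seq) <= 6
    cp = first & (0x7F >> length)             # payload bits of the lead byte
    for byte in seq[1:]:
        assert byte & 0xC0 == 0x80, "not a continuation byte"
        cp = (cp << 6) | (byte & 0x3F)
    return cp

# A six-byte sequence that modern UTF-8 rejects outright:
seq = bytes([0xFC, 0x8F, 0xAD, 0x90, 0x81, 0xA7])
print(hex(decode_rfc2279(seq)))  # 0xfb50067 -- the very codepoint from the error
try:
    seq.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # Python's UTF-8 decoder refuses the 0xFC lead byte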

A different but possibly related error, which made Annif restart by itself, occurred at Finto AI (not the test instance):

python: morphology/FinnishVfstAnalyzer.cpp:166: void libvoikko::morphology::decreaseCharsMissing(size_t&, size_t, size_t): Assertion `false' failed.
[ERROR] Worker (pid:85) was sent code 134!

Now the ValueError was triggered at the test instance ai.dev.finto.fi, after about one day of continuous suggest requests using fuzzed text contents of Jyväskylä theses.
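The fuzzing was roughly along these lines (a sketch, not the actual test script; the corruption strategy and text source here are assumptions):

import random
import requests

texts = ["sivuaineopintojen suhteen eroa havaittiin vain kahden sivuaineryhmän välillä."]  # thesis texts

def fuzz(text: str, n_changes: int = 5) -> str:
    """Corrupt a few random characters, avoiding unencodable surrogates."""
    chars = list(text)
    for _ in range(n_changes):
        cp = random.randrange(0x110000)
        while 0xD800 <= cp <= 0xDFFF:  # surrogates cannot be UTF-8 encoded
            cp = random.randrange(0x110000)
        chars[random.randrange(len(chars))] = chr(cp)
    return "".join(chars)

while True:
    for text in texts:
        requests.post(
            "https://ai.dev.finto.fi/v1/projects/yso-fi/suggest",
            data={"text": fuzz(text)},
        )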

At 10:57 the logs have these lines and then Annif restarts (oldest line at the bottom):

[2023-11-07 10:56:59 +0200] [47] [INFO] Booting worker with pid: 47
[2023-11-07 10:56:59 +0200] [1] [ERROR] Worker (pid:7) was sent code 134!
corrupted size vs. prev_size

I remember seeing corrupted size vs. prev_size before, and I think back then it did not cause any further errors.

Then just after Annif restarted:

INFO:annif:finished initializing projects
ERROR:annif:Exception on /v1/projects/yso-fi/suggest [POST]
Traceback (most recent call last):
  File "/Annif/annif/analyzer/voikko.py", line 33, in _normalize_word
    result = self.voikko.analyze(word)
  File "/usr/local/lib/python3.10/site-packages/voikko/libvoikko.py", line 682, in analyze
    value = self.__lib.voikko_mor_analysis_value_ucs4(cAnalysis, key)
ValueError: character U+3e3ffe4e is not in range [U+0000; U+10ffff]

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Annif/annif/analyzer/analyzer.py", line 50, in tokenize_words
    return [
  File "/Annif/annif/analyzer/analyzer.py", line 51, in <listcomp>
    self._normalize_word(word)
  File "/Annif/annif/analyzer/voikko.py", line 35, in _normalize_word
    raise OperationFailedException(
annif.exception.OperationFailedException: Voikko error in analysis of word 'eroa'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 2529, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1825, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.10/site-packages/flask_cors/extension.py", line 176, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1823, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.10/site-packages/flask/app.py", line 1799, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/decorator.py", line 68, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/uri_parsing.py", line 149, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/validation.py", line 196, in wrapper
    response = function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/validation.py", line 399, in wrapper
    return function(request)
  File "/usr/local/lib/python3.10/site-packages/connexion/decorators/parameter.py", line 120, in wrapper
    return function(**kwargs)
  File "/Annif/annif/rest.py", line 122, in suggest
    result = _suggest(project_id, documents, parameters)
  File "/Annif/annif/rest.py", line 174, in _suggest
    return _hit_sets_to_list(hit_sets, project.subjects, lang)
  File "/Annif/annif/rest.py", line 99, in _hit_sets_to_list
    return [
  File "/Annif/annif/rest.py", line 99, in <listcomp>
    return [
  File "/Annif/annif/suggestion.py", line 159, in <genexpr>
    (batch.filter(limit, threshold) for batch in self.batches)
  File "/Annif/annif/project.py", line 234, in <genexpr>
    self.suggest([doc.text for doc in doc_batch], backend_params)
  File "/Annif/annif/project.py", line 253, in suggest
    return self._suggest_with_backend(texts, backend_params)
  File "/Annif/annif/project.py", line 143, in _suggest_with_backend
    return self.backend.suggest(texts, beparams)
  File "/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(texts, params=beparams)
  File "/Annif/annif/backend/ensemble.py", line 71, in _suggest_batch
    batch_by_source = self._suggest_with_sources(texts, sources)
  File "/Annif/annif/backend/ensemble.py", line 45, in _suggest_with_sources
    return {
  File "/Annif/annif/backend/ensemble.py", line 46, in <dictcomp>
    project_id: self.project.registry.get_project(project_id).suggest(texts)
  File "/Annif/annif/project.py", line 253, in suggest
    return self._suggest_with_backend(texts, backend_params)
  File "/Annif/annif/project.py", line 143, in _suggest_with_backend
    return self.backend.suggest(texts, beparams)
  File "/Annif/annif/backend/backend.py", line 143, in suggest
    return self._suggest_batch(texts, params=beparams)
  File "/Annif/annif/backend/backend.py", line 129, in _suggest_batch
    [self._suggest(text, params) for text in texts],
  File "/Annif/annif/backend/backend.py", line 129, in <listcomp>
    [self._suggest(text, params) for text in texts],
  File "/Annif/annif/backend/mllm.py", line 166, in _suggest
    candidates = self._generate_candidates(text)
  File "/Annif/annif/backend/mllm.py", line 153, in _generate_candidates
    return self._model.generate_candidates(text, self.project.analyzer)
  File "/Annif/annif/lexical/mllm.py", line 156, in generate_candidates
    return generate_candidates(text, analyzer, self._vectorizer, self._index)
  File "/Annif/annif/lexical/mllm.py", line 87, in generate_candidates
    sent_tokens = vectorizer.transform(sentences)
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1434, in transform
    _, X = self._count_vocab(raw_documents, fixed_vocab=True)
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 1276, in _count_vocab
    for feature in analyze(doc):
  File "/usr/local/lib/python3.10/site-packages/sklearn/feature_extraction/text.py", line 112, in _analyze
    doc = tokenizer(doc)
  File "/Annif/annif/analyzer/analyzer.py", line 56, in tokenize_words
    raise OperationFailedException(
annif.exception.OperationFailedException: Error in tokenization of text 'sivuaineopintojen suhteen eroa havaittiin vain kahden sivuaineryhmän välil􊚿ä.'

So "eroa" was the word for which the error occurred.

After this the error occurred for many more words ("niitä" and "tukea" followed); presumably it could occur for any word that is not in the LRU cache.

Edit: more logs in Annif-logs-2023-11-07 12 13 40.txt

After manually restarting Annif, the words that previously triggered the error no longer do so.

osma commented

From the above error log, apparently this was the text to tokenize:

'sivuaineopintojen suhteen eroa havaittiin vain kahden sivuaineryhmän välil􊚿ä.'

What's the Unicode character near the end?

That seems to be U+10A6BF, which falls in the Supplementary Private Use Area-B, so it is within the valid Unicode range (though no standard character is assigned to it), and it is not the character the error message mentions (U+3e3ffe4e). Inputting 􊚿 or the full text above via the web UI does not re-trigger the error.
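This can be checked quickly in Python:

text = 'sivuaineopintojen suhteen eroa havaittiin vain kahden sivuaineryhmän välil􊚿ä.'
print([hex(ord(c)) for c in text if ord(c) > 0xFFFF])  # prints ['0x10a6bf']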

I am starting to suspect that the error has something to do with how Annif is started. Anyway, I switched to sending suggest requests for the Jyväskylä theses without fuzzing them.

osma commented

My hunch is that at the point this error happens, something has already gone wrong somewhere, likely inside Voikko. Maybe there is some internal cache or other data structure that has been corrupted? The word that triggers the error (here eroa) is pretty much irrelevant; any word that bypasses the LRU cache and goes to Voikko could trigger it. The problematic characters are in the response from Voikko; when that response passes the boundary between the C extension and Python, the ValueError is triggered.
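A sketch of that boundary, with names taken from the tracebacks above (the cache size and surrounding details are assumptions, not Annif's actual code):

import functools
from voikko import libvoikko

class VoikkoAnalyzer:
    """Sketch of the analyzer whose frames appear in the tracebacks."""

    def __init__(self, lang: str):
        self.voikko = libvoikko.Voikko(lang)

    @functools.lru_cache(maxsize=100000)  # illustrative size
    def _normalize_word(self, word: str) -> str:
        # Only cache misses reach the C extension. Once Voikko's internal
        # state is corrupted, any uncached word can surface the ValueError
        # when the UCS-4 result is converted back into a Python string.
        result = self.voikko.analyze(word)
        if result and "BASEFORM" in result[0]:
            return result[0]["BASEFORM"]
        return word

A restart replaces both the corrupted Voikko instance and the cache, which would explain why previously failing words stop failing afterwards.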

Here is a somewhat similar situation involving locale-aware collation. In that case it's the BSD / OS X standard library that produces a string with illegal Unicode characters, and Python cannot handle them.

The restart of the Annif container that has just occurred (the pod remained running) is possibly related to this issue. The logs show:

2023-12-05 11:45:54	[1] [ERROR] Worker (pid:7) was sent code 134!
2023-12-05 11:45:52	corrupted size vs. prev_size

komodor.com says this about the exit code:

Exit Code 134: Abnormal Termination (SIGABRT)

Exit Code 134 means that the container abnormally terminated itself, closed the process and flushed open streams. This operation is irreversible, like SIGKILL (see Exit Code 137 below). A process can trigger SIGABRT by doing one of the following:

Calling the abort() function in the libc library
Calling the assert() macro, used for debugging. The process is then aborted if the assertion is false.
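This fits the messages seen here: corrupted size vs. prev_size and free(): invalid next size are glibc's heap-consistency checks, which call abort() when they detect corruption. A quick demonstration of the exit-code arithmetic (134 = 128 + 6, SIGABRT being signal 6):

import signal
import subprocess

# A child process killed by SIGABRT (what abort()/assert() deliver) is
# reported by shells and container runtimes as exit code 128 + signal number.
proc = subprocess.run(
    ["python3", "-c", "import os, signal; os.kill(os.getpid(), signal.SIGABRT)"]
)
print(proc.returncode)       # -6: Python's convention for death by signal
print(128 + signal.SIGABRT)  # 134, the code seen in the worker logs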

Once again there was a sudden restart of the Finto AI Annif container, this time with a slightly different error message:

2023-12-14 16:41:05 [ERROR] Worker (pid:7) was sent code 134!
2023-12-14 16:41:05 free(): invalid next size (fast)

Edit: the restart occurred about 4 hours after the previous boot (which was done for the update to v1.0.1).

Many months passed without problems, but now (about 5 hours after a cluster reboot) the Annif container crashed with a new message:

2024-02-27 12:00:47	INFO:annif:finished initializing projects
2024-02-27 17:14:40	double free or corruption (!prev)
2024-02-27 17:17:46	[2024-02-27 17:17:43 +0200] [1] [ERROR] Worker (pid:7) was sent code 134!

Again, about 12 hours after a cluster reboot:

2024-03-26 07:03:07	[2024-03-26 07:02:56 +0200] [49] [INFO] Booting worker with pid: 49
2024-03-26 07:03:07	[2024-03-26 07:02:56 +0200] [1] [ERROR] Worker (pid:7) was sent code 
2024-03-26 07:00:13	corrupted size vs. prev_size
2024-03-25 16:43:10	INFO:annif:finished initializing projects