cyrusimap/cyrus-imapd

squatter fatal error, assertion failed

Closed this issue · 13 comments

When testing a new server, I've hit the same bug as in issue #4336, but as there was no real solution, I started debugging.

After many hours digging through the code and adding debugging fprintf's, I found the reason for the failed assertion.

The error happens when the existing squat index is read and then used to build the new index. Specifically, when doc_ID_map_lookup() is called with a docID that is equal to doc_ID_map->max, doc_ID_map_lookup() returns the data after the last initialised member of the array. This memory spot is properly allocated, but contains random data.

This patch fixes the issue:

diff -up cyrus-imapd-3.4.1/imap/squat_build.c.maplookup cyrus-imapd-3.4.1/imap/squat_build.c
--- cyrus-imapd-3.4.1/imap/squat_build.c.maplookup	2021-05-05 05:21:59.000000000 +0200
+++ cyrus-imapd-3.4.1/imap/squat_build.c	2023-09-21 14:17:12.851281278 +0200
@@ -355,7 +355,7 @@ static void doc_ID_map_add(struct doc_ID
             xrealloc(doc_ID_map->map, doc_ID_map->alloc * sizeof(int));
     }
     if (exists) {
-        doc_ID_map->map[doc_ID_map->max++] = doc_ID_map->new++;
+        doc_ID_map->map[doc_ID_map->max++] = ++doc_ID_map->new;
     } else {
         doc_ID_map->map[doc_ID_map->max++] = 0; /* Does not exist in new index */
     }
@@ -366,7 +366,7 @@ static int doc_ID_map_lookup(struct doc_
     if ((docID < 1) || (docID > doc_ID_map->max))
         return (0);
 
-    return (doc_ID_map->map[docID]);
+    return (doc_ID_map->map[--docID]);
 }
 
 static int copy_docIDs(void *closure, SquatListDoc const *doc)

[edited to wrap patch in ```'s -- ellie]

@rsto This sounds promising. Do you know squat well enough to tell whether the patch looks right? I don't know the data model so I can't tell whether the pre/post-increment/decrement changes are good just by looking at the patch in isolation. I have not yet applied the patch to look at it in context.

It would be nice to have a regression test that proves the fault and the fix. Sounds like part of the trick will be running squatter twice on the same mailbox, though I'm not sure how to force "docID that is equal to doc_ID_map->max", because I don't know what docIDs are, nor how or why they're mapped. Does this mean more to you than it does to me?

rsto commented

Does this mean more to you than it does to me?

It does not really. I'm fine with this change as long as it's contained in the squat index code.

@RobertVogelgesang,

when doc_ID_map_lookup() is called with a docID that is equal to doc_ID_map->max, doc_ID_map_lookup() returns the data after the last initialised member of the array

Do you know of a way to reliably provoke this situation? I'm trying to put together a regression test to prove the fix, but have not yet managed to reproduce the problem.

brong commented

It looks like this might have been introduced in 2e0503c

I've been real deep down this rabbit hole...

I can now reproduce the assertion failure on demand. It requires two things: a mailbox with enough messages that its doc_ID_map needs to realloc, and a message in that mailbox with no subject/from/to headers. For the latter, I'm reproducing it with a message that's missing all three. I have not yet tested the missing headers individually or in other combinations to find which exact combinations trigger it.

Findings:

  • The commit @brong identified in the previous comment looks innocent
  • For some as-yet-undetermined reason, in some as-yet-undetermined way, messages missing those three headers don't get indexed properly by an initial indexing
  • An incremental reindex detects them as not-yet-indexed, and tries to reindex them
  • The doc_ID_map logic is bogus in a few ways. One of which is that doc_ID_map_lookup returns uninitialised junk when docID == doc_ID_map->max, as @RobertVogelgesang observed
  • The incremental reindex asserts on the bad mapping when trying to reindex the bad message. The patch in OP avoids the assert, but doesn't fix the bigger problem, that docIDs start at 1, not zero
  • Turns out there has also been a bug present since 2.5.0 that all squat reindexes are incremental due to mishandling of a flags variable
  • Since there's no way to do a non-incremental reindex, these messages always cause problems. They only don't during an initial indexing (which is why removing the index files and indexing from scratch seems to work, but the problem then returns).
  • Prior to 2.5, one could avoid all this by just never doing incrementals, which might be why this wasn't detected back when squat was the main search backend

So, stuff to fix:

  • fix the bug that made all reindexes incremental (done, with a test)
  • fix the doc_ID_map properly (not done yet, because the assertions seem useful for tracking down the real cause)
  • figure out why messages with no subject/from/to don't index properly in the first place

If I can't figure out and fix the initial indexing of messages with no subject/from/to, then at least fixing the other issues will make it less of a problem.

Hello, I have the same problem. Most folders where squatter fails are Drafts or Trash (and their translated versions), so it is highly probable that they contain messages without From/To/Subject headers. I'm not sure I uderstand correctly the situation after the above fixes. Is it that these still fail on incremental squat, but should be successful at least during full squat run?
Vladki

With the fixes in #4678, the affected mailboxes should be successful on all squatter runs.

Without the fixes, the affected mailboxes will fail on incremental reindexes, and without the fixes all reindexes are incremental. Which is to say, as far as I can tell, the only way to successfully index affected mailboxes without the fixes is to delete their old indexes and index them again from scratch. (Or: identify and remove the messages that are causing the problem.)

The cause is not specifically missing From/To/Subject headers, but that is a reliable way to provoke it. The actual cause is messages where the first possibly-indexable header is one that the squat backend doesn't want to index (Content-Type is one such, there might be others). As long as something like From/To/Subject occurs before Content-Type, which is usually the case, then there isn't a problem. But if it sees a Content-Type when it hasn't already seen a From, To, or Subject, then it chokes.

What version are you running? I still need to backport the fixes to the older branches, it's currently only fixed on master. I can prioritise backporting for the version you're using if you want to build from the git branch to get the fixes early. Otherwise, I'll do the backporting before the next releases, but not necessarily right away.

I'm using Debian bookworm: 3.6.1-4+deb12u1

More info - this happens irregularly. Now I run squatter in a per user loop, so that if one fails, the others are still indexed, and I see the same mailboxes failing often, but not every day. Most often it fails on Drafts/Trash/Sent (and their localized variants)

Hello, is there any chance to see these fixes backported to 3.6.x ? May I help somehow? I.e. by trying to index problematic folders with patched squatter?

@vladki77 Thanks for the reminder. I intend to backport the fixes to 3.6, but something else came up and then I forgot; sorry! I'll try to get it done ASAP

Oh, I have not noticed that the fix is backported... Now we are running fixed version, and sqautter is finally not failing. Thank you very much.