cyrusimap/cyrus-imapd

SOLVED - squatter fatal error, assertion failed - v. 3.6.0-beta3-1+b1

Closed this issue · 48 comments

We use the cyrus version 3.6.0-beta3-1+b1 and I found an issue with squatter.
In some accounts there are a folder, where the squatter dies with fatal error and the whole squatter process cannot move on.

I mean, there are about 40 accounts, called a.... to v....
The squatter starts and can finish about 4 accounts, up to user 'foo' where it run into a folder, I assume it found an email and id dies with the following message:

process type:EVENT name:squatter1 path:/usr/sbin/cyrus age:138.664s pid:31781 exited, status 70

When I issue the command in command line, as cyrus user
/usr/lib/cyrus/bin/squatter -v -p -u foo

I got the following, more specific error message:
Indexing mailbox user/foo/Archives/2013... fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0

As I checked the squatter_internal.c it did not change for years (github).

The problem with this, the periodic squatter just dies in the very beginning and other accounts never will be squattered.

This server populated using recent imapsync from old cyrus server.
The used squatter is: squat (not xapian).

Now I found an email, which exactly cause an assertion failed, I attached it.
In short, with this email the squattering dies with assertion failed.
When I deleted this email and reconstructed with -r -x, the squatter finishing as expected.

469.txt

Ok, something really strange, now it dies one message earlier:
468.

Even, when the 469 was in place, the 468 indexed and assertion happened with 469.

Now I deleted the 469 -> reconstruct with -r -x -> squatter -> assertion failed during indexing 468.
Frankly, I am lost.

Here is the end of the strace output:

write(21, "\201\36\10 LE SP X-* L* S* X** ***\201\37\0050*"..., 10494) = 10494
lseek(21, 0, SEEK_SET)                  = 0
mmap(NULL, 10494, PROT_READ, MAP_SHARED, 21, 0) = 0x7f20d06ca000
munmap(0x7f20d06ca000, 10494)           = 0
close(21)                               = 0
write(22, "\201\36\6 X-+ X++ +++020JS9\201\37\247h +D +R "..., 43559) = 43559
lseek(22, 0, SEEK_SET)                  = 0
mmap(NULL, 43559, PROT_READ, MAP_SHARED, 22, 0) = 0x7f20cba44000
munmap(0x7f20cba44000, 43559)           = 0
close(22)                               = 0
write(23, "\201\36\f 03 04 4  BO DA DK FR PO RC S"..., 17820) = 17820
lseek(23, 0, SEEK_SET)                  = 0
mmap(NULL, 17820, PROT_READ, MAP_SHARED, 23, 0) = 0x7f20d06c8000
munmap(0x7f20d06c8000, 17820)           = 0
close(23)                               = 0
write(24, "\201\36\0370701.999 AGEAUTBARFLAFROID:IP"..., 44271) = 44271
lseek(24, 0, SEEK_SET)                  = 0
mmap(NULL, 44271, PROT_READ, MAP_SHARED, 24, 0) = 0x7f20cba44000
write(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262126) = 262126
munmap(0x7f20cba44000, 44271)           = 0
close(24)                               = 0
write(25, "\201\36() C0 C0.00.10010551, 10010211"..., 74850) = 74850
lseek(25, 0, SEEK_SET)                  = 0
mmap(NULL, 74850, PROT_READ, MAP_SHARED, 25, 0) = 0x7f20cb592000
munmap(0x7f20cb592000, 74850)           = 0
close(25)                               = 0
write(11, "\201\35\1PLA\201\36\0101.31286AFCN9M TMIXQJRSI"..., 47363) = 47363
lseek(11, 0, SEEK_SET)                  = 0
mmap(NULL, 47363, PROT_READ, MAP_SHARED, 11, 0) = 0x7f20cba43000
**write(2, "fatal error: Internal error: ass"..., 84) = 84**
close(3)                                = 0
munmap(0x7f20cba5f000, 13082624)        = 0
exit_group(70)                          = ?
+++ exited with 70 +++

I checked, the 469 was the last message in that directory.
After I deleted the 469, the last one is 468 now.
There are no more messages.

So, it seems in this case squatter dies with the last one?

By the way, I stopped cyrus-imapd during my squatter attempts.
This is a cloned system of a production (zfs clone + test lxc).
It seems, after I started the cyrus-imapd (squatter in rolling mode now), the manually started squatter process was able to finish squattering one problematic user, same as I reported above.

So, I started again the iterative manual squatter -> check problems using strace log -> delete etc. problematic email -> reconstruct -> squatter again process.

Hmmm, I am reading the strace log, I afraid, I misunderstood, what happens.

It seems, squatter opened the last file, closed it and after that it is processing something and the assertion happens later, than open the last message.

openat(AT_FDCWD, "/var/spool/cyrus/mail/uuid/j/9/j9x6ytfm3ain3r60yd9zcemw/467.", O_RDONLY) = 89
newfstatat(89, "", {st_mode=S_IFREG|0600, st_size=76444, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 76444, PROT_READ, MAP_SHARED, 89, 0) = 0x7f45ac43e000
close(89)                               = 0
munmap(0x7f45ac43e000, 76444)           = 0
openat(AT_FDCWD, "/var/spool/cyrus/mail/uuid/j/9/j9x6ytfm3ain3r60yd9zcemw/468.", O_RDONLY) = 89     <----- Here the last message
newfstatat(89, "", {st_mode=S_IFREG|0600, st_size=76431, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 76431, PROT_READ, MAP_SHARED, 89, 0) = 0x7f45ac43e000
close(89)                               = 0                                 <----- Here closed the last message, processing, something happens below
munmap(0x7f45ac43e000, 76431)           = 0
write(17, "RECREPRETSENSERSO,SORSUSSYSTEMTE"..., 52875) = 52875
lseek(17, 0, SEEK_SET)                  = 0
mmap(NULL, 315016, PROT_READ, MAP_SHARED, 17, 0) = 0x7f45ac404000
munmap(0x7f45ac404000, 315016)          = 0
close(17)                               = 0
write(71, "\201'\1\". \201(\1\". \2011\1\". \2012\1\". \201;\1\". \201<"..., 4113) = 4113
lseek(71, 0, SEEK_SET)                  = 0
mmap(NULL, 4113, PROT_READ, MAP_SHARED, 71, 0) = 0x7f45b1576000
munmap(0x7f45b1576000, 4113)            = 0
close(71)                               = 0
write(18, "\201\36\4 DA) H=_3INV\201&\6 <E X----; BMU"..., 10245) = 10245
lseek(18, 0, SEEK_SET)                  = 0
mmap(NULL, 10245, PROT_READ, MAP_SHARED, 18, 0) = 0x7f45b1575000
munmap(0x7f45b1575000, 10245)           = 0
close(18)                               = 0
write(88, "\236y\1FFF", 6)              = 6
lseek(88, 0, SEEK_SET)                  = 0
mmap(NULL, 6, PROT_READ, MAP_SHARED, 88, 0) = 0x7f45b1577000
munmap(0x7f45b1577000, 6)               = 0
close(88)                               = 0
write(74, "\205e\1MY.\205m\1MY.\206\r\1MY.\2065\1MY.\240&\1 15\240'"..., 36) = 36
lseek(74, 0, SEEK_SET)                  = 0
mmap(NULL, 36, PROT_READ, MAP_SHARED, 74, 0) = 0x7f45b1577000
munmap(0x7f45b1577000, 36)              = 0
close(74)                               = 0
write(66, "\201\37\t20220L20P20_A1NA9LA9TB3SC3%\201'"..., 2466) = 2466
lseek(66, 0, SEEK_SET)                  = 0
mmap(NULL, 2466, PROT_READ, MAP_SHARED, 66, 0) = 0x7f45b1577000
munmap(0x7f45b1577000, 2466)            = 0
close(66)                               = 0
write(82, "\2131\1 LA\2135\4 LAC=8L=1N=4\2137\1 LA\213;\4 L"..., 66) = 66
lseek(82, 0, SEEK_SET)                  = 0
mmap(NULL, 66, PROT_READ, MAP_SHARED, 82, 0) = 0x7f45b1577000
munmap(0x7f45b1577000, 66)              = 0
close(82)                               = 0
write(67, "\201\37\2'TRTRA\206P\2'TRTRA\217_\2M SS A\217g\2M "..., 1872) = 1872
lseek(67, 0, SEEK_SET)                  = 0
mmap(NULL, 1872, PROT_READ, MAP_SHARED, 67, 0) = 0x7f45b1577000
munmap(0x7f45b1577000, 1872)            = 0
close(67)                               = 0
write(19, "\201\36\16\"IN128AMACESCLECYREVEISPLOCNO"..., 17502) = 17502
lseek(19, 0, SEEK_SET)                  = 0
mmap(NULL, 17502, PROT_READ, MAP_SHARED, 19, 0) = 0x7f45b1573000
munmap(0x7f45b1573000, 17502)           = 0
close(19)                               = 0
write(20, "\201\36\v (A (N (U BY CL HE RE SM WI X"..., 13233) = 13233
lseek(20, 0, SEEK_SET)                  = 0
mmap(NULL, 13233, PROT_READ, MAP_SHARED, 20, 0) = 0x7f45b1574000
munmap(0x7f45b1574000, 13233)           = 0
close(20)                               = 0
write(21, "\201\36\10 LE SP X-* L* S* X** ***\201\37\0050*"..., 10494) = 10494
lseek(21, 0, SEEK_SET)                  = 0
mmap(NULL, 10494, PROT_READ, MAP_SHARED, 21, 0) = 0x7f45b1575000
munmap(0x7f45b1575000, 10494)           = 0
close(21)                               = 0
write(22, "\201\36\6 X-+ X++ +++020JS9\201\37\247h +D +R "..., 43559) = 43559
lseek(22, 0, SEEK_SET)                  = 0
mmap(NULL, 43559, PROT_READ, MAP_SHARED, 22, 0) = 0x7f45ac8ef000
munmap(0x7f45ac8ef000, 43559)           = 0
close(22)                               = 0
write(23, "\201\36\f 03 04 4  BO DA DK FR PO RC S"..., 17820) = 17820
lseek(23, 0, SEEK_SET)                  = 0
mmap(NULL, 17820, PROT_READ, MAP_SHARED, 23, 0) = 0x7f45b1573000
munmap(0x7f45b1573000, 17820)           = 0
close(23)                               = 0
write(24, "\201\36\0370701.999 AGEAUTBARFLAFROID:IP"..., 44271) = 44271
lseek(24, 0, SEEK_SET)                  = 0
mmap(NULL, 44271, PROT_READ, MAP_SHARED, 24, 0) = 0x7f45ac8ef000
write(9, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 262126) = 262126
munmap(0x7f45ac8ef000, 44271)           = 0
close(24)                               = 0
write(25, "\201\36() C0 C0.00.10010551, 10010211"..., 74850) = 74850
lseek(25, 0, SEEK_SET)                  = 0
mmap(NULL, 74850, PROT_READ, MAP_SHARED, 25, 0) = 0x7f45ac43e000
munmap(0x7f45ac43e000, 74850)           = 0
close(25)                               = 0
write(11, "\201\35\1PLA\201\36\0101.31286AFCN9M TMIXQJRSI"..., 47363) = 47363
lseek(11, 0, SEEK_SET)                  = 0
mmap(NULL, 47363, PROT_READ, MAP_SHARED, 11, 0) = 0x7f45ac8ee000
write(2, "fatal error: Internal error: ass"..., 84) = 84                       <---------- Before this happened something
close(3)                                = 0
munmap(0x7f45ac90a000, 13082624)        = 0
exit_group(70)                          = ?
+++ exited with 70 +++

Ok, I have no idea, again.

rsto commented

I could not reproduce the assert failure with the email that you provided. Would you have the option to run a custom build of cyrus for debugging? Or do you know how to debug using gdb? There is just one function, dump_doc_list_docs, that calls the aborting function. I would like to debug its state when the assertion fails.

I could not reproduce the assert failure with the email that you provided.

Correct, after I opened the ticket and uploaded that email (469), I run into that situation, the assertion happened after the previous email (468).
The common with the attached (469) and the previous email (468) is: they were the last one in the folder when assertion happened. When I deleted the 469, on the next run the 468 become the last and the assertion happened after the last email processed (opened - closed).
So, that was a false lead.

Would you have the option to run a custom build of cyrus for debugging? Or do you know how to debug using gdb?
There is just one function, dump_doc_list_docs, that calls the aborting function. I would like to debug its state when the assertion fails.

It is possible, I already downloaded the source code (with apt source ...) and I checked that.
I found that, the assertion, where squatter died, is in the function squat_count_encode_I and it called it from dump_doc_list_docs as you mentioned.

I forgot, how did I use gdb, I guess I will figure out as last time :)

If you have any idea/patch, how to modify the sourcecode, just let me know, I try to do my best.

Anyway, if that helps you: when I run the squatter with more than 2 '-v', there will be no assertion issue.
like this one:

/usr/lib/cyrus/bin/squatter   -vvvvvvvvv  -p -u username

I did not checked the code, what is different in this case.

rsto commented

As a first measure, let's please run squatter with valgrind. That is:

  • install the valgrind executable if your system doesn't already include it. Almost any Linux package manager will already provide. Otherwise, you can get it from here: https://valgrind.org/
  • run squatter as you did before, but prepend the command line with valgrind, e.g. valgrind --leak-check=yes /usr/lib/cyrus/bin/squatter -v -p -u foo

This will report if there is any memory corruption going on.

errge commented

As this happened for me today, I wanted to provide a bit of input too:

  • I'm running 3.0.8-6+deb10u6 (yes, I know, I will upgrade)
  • The error message is exactly the same
  • It happens at the end of a mailbox
  • Removing the last message doesn't help
  • Strace is similar
  • only happens if using -i, full reindexing is fine (so I can just opt to do daily full reindex for now)
  • if the indexing stops at user X, and I do a run with -u X filter, the indexing finishes for that inbox, but the unfiltered all-user -i reindex asserts there again
  • but I also found a user, where individual -u X indexing fails too with -i, but not without -i
  • -vvvvvvvv does help for me too for an individual user
  • but doesn't help when running the full system with -i (but without any -u)

I also repeated problematic commands with valgrind --leak-check=yes, but no errors were printed apart from the assertion.

I'm a bit busy with the upgrade right now, if I manage to get myself up to 3.6 and the issue still remains, I will install debug symbols and give gdb a try.

I hope my findings are useful for somebody who has an understanding of the internals.

errge commented

Ohh, sorry for the spam, but just one vague idea: can this be somehow related to time? I mean, it's happening for me on a 3.0 branch that was released for years, but issue was not reported before by others, so... Maybe it's a Y2022M11 issue somehow? :)

In our old system, based on clearos 5.2 (maybe cyrus version 2.4.x) had this issue. I do not remember exactly, was it an assertion or not, but failed and I had to run squatter with each usernames explicitely to get squatter working more or less.

We put that old machine into archive, so, I do not really want to bring it up again :)

errge commented

@pongraczi btw, have you tried Xapian instead of squatter? Is it better or easier to operate?

@errge I tried xapian, but I think, I missed something very important/fundamental, because it was a horror show:

  • for me, the initial scan was incredible slow. That means, some thousand emails per day.
  • the generated *.glass files were huuuuuuuuuge, several hundred gigabytes for some thousands emails
  • I used it in rolling mode, I just pulled 1-2 emails into the folder, to see, is it working or not, I got files in gigabytes as index. So, the results was gigabyte for a 10-20kbyte email.

So, I was not able to figure out and find out, how xapian should work. I gave up xapian.

Regarding to squat, the manual said, it has poor performance in rolling mode. But what is poor performance? What does it do exactly? What's wrong with it? I have no idea yet.

I got valgrind result:

lot of working indexing mailbox blah/blah/blah...
cyrus/squatter[3647]: indexing mailbox user/elfogo/SpamBejelentes...
cyrus/squatter[3647]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[3647]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[3647]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
cyrus/squatter[3647]: unknown charset: charset=<charset="utf-8">
==3647== Conditional jump or move depends on uninitialised value(s)
==3647==    at 0x49201A0: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x491FF0B: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x4921944: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x491CE75: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x4909125: search_update_mailbox (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x10D07F: ??? (in /usr/lib/cyrus/bin/squatter)
==3647==    by 0x10C8D4: ??? (in /usr/lib/cyrus/bin/squatter)
==3647==    by 0x4BCB189: (below main) (libc_start_call_main.h:58)
==3647==
==3647== Conditional jump or move depends on uninitialised value(s)
==3647==    at 0x4921B5F: squat_count_encode_I (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x492025C: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x491FF0B: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x4921944: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x491CE75: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x4909125: search_update_mailbox (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==3647==    by 0x10D07F: ??? (in /usr/lib/cyrus/bin/squatter)
==3647==    by 0x10C8D4: ??? (in /usr/lib/cyrus/bin/squatter)
==3647==    by 0x4BCB189: (below main) (libc_start_call_main.h:58)
==3647==
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0
==3647==
==3647== HEAP SUMMARY:
==3647==     in use at exit: 30,184,646 bytes in 15,634 blocks
==3647==   total heap usage: 131,391,265 allocs, 131,375,631 frees, 340,462,643,914 bytes allocated
==3647==
==3647== LEAK SUMMARY:
==3647==    definitely lost: 0 bytes in 0 blocks
==3647==    indirectly lost: 0 bytes in 0 blocks
==3647==      possibly lost: 0 bytes in 0 blocks
==3647==    still reachable: 30,184,646 bytes in 15,634 blocks
==3647==         suppressed: 0 bytes in 0 blocks
==3647== Reachable blocks (those to which a pointer was found) are not shown.
==3647== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==3647==
==3647== Use --track-origins=yes to see where uninitialised values come from
==3647== For lists of detected and suppressed errors, rerun with: -s
==3647== ERROR SUMMARY: 46 errors from 3 contexts (suppressed: 0 from 0)
==3647== could not unlink /tmp/vgdb-pipe-from-vgdb-to-3647-by-root-on-???
==3647== could not unlink /tmp/vgdb-pipe-to-vgdb-from-3647-by-root-on-???
==3647== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-3647-by-root-on-???
root@cyrus3666:~# 
rsto commented

Very interesting! Now, would you have any chance to build cyrus with debugging info (e.g. pass -g to the compiler)?

rsto commented

I am curious what led your Xapian indexes blow up, BTW. Do you have lots of emails with machine-generated contents, or otherwise randomized data, such as git refs, ..? It's to be expected that an initial Xapian index is slow, especially if you index directly on slow disks. At Fastmail, we use the Xapian backend and we index in tiers, where the index first goes into a memory file system, then on fast NVME disks. Typically, indexing first on fast disks and then compressing on slower disks is recommended. There is somewhat dated but still relevant post that describes the setup we used successfully for a long time: https://fastmail.blog/historical/email-search-system/

errge commented

I've gdb and debugsymbols set up (easy for me, as I'm on Debian), do any of you know from the top of your head, how to break on assert?

fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0
[Inferior 1 (process 2663) exited with code 0113]
(gdb) bt
No stack.

So far, I've tried break abort and break __assert_fail, but inferior process still exits successfully and I have no bt.

Or maybe a conditional breakpoint at the line number on the condition that v64 < 0? What's the easiest?

UPDATE: I tried break imap/squat_internal.c:134 if v64 < 0 now, to no avail...

Very interesting! Now, would you have any chance to build cyrus with debugging info (e.g. pass -g to the compiler)?

Yes, but I need some time....

I am curious what led your Xapian indexes blow up, BTW.

These emails are business emails, nothing special, I guess. TBC.

rsto commented

@errge calling abort in assert is not elegant but quite useful to get a core

+++ b/lib/assert.c
@@ -54,5 +54,6 @@ assertionfailed(const char *file, int line, const char *expr)

     snprintf(buf, sizeof(buf), "Internal error: assertion failed: %s: %d%s%s",
             file, line, expr ? ": " : "", expr ? expr : "");
+    abort();
     fatal(buf, EX_SOFTWARE);
 }
errge commented
(gdb) bt
#0  0xb7fd4d61 in __kernel_vsyscall ()
#1  0xb7c10522 in __libc_signal_restore_set (set=0xbfffe2cc) at ../sysdeps/unix/sysv/linux/internal-signals.h:84
#2  __GI_raise (sig=6) at ../sysdeps/unix/sysv/linux/raise.c:48
#3  0xb7bfa2b6 in __GI_abort () at abort.c:79
#4  0xb7dcab34 in assertionfailed (file=0xb7f6b535 "imap/squat_internal.c", line=134, expr=0xb7f6b52c "v64 >= 0") at lib/assert.c:57
#5  0xb7f4b973 in squat_count_encode_I (v64=-1505723237) at imap/squat_internal.c:134
#6  0xb7f49efd in dump_doc_list_docs (docs=0x4435f0, index=0x412cc0) at imap/squat_build.c:1243
#7  write_trie_word_data (index=index@entry=0x412cc0, t=<optimized out>, len=len@entry=2, result_offset=<optimized out>) at imap/squat_build.c:1362
#8  0xb7f49b50 in write_trie_word_data (index=index@entry=0x412cc0, t=0xbff970, len=len@entry=3, result_offset=0xbffff394) at imap/squat_build.c:1341
#9  0xb7f4b555 in dump_index_trie_words (result_offset=0xbffff394, first_char=46, index=0x412cc0) at imap/squat_build.c:1464
#10 index_close_internal (index=index@entry=0x412cc0, OK=OK@entry=1) at imap/squat_build.c:1590
#11 0xb7f4b78e in squat_index_finish (index=0x412cc0) at imap/squat_build.c:1696
#12 0xb7f46c87 in end_mailbox (rx=0x40f460, mailbox=0xe816b4) at imap/search_squat.c:913
#13 0xb7f3a3b3 in search_update_mailbox (rx=0x40f460, mailbox=0xe816b4, flags=1) at imap/search_engines.c:240
#14 0x00403fc6 in index_one (name=0x40f0e0 "PRIVACY!PRIVACY.Trash", blocking=1) at imap/squatter.c:292
#15 0x00403571 in do_indexer (sa=0xbffffcbc, sa=0xbffffcbc) at imap/squatter.c:352
#16 main (argc=<optimized out>, argv=<optimized out>) at imap/squatter.c:1004

Please note that the source code that you have to compare this to is: https://packages.debian.org/source/buster/cyrus-imapd

BTW, just for your information: I'm currently playing on a snapshot of a system, so I can even do code modifications and data modifications. (Apart from version upgrade, as that would kill the purpose of my snapshot for now.)

UPDATE: I've added some fprintf statements to imap/squat_build.c:1243 to see what's happening, and this is the finish line that triggers (after a lot of similar messages of course):

run_seq_delta: 13
run_seq_delta << 1: 26
(run_seq_delta << 1) | 1: 27
run_seq_delta: 438
run_seq_delta << 1: 876
(run_seq_delta << 1) | 1: 877
run_seq_delta: 117
run_seq_delta << 1: 234
(run_seq_delta << 1) | 1: 235
run_seq_delta: 23
run_seq_delta << 1: 46
(run_seq_delta << 1) | 1: 47
run_seq_delta: 1394622029
run_seq_delta << 1: -1505723238
(run_seq_delta << 1) | 1: -1505723237

I hope this helps already, but as background info the user's mailbox is 5.4G and has 24555 files (including all files in all submailboxes).

@pongraczi are you running on i686? Because I'm, and in the source code this is simply an int, so maybe upgrading to amd64 would magicly solve this? Although, looking at the output scrolling for a couple of minutes, all the numbers were very small and this is the only big number at the crash, so still sounds like a bug, not just a running out of 32-bit issue.

rsto commented

The plot seems to thicken:

That bogus argument for squat_count_encode_I is the run_seq_delta variable in squat_build.c:1209 static int dump_doc_list_docs.

Now these lines look suspicious

1247                     run_seq_delta = doc->doc_ID - last_doc_ID;
1248                 }
1249                 last_doc_ID = doc->doc_ID;
1250                 doc = doc->next;

If any of of the two doc ids is read from unitialized memory this would explain valgrind reporting run_seq_delta to be uninitialized, too. Rerunning valgrind with the --track-origins=yes argument could confirm this.

And the doc_id could be bogus if

1250                 doc = doc->next;

makes doc point one after the valid document list in the next loop iteration. The fact that this issue seems to show up for the last message in a mailbox also suggests this.

I can't go further today than this armchair-debugging today. Hopefully we can sort this out soon.

errge commented

@rsto please note that I updated a comment with more info, and fprintf debugging. Just to make sure I understand you correctly could you just spell out the fprintfs that I could add to prove your hypothesis? Valgrind for some reason doesn't show anything for me.

Also is incremental generation consistent with your hypothesis, can we conclude that I'm safe if i just don't use -i for now?

I stopped cyrus-imapd and run squatter. The result is interesting: died earlier than before with other symptoms.

root@cyrus3666:~# valgrind --track-origins=yes --leak-check=yes /usr/lib/cyrus/bin/squatter -p -u  elfogo
==5992== Memcheck, a memory error detector
==5992== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==5992== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==5992== Command: /usr/lib/cyrus/bin/squatter -p -u elfogo
==5992== 
cyrus/squatter[5992]: indexing mailboxes
cyrus/squatter[5992]: indexing mailbox user/elfogo... 
cyrus/squatter[5992]: mailbox: longlock user.elfogo for 1.6 seconds
cyrus/squatter[5992]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[5992]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[5992]: ERROR: message has more than 1000 header lines not caching any more
==5992== Conditional jump or move depends on uninitialised value(s)
==5992==    at 0x49207F0: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x491E4BE: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x491EA17: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x491E8FE: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x491E8FE: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x491F6C1: squat_scan (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x4921815: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x491CE75: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x4909125: search_update_mailbox (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x10D07F: ??? (in /usr/lib/cyrus/bin/squatter)
==5992==    by 0x10C8D4: ??? (in /usr/lib/cyrus/bin/squatter)
==5992==    by 0x4BCB189: (below main) (libc_start_call_main.h:58)
==5992==  Uninitialised value was created by a heap allocation
==5992==    at 0x484582F: realloc (vg_replace_malloc.c:1437)
==5992==    by 0x4B37F5D: xrealloc (in /usr/lib/x86_64-linux-gnu/libcyrus_min.so.0.0.0)
==5992==    by 0x4920E07: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x491EF87: squat_search_list_docs (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x491DF59: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x4909002: search_update_mailbox (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==5992==    by 0x10D07F: ??? (in /usr/lib/cyrus/bin/squatter)
==5992==    by 0x10C8D4: ??? (in /usr/lib/cyrus/bin/squatter)
==5992==    by 0x4BCB189: (below main) (libc_start_call_main.h:58)
==5992== 
cyrus/squatter[5992]: indexing mailbox user/elfogo/BBBB... 
cyrus/squatter[5992]: mailbox: longlock user.elfogo.BBBB for 1.6 seconds
cyrus/squatter[5992]: indexing mailbox user/elfogo/XXXX... 
cyrus/squatter[5992]: mailbox: longlock user.elfogo.XXXX for 452.1 seconds
cyrus/squatter[5992]: indexing mailbox user/elfogo/XXXX/2018... 
cyrus/squatter[5992]: mailbox: longlock user.elfogo.XXXX.2018 for 14.6 seconds
cyrus/squatter[5992]: indexing mailbox user/elfogo/XXXX/2019... 
cyrus/squatter[5992]: mailbox: longlock user.elfogo.XXXX.2019 for 11.1 seconds
cyrus/squatter[5992]: indexing mailbox user/elfogo/XXXX/2020... 
cyrus/squatter[5992]: mailbox: longlock user.elfogo.XXXX.2020 for 42.3 seconds
cyrus/squatter[5992]: indexing mailbox user/elfogo/Airsea/Archive... 
==5992== Warning: set address range perms: large range [0x59c8b000, 0xbd9bc000) (defined)
==5992== Warning: set address range perms: large range [0xbd9bc040, 0x12a822420) (undefined)
Killed
root@cyrus3666:~# 

Technically this imap server does not receive any new messages, ever. So, except the delayed delete/expunge, there should be no changes in messages.

Now I will start cyrus-imapd again and start this again.

Ok, I got the same result, except, I started squatter with -v this time, so, some additional info appeared.

root@cyrus3666:~# valgrind --track-origins=yes --leak-check=yes /usr/lib/cyrus/bin/squatter -v -p -u  elfogo
==6571== Memcheck, a memory error detector
==6571== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==6571== Using Valgrind-3.19.0 and LibVEX; rerun with -h for copyright info
==6571== Command: /usr/lib/cyrus/bin/squatter -v -p -u elfogo
==6571== 
cyrus/squatter[6571]: indexing mailboxes
cyrus/squatter[6571]: indexing mailbox user/elfogo... 
Indexing mailbox user/elfogo... cyrus/squatter[6571]: mailbox: longlock user.elfogo for 1.7 seconds
cyrus/squatter[6571]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[6571]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[6571]: ERROR: message has more than 1000 header lines not caching any more
==6571== Conditional jump or move depends on uninitialised value(s)
==6571==    at 0x49207F0: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x491E4BE: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x491EA17: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x491E8FE: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x491E8FE: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x491F6C1: squat_scan (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x4921815: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x491CE75: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x4909125: search_update_mailbox (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x10D07F: ??? (in /usr/lib/cyrus/bin/squatter)
==6571==    by 0x10C8D4: ??? (in /usr/lib/cyrus/bin/squatter)
==6571==    by 0x4BCB189: (below main) (libc_start_call_main.h:58)
==6571==  Uninitialised value was created by a heap allocation
==6571==    at 0x484582F: realloc (vg_replace_malloc.c:1437)
==6571==    by 0x4B37F5D: xrealloc (in /usr/lib/x86_64-linux-gnu/libcyrus_min.so.0.0.0)
==6571==    by 0x4920E07: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x491EF87: squat_search_list_docs (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x491DF59: ??? (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x4909002: search_update_mailbox (in /usr/lib/x86_64-linux-gnu/libcyrus_imap.so.0.0.0)
==6571==    by 0x10D07F: ??? (in /usr/lib/cyrus/bin/squatter)
==6571==    by 0x10C8D4: ??? (in /usr/lib/cyrus/bin/squatter)
==6571==    by 0x4BCB189: (below main) (libc_start_call_main.h:58)
==6571== 
cyrus/squatter[6571]: squat: user.elfogo indexed 7337 messages (81749802 bytes) into 121614295 index bytes in 1354 seconds
cyrus/squatter[6571]: indexing mailbox user/elfogo/AMM... 
Indexing mailbox user/elfogo/AMM... cyrus/squatter[6571]: squat: user.elfogo.AMM indexed 0 messages (0 bytes) into 430523 index bytes in 2 seconds
cyrus/squatter[6571]: mailbox: longlock user.elfogo.AMM for 1.7 seconds
cyrus/squatter[6571]: indexing mailbox user/elfogo/XXXXX... 
Indexing mailbox user/elfogo/XXXXX... cyrus/squatter[6571]: squat: user.elfogo.XXXXX indexed 0 messages (0 bytes) into 204892924 index bytes in 464 seconds
cyrus/squatter[6571]: mailbox: longlock user.elfogo.XXXXX for 469.8 seconds
cyrus/squatter[6571]: indexing mailbox user/elfogo/XXXXX/2018... 
Indexing mailbox user/elfogo/XXXXX/2018... cyrus/squatter[6571]: squat: user.elfogo.XXXXX.2018 indexed 0 messages (0 bytes) into 5390174 index bytes in 15 seconds
cyrus/squatter[6571]: mailbox: longlock user.elfogo.XXXXX.2018 for 15.3 seconds
cyrus/squatter[6571]: indexing mailbox user/elfogo/XXXXX/2019... 
Indexing mailbox user/elfogo/XXXXX/2019... cyrus/squatter[6571]: squat: user.elfogo.XXXXX.2019 indexed 0 messages (0 bytes) into 4107748 index bytes in 11 seconds
cyrus/squatter[6571]: mailbox: longlock user.elfogo.XXXXX.2019 for 11.6 seconds
cyrus/squatter[6571]: indexing mailbox user/elfogo/XXXXX/2020... 
Indexing mailbox user/elfogo/XXXXX/2020... cyrus/squatter[6571]: squat: user.elfogo.XXXXX.2020 indexed 0 messages (0 bytes) into 16597902 index bytes in 43 seconds
cyrus/squatter[6571]: mailbox: longlock user.elfogo.XXXXX.2020 for 43.8 seconds
cyrus/squatter[6571]: indexing mailbox user/elfogo/XXXXX/Archive... 
Indexing mailbox user/elfogo/XXXXX/Archive... ==6571== Warning: set address range perms: large range [0x59c8b000, 0xbd9bc000) (defined)
==6571== Warning: set address range perms: large range [0xbd9bc040, 0x12a822420) (undefined)
Killed

I think, valgrind cause something, the result is not the same, as without it.

I mean:

  • with valgrind I got two different results when I used the two different valgrind invocations (--track-origins=yes --leak-check=yes vs. --leak-check=yes)
  • seems the --track-origins=yes caused the 2nd result
  • running squatter without valgrind, I got fast scan

I run squatter as cyrus invoking like this:
cyrus@cyrus3666:~$ /usr/lib/cyrus/bin/squatter -u elfogo

Now it finished the user's mailbox without problem. Time to have a drink, I guess.

So, now I run squatter as I used to via cron or manually for all of the mailboxes and now I got the assertion failed result, as expected. :)))))))

cyrus@cyrus3666:~$ /usr/lib/cyrus/bin/squatter
blahblablah
cyrus/squatter[7109]: indexing mailbox user/elfogo/Trash... 
cyrus/squatter[7109]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[7109]: ERROR: message has more than 1000 header lines not caching any more
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0

Exactly where the original problem appeared.

Now I am too tired to summarize, maybe tomorrow.

Thanks.

Ok, I compiled with debug symbols and I run squatter as I used to.

root@cyrus3666:~# gdb /usr/lib/cyrus/bin/squatter

cyrus/squatter[30280]: indexing mailbox user/SSSS/SBBB870... 
cyrus/squatter[30280]: indexing mailbox user/SSSS/SYYYYY... 
cyrus/squatter[30280]: indexing mailbox user/SSSS/Sent... 
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0
[Inferior 1 (process 30280) exited with code 0106]
(gdb) list
908	imap/squatter.c: No such file or directory.
(gdb) print
The history is empty.
(gdb) 

This mailbox also was a typical one, where squatter died all the time.

rsto commented

@pongraczi Please run the command from the source directory where you built Cyrus.

rsto commented

And apply this patch #4336 (comment)

rsto commented

I am currently sifting through the squat code. I still can't replicate the issue locally, I might need to load my test environment with tons of emails to replicate probably. I have a couple of guesses for the root cause, but it's always a pain with memory corruptions since they can mess up any data.

rsto commented

Could any of you please apply this patch and re-run the failing squatters?

diff --git a/imap/squat_build.c b/imap/squat_build.c
index 0d7f05896e..d7bd4e672d 100644
--- a/imap/squat_build.c
+++ b/imap/squat_build.c
@@ -661,6 +661,8 @@ static int add_to_table(SquatIndex *index, char const *data, int data_len,
     int ch;
     SquatWordTableEntry *e;

+    assert(data_len >= 2);
+
     while (data_len > 2) {
         /* Follow the branch node down to the next level of the trie. */
         ch = (unsigned char)data[0];

Sure, give me some minutes.
Just compiled with the previous one (+abort()), I run it just for curiosity.

Hmmm, seems not really useful for me (both patches applied):

cyrus/squatter[20370]: indexing mailbox user/elfogo/Trash... 
cyrus/squatter[20370]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[20370]: ERROR: message has more than 1000 header lines not caching any more

Program received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
44      ./nptl/pthread_kill.c: No such file or directory.
(gdb)

Regarding valgrind results yesterday: I am afraid, the oom-killer killed that process and that was why it died differently (I found kernel logs about it for yesterday evening). To confirm, it needs lot of time, due to that, the valgrind running is so slow and I also have other tasks.

Valgrind run ended now, aborted as expected. It took about 180 minutes from start to get this error. :)

There was no kernel oom-kill.

As I checked the output, it seems useful and

root@cyrus3666:~# valgrind --track-origins=yes --leak-check=yes /usr/lib/cyrus/bin/squatter

cyrus/squatter[21701]: ERROR: message has more than 1000 header lines not caching any more
==21701== Conditional jump or move depends on uninitialised value(s)
==21701==    at 0x49231A0: dump_doc_list_docs (squat_build.c:1234)
==21701==    by 0x49231A0: write_trie_word_data (squat_build.c:1362)
==21701==    by 0x4922F0B: write_trie_word_data (squat_build.c:1341)
==21701==    by 0x4924944: dump_index_trie_words (squat_build.c:1464)
==21701==    by 0x4924944: index_close_internal (squat_build.c:1590)
==21701==    by 0x491FE75: end_mailbox (search_squat.c:933)
==21701==    by 0x490C125: search_update_mailbox (search_engines.c:261)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701==  Uninitialised value was created by a heap allocation
==21701==    at 0x484582F: realloc (vg_replace_malloc.c:1437)
==21701==    by 0x4B3AF3D: xrealloc (xmalloc.c:79)
==21701==    by 0x4923E07: doc_ID_map_add (squat_build.c:355)
==21701==    by 0x4923E07: copy_docIDs (squat_build.c:379)
==21701==    by 0x4921F87: squat_search_list_docs (squat.c:167)
==21701==    by 0x4920F59: begin_mailbox (search_squat.c:832)
==21701==    by 0x490C002: search_update_mailbox (search_engines.c:221)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701== 
==21701== Conditional jump or move depends on uninitialised value(s)
==21701==    at 0x4924B5F: squat_count_encode_I (squat_internal.c:134)
==21701==    by 0x492325C: dump_doc_list_docs (squat_build.c:1244)
==21701==    by 0x492325C: write_trie_word_data (squat_build.c:1362)
==21701==    by 0x4922F0B: write_trie_word_data (squat_build.c:1341)
==21701==    by 0x4924944: dump_index_trie_words (squat_build.c:1464)
==21701==    by 0x4924944: index_close_internal (squat_build.c:1590)
==21701==    by 0x491FE75: end_mailbox (search_squat.c:933)
==21701==    by 0x490C125: search_update_mailbox (search_engines.c:261)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701==  Uninitialised value was created by a heap allocation
==21701==    at 0x484582F: realloc (vg_replace_malloc.c:1437)
==21701==    by 0x4B3AF3D: xrealloc (xmalloc.c:79)
==21701==    by 0x4923E07: doc_ID_map_add (squat_build.c:355)
==21701==    by 0x4923E07: copy_docIDs (squat_build.c:379)
==21701==    by 0x4921F87: squat_search_list_docs (squat.c:167)
==21701==    by 0x4920F59: begin_mailbox (search_squat.c:832)
==21701==    by 0x490C002: search_update_mailbox (search_engines.c:221)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701== 
==21701== Conditional jump or move depends on uninitialised value(s)
==21701==    at 0x4924B68: squat_count_encode_I (squat_internal.c:136)
==21701==    by 0x492325C: dump_doc_list_docs (squat_build.c:1244)
==21701==    by 0x492325C: write_trie_word_data (squat_build.c:1362)
==21701==    by 0x4922F0B: write_trie_word_data (squat_build.c:1341)
==21701==    by 0x4924944: dump_index_trie_words (squat_build.c:1464)
==21701==    by 0x4924944: index_close_internal (squat_build.c:1590)
==21701==    by 0x491FE75: end_mailbox (search_squat.c:933)
==21701==    by 0x490C125: search_update_mailbox (search_engines.c:261)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701==  Uninitialised value was created by a heap allocation
==21701==    at 0x484582F: realloc (vg_replace_malloc.c:1437)
==21701==    by 0x4B3AF3D: xrealloc (xmalloc.c:79)
==21701==    by 0x4923E07: doc_ID_map_add (squat_build.c:355)
==21701==    by 0x4923E07: copy_docIDs (squat_build.c:379)
==21701==    by 0x4921F87: squat_search_list_docs (squat.c:167)
==21701==    by 0x4920F59: begin_mailbox (search_squat.c:832)
==21701==    by 0x490C002: search_update_mailbox (search_engines.c:221)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701== 
==21701== Conditional jump or move depends on uninitialised value(s)
==21701==    at 0x4924BB3: squat_count_encode_I (squat_internal.c:137)
==21701==    by 0x492325C: dump_doc_list_docs (squat_build.c:1244)
==21701==    by 0x492325C: write_trie_word_data (squat_build.c:1362)
==21701==    by 0x4922F0B: write_trie_word_data (squat_build.c:1341)
==21701==    by 0x4924944: dump_index_trie_words (squat_build.c:1464)
==21701==    by 0x4924944: index_close_internal (squat_build.c:1590)
==21701==    by 0x491FE75: end_mailbox (search_squat.c:933)
==21701==    by 0x490C125: search_update_mailbox (search_engines.c:261)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701==  Uninitialised value was created by a heap allocation
==21701==    at 0x484582F: realloc (vg_replace_malloc.c:1437)
==21701==    by 0x4B3AF3D: xrealloc (xmalloc.c:79)
==21701==    by 0x4923E07: doc_ID_map_add (squat_build.c:355)
==21701==    by 0x4923E07: copy_docIDs (squat_build.c:379)
==21701==    by 0x4921F87: squat_search_list_docs (squat.c:167)
==21701==    by 0x4920F59: begin_mailbox (search_squat.c:832)
==21701==    by 0x490C002: search_update_mailbox (search_engines.c:221)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701== 
==21701== Conditional jump or move depends on uninitialised value(s)
==21701==    at 0x4924BCD: squat_count_encode_I (squat_internal.c:141)
==21701==    by 0x492325C: dump_doc_list_docs (squat_build.c:1244)
==21701==    by 0x492325C: write_trie_word_data (squat_build.c:1362)
==21701==    by 0x4922F0B: write_trie_word_data (squat_build.c:1341)
==21701==    by 0x4924944: dump_index_trie_words (squat_build.c:1464)
==21701==    by 0x4924944: index_close_internal (squat_build.c:1590)
==21701==    by 0x491FE75: end_mailbox (search_squat.c:933)
==21701==    by 0x490C125: search_update_mailbox (search_engines.c:261)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701==  Uninitialised value was created by a heap allocation
==21701==    at 0x484582F: realloc (vg_replace_malloc.c:1437)
==21701==    by 0x4B3AF3D: xrealloc (xmalloc.c:79)
==21701==    by 0x4923E07: doc_ID_map_add (squat_build.c:355)
==21701==    by 0x4923E07: copy_docIDs (squat_build.c:379)
==21701==    by 0x4921F87: squat_search_list_docs (squat.c:167)
==21701==    by 0x4920F59: begin_mailbox (search_squat.c:832)
==21701==    by 0x490C002: search_update_mailbox (search_engines.c:221)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701== 
==21701== Conditional jump or move depends on uninitialised value(s)
==21701==    at 0x4924BDA: squat_count_encode_I (squat_internal.c:143)
==21701==    by 0x492325C: dump_doc_list_docs (squat_build.c:1244)
==21701==    by 0x492325C: write_trie_word_data (squat_build.c:1362)
==21701==    by 0x4922F0B: write_trie_word_data (squat_build.c:1341)
==21701==    by 0x4924944: dump_index_trie_words (squat_build.c:1464)
==21701==    by 0x4924944: index_close_internal (squat_build.c:1590)
==21701==    by 0x491FE75: end_mailbox (search_squat.c:933)
==21701==    by 0x490C125: search_update_mailbox (search_engines.c:261)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701==  Uninitialised value was created by a heap allocation
==21701==    at 0x484582F: realloc (vg_replace_malloc.c:1437)
==21701==    by 0x4B3AF3D: xrealloc (xmalloc.c:79)
==21701==    by 0x4923E07: doc_ID_map_add (squat_build.c:355)
==21701==    by 0x4923E07: copy_docIDs (squat_build.c:379)
==21701==    by 0x4921F87: squat_search_list_docs (squat.c:167)
==21701==    by 0x4920F59: begin_mailbox (search_squat.c:832)
==21701==    by 0x490C002: search_update_mailbox (search_engines.c:221)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701== 
==21701== 
==21701== Process terminating with default action of signal 6 (SIGABRT)
==21701==    at 0x4C31CCC: __pthread_kill_implementation (pthread_kill.c:44)
==21701==    by 0x4BE2EF1: raise (raise.c:26)
==21701==    by 0x4BCD471: abort (abort.c:79)
==21701==    by 0x4B2FCB1: assertionfailed (assert.c:57)
==21701==    by 0x4924BFF: squat_count_encode_I (squat_internal.c:134)
==21701==    by 0x492325C: dump_doc_list_docs (squat_build.c:1244)
==21701==    by 0x492325C: write_trie_word_data (squat_build.c:1362)
==21701==    by 0x4922F0B: write_trie_word_data (squat_build.c:1341)
==21701==    by 0x4924944: dump_index_trie_words (squat_build.c:1464)
==21701==    by 0x4924944: index_close_internal (squat_build.c:1590)
==21701==    by 0x491FE75: end_mailbox (search_squat.c:933)
==21701==    by 0x490C125: search_update_mailbox (search_engines.c:261)
==21701==    by 0x10D07F: index_one (squatter.c:393)
==21701==    by 0x10C8D4: do_indexer (squatter.c:454)
==21701==    by 0x10C8D4: main (squatter.c:1218)
==21701== 
==21701== HEAP SUMMARY:
==21701==     in use at exit: 44,371,796 bytes in 19,925 blocks
==21701==   total heap usage: 23,934,318 allocs, 23,914,393 frees, 65,566,609,381 bytes allocated
==21701== 
==21701== LEAK SUMMARY:
==21701==    definitely lost: 0 bytes in 0 blocks
==21701==    indirectly lost: 0 bytes in 0 blocks
==21701==      possibly lost: 0 bytes in 0 blocks
==21701==    still reachable: 44,371,796 bytes in 19,925 blocks
==21701==         suppressed: 0 bytes in 0 blocks
==21701== Reachable blocks (those to which a pointer was found) are not shown.
==21701== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==21701== 
==21701== For lists of detected and suppressed errors, rerun with: -s
==21701== ERROR SUMMARY: 23 errors from 7 contexts (suppressed: 0 from 0)
==21701== could not unlink /tmp/vgdb-pipe-from-vgdb-to-21701-by-root-on-???
==21701== could not unlink /tmp/vgdb-pipe-to-vgdb-from-21701-by-root-on-???
==21701== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-21701-by-root-on-???
Aborted

Unfortunately I was not able to use gdb to get any info, I have not enough experience with gdb. Sorry, my bad.

rsto commented

I haven't had time to look into this further and I do not expect to change that anytime soon. We do use xapian at Fastmail and recommend to use it. I can understand it's more complex to set up, but hopefully the link I provided #4336 (comment) helps. We want to make it simpler or at least document more, but for the time being that describes best the setup that we successfully use at Fastmail.
Or course, I anyone is willing to fix this issue with squat and comes up with a pull request we'll happily merge it.

Thank you Robert, I understand.
Regarding xapian, I set up it in a test environment (copy of production), even it seemed set up was not really hard, but it was extremely slow and produced unbelivable huge *glass files even for some mails (some tens kbytes email -> some tens gigabytes *glass file) about ordenary emails (not a git commit or other like this).

When I wrote slow, I mean some thousands emails / day and even the storage is a 6 drives SSD as raidz2, I don't think, that was the root cause.

I think I will open a xapian thread after I try again, maybe set up xapian not as easy I thought :)

I keep this issue open, I will try to dig deeper.

Quick question: is it possible, folders with more than 32768 emails inside could cause issues during squatter process?

Some results:

I put some diagnostic code into the function dump_doc_list_docs to dump variables just before the assertion happens.
I started the squatter with strace (strace -o naploka.log ./squatter) and soon the assertion failed happened.

cyrus/squatter[18959]: indexing mailbox user/elfogo/Trash... 
cyrus/squatter[18959]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[18959]: ERROR: message has more than 1000 header lines not caching any more
1 ((run_seq_delta << 1) | 1) < 0
run_seq_count: 1 , run_seq_delta: -21930 , run_size: 16
i: 65 , doc_count: 15 , last_doc_ID: 157 
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0

It died here with the variables dumped above:

cyrus-imapd/imap/squat_build.c

Lines 1242 to 1244 in fb975e7

run_size +=
squat_count_encode_I((run_seq_delta << 1) |
1);

As I opened the strace log (please note, the strace log file compressed, otherwise it is about 75MB), I found the location, where it happened, but interestengly, the last opened email file was the last email in that folder, so, technically that folder was completed and the process died right after it.

UPDATE:
squat files in that folder looks like this:

-rw------- 1 cyrus mail 8232233 Feb  3 22:46 /var/spool/cyrus/mail/uuid/j/9/j9x6ytfm3ain3r60yd9zcemw/cyrus.squat
-rw------- 1 cyrus mail  785909 Feb  4 00:03 /var/spool/cyrus/mail/uuid/j/9/j9x6ytfm3ain3r60yd9zcemw/cyrus.squat.NEW

An other assertion, with the same condition as previously ((run_seq_delta << 1) | 1):

cyrus/squatter[2445]: indexing mailbox user/csaba/Sent... 
1 ((run_seq_delta << 1) | 1) < 0
run_seq_count: 1 , run_seq_delta: 2121408881 , run_size: 30932
i: 65 , doc_count: 30852 , last_doc_ID: 2121735248 
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0

This folder seems scanned in the previous time without problem.
This folder has 32827 emails inside and seems all of them scanned (based on strace log + filesystem check).

-rw------- 1 cyrus mail 100374969 Feb  3 23:53 cyrus.squat
-rw------- 1 cyrus mail  18660677 Feb  4 18:03 cyrus.squat.NEW

After I deleted the previously created cyrus.squat and cyrus.squat.NEW files in the folders where assertion failed, I got squatter run further now.
Assertion happened on exactly the same condition as the previous two.
Variables shown below.

cyrus/squatter[2566]: indexing mailbox user/irony/Archives/2010... 
1 ((run_seq_delta << 1) | 1) < 0
run_seq_count: 1 , run_seq_delta: -307917535 , run_size: 2620
i: 65 , doc_count: 2621 , last_doc_ID: -307895120 
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0

squat files:

-rw------- 1 cyrus mail 7882913 Feb  3 21:32 /var/spool/cyrus/mail/uuid/b/y/by13zw6graiv4opvxmnf5jp5/cyrus.squat
-rw------- 1 cyrus mail 1307028 Feb  4 18:34 /var/spool/cyrus/mail/uuid/b/y/by13zw6graiv4opvxmnf5jp5/cyrus.squat.NEW

Now it happened again at the folder I already reported 11 hours ago. The variables are different, but close enough:

cyrus/squatter[3732]: indexing mailbox user/elfogo/Trash... 
cyrus/squatter[3732]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[3732]: ERROR: message has more than 1000 header lines not caching any more
1 ((run_seq_delta << 1) | 1) < 0
run_seq_count: 1 , run_seq_delta: -21792 , run_size: 15
i: 65 , doc_count: 14 , last_doc_ID: 123 
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0
-rw------- 1 cyrus mail 8208672 Feb  4 18:29 /var/spool/cyrus/mail/uuid/j/9/j9x6ytfm3ain3r60yd9zcemw/cyrus.squat
-rw------- 1 cyrus mail  785850 Feb  4 20:26 /var/spool/cyrus/mail/uuid/j/9/j9x6ytfm3ain3r60yd9zcemw/cyrus.squat.NEW

At least now it consistens, at the same place, almost with the same numbers.

cyrus/squatter[4241]: indexing mailbox user/elfogo/Trash... 
cyrus/squatter[4241]: ERROR: message has more than 1000 header lines not caching any more
cyrus/squatter[4241]: ERROR: message has more than 1000 header lines not caching any more
1 ((run_seq_delta << 1) | 1) < 0
run_seq_count: 1 , run_seq_delta: -21943 , run_size: 15
i: 65 , doc_count: 14 , last_doc_ID: 123 
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0

At this moment I stop posting these reports.
If anybody has any idea, let me know :)

New week, new test case:

  • I deleted all the squat files from the system manually, including cyrus.squat and cyrus.squat.NEW files, while the system was up and running (I mean cyrus itself)
  • I run a squatter -v as user cyrus, which completed after 275 minutes later, seems without any assertion, just finished the job as expected
  • after about 24 hours, I run a new full index by the command: squatter -v
    • please note, this is a cloned test system, there are no new messages, maybe some emails were deleted, due to delayed expunge (120 days)
  • now it exited after 4 minutes with assertion failed message (this squatter was not the patched one, I have no dump yet)
    • the location, where it produced the assertion failed event is completely different than the earlier cases, after csaba/sent but before elfogo/Trash
  • I run the patched squatter in the 3rd time: ./squatter -v , I got the assertion failed problem in the exact location as in the previous case, now I have variables dump (the condition is exactly the same and run_seq_delta is definitely a negative number):
1 ((run_seq_delta << 1) | 1) < 0
run_seq_count: 1 , run_seq_delta: -72801 , run_size: 8969
i: 65 , doc_count: 8967 , last_doc_ID: 7266 
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0
  • I ran an incremental ./squatter -i -v, I got the exactly same result as shown above

So, in short:

  • 1st run finished as expected
  • 2nd run assertion failed at mailbox X
  • 3rd run assertion failed again at the exact same mailbox as in the 2nd run
  • 4th run as incremental squatter failed with assertion with the exact same result at the exact same mailbox as shown earlier
  • there were no new or expunged messages, nothing changed (I checked the logs, back to 5 days)
    • Log: cyrus/cyr_expire[2778]: Expired 0 and expunged 0 out of 6450856 messages from 17991 mailboxes

Unfortunately this problem still exists in my system.
v3.6.1-4 on debian testing/updates

Is there any trick, how to catch this issue? Or any idea what should cause this kind of error?

Character encoding in the body? In the header?

Test case:

I started squatter for a specific user from his top level mailbox, where I got the assertion failed at the user/john/Sent mailbox.
When I started the squatter for that specific mailbox, it completed without any issue (I did not use any -v).

Case 1 - toplevel mailbox
root@cyrus3666:~# /usr/lib/cyrus/bin/squatter  -r  user/john
cyrus/squatter[40009]: indexing mailbox user/john/Sent... 
fatal error: Internal error: assertion failed: imap/squat_internal.c: 134: v64 >= 0

The result is in the file, called john2_toplevel.txt

Case 2 - Sent mailbox

root@cyrus3666:~# /usr/lib/cyrus/bin/squatter    user/john/Sent
cyrus/squatter[40050]: indexing mailboxes
cyrus/squatter[40050]: indexing mailbox user/john/Sent... 
cyrus/squatter[40050]: done indexing mailboxes

The result is in the file, called john2_sent.txt

Compare the results

As shown on the screenshot below, the assertion happened at a specific position in case 1, while the squatter just passes the very same point in case 2. I can repeat case 1 and case 2 in any order or in any time, the result is the same.
Kijelölés_391

Please note, the cyrus/squatter[48129]: squat: opening/writing lines are exactly the same in both cases, only difference is, when only the "problematic" mailbox being squattered will be finished as expected.

Any idea welcome :)

@pongraczi We had a similar issue with squatter and cyrus-imapd 3.4.
I don't know the cause of the problem but this attempt solved it (notes adapted to your case) :

cd user/john/Sent
rm cyrus.*
reconstruct -r -f user/john/Sent
squatter -v -r user/john

Thank you!
I will give it a try :)
I do not remember exactly, in the past did I already done it or not.

Thank you again for your hint!
I started to test this on my system, takes a lot of time (several full squattering on 1TB+ emails).

As my 3.6 mailboxes converted to the new storage format, first one need to determine the filesystem path to the mailbox. So, just for the record:

cd `/usr/lib/cyrus/bin/mbpath "user/john/Sent"`
rm cyrus.*
reconstruct -r -f user/john/Sent
squatter -v -r user/john

Ok, so, as I remembered for older tests, in my case the problem is that

  • deleted cyrus.* and reconstructed the problematic mailbox,
  • first full squatter ( squatter -p) just passes the mailbox, created the squat file
  • on the 2nd full squatter run, the assertion happened again
  • interestengly, there were problematic folder (which was fixed already) before this mailbox which did not produce the assertion anymore, while an other mailbox fails on every full squatter

Test case, steps:

  • full squattering -> squatter -p assertion failed in one mailbox, after squattering hundreds of mailboxes (even thousands)
  • the mailbox located (mbpath) -> cyrus.* deleted
  • the mailbox directly reconstructed as shown above, I got the following message at the end:
    Wrong uidvalidity in mbentry, fixing user.dorfi.Archives.Elk&APw-ld&APY-tt (1685081841 -> 1685087889)
    I got always this after every reconstructing process (I did several times).
  • squattering the mailbox directly -> squatter -p user/mailbox-path -> successfully finished
  • squattering again (2nd run) the mailbox directly -> squatter -p user/mailbox-path -> successfully finished
  • squattering again incrementally (3rd run) the mailbox directly -> squatter -p -i user/mailbox-path -> successfully finished
  • squattering again incrementally (4th run) the mailbox directly -> squatter -p -i user/mailbox-path -> successfully finished
  • full squattering using the formula squatter -p -> assertion failed exactly at the same mailbox

The result above does no true for all the previously problematic mailboxes.
Still playing.

Finally, there are results.

Thanks for the comments by @neverexists and @Nucleus-MFelettigh I went trough again on my system.
For the record, all emails synced from an old cyrus to a new one and converted to uuid. The used tool was imapsync.

So, there were about 10 mailboxes, where squatter died with assertion failed, especially in the 2nd run.
In some cases the steps described above (rm cyrus.*, reconstruct , etc.) solved the situation, while in other cases it did not solve the issue.
The latter cases I found that, there were some emails, which had Message-Id: <xxx@imapsync> , which I deleted and the problem just gone.
There were one or two emails, where there were no body, nothing, except the Message-Id: <xxx@imapsync>.
Anyway, as they were old emails, with no value contents, I deleted these emails, reconstructed these mailboxes -> problems gone.

I do not know, what caused these strange messages, maybe the original emails were damaged in the past (2008-2010), when the underlaying storage was something different and probably the ext2/3/4 were damaged.

So, after a lot of time, it seems our original problems caused by some emails in some conditions.

Thank you for all your attention and helps.

I close this ticket, as the mystery solved.