Server crashing after ungraceful shutdown
Closed this issue · 12 comments
Hi guys,
I am running the latest version of TT, compiled from source, on ARMv7 (32-bit) under an Armbian variant. All was working well until this morning.
Overnight my client PC running the PuTTY terminal shut down, and in the morning I was not able to start TT:
nonix@orangepizero:~/ticktock$ bin/tt -c conf/tt.conf
TickTockDB v0.20.1, Maintained by
Yongtao You (yongtao.you@gmail.com) and Yi Lin (ylin30@gmail.com).
This program comes with ABSOLUTELY NO WARRANTY. It is free software,
and you are welcome to redistribute it under certain conditions.
For details, see <https://www.gnu.org/licenses/>.
Writing to log file: /var/log/ticktock/ticktock.log
bin/tt(+0x8754)[0x42f754]
/lib/arm-linux-gnueabihf/libc.so.6(+0x2e1e0)[0xb6c271e0]
bin/tt(+0x30fcc)[0x457fcc]
bin/tt(+0x41350)[0x468350]
bin/tt(+0x4bbee)[0x472bee]
bin/tt(+0x4c208)[0x473208]
bin/tt(+0x359da)[0x45c9da]
bin/tt(+0x6c70)[0x42dc70]
/lib/arm-linux-gnueabihf/libc.so.6(+0x1e31a)[0xb6c1731a]
/lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x5d)[0xb6c173ca]
Interrupted (11), shutting down...
The log did not have any useful info; TT did not even get to the point of writing anything there. The last record was from the night before.
I solved the issue by deleting the data folder and starting all over.
I am new to TSDBs and am trying to get something simple and lightweight up and running for my IoT thermometer.
Perhaps I did something wrong, or it is easy to fix for someone who knows the internals ...
Anyway, thank you for your product.
N.
Hi @nonix, thanks for reporting this critical bug. The '11' code means a segmentation fault (SIGSEGV), most likely an invalid pointer access somewhere. It will be hard to debug without repro steps or a core dump from a debug build. I wonder if you could kindly help us?
The best option is to run TT under GDB in debug mode, if you are familiar with GDB.
If you can repro the crash, could you please share the whole data folder with us? Since TT can't even restart, the binary data files might be corrupted. I would set the log level to debug in tt.conf before restarting and see which files are causing the problem. If those files are deleted, can TT restart?
It seems you ran TT on a 32-bit Orange Pi Zero. What specific OS version do you use?
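As a rough aid for reading a backtrace like the one in the original report, the sketch below pulls the module and offset out of each frame and builds `addr2line` commands for the frames that belong to the TT binary. The helper names (`parse_frames`, `addr2line_commands`, `signal_name`) are made up for illustration and are not part of TickTockDB. It assumes the binary was built with debug info and that the offset in parentheses is what `addr2line` expects, which is usually the case for position-independent executables; without debug info `addr2line` will only print `??`.

```python
import re
import signal

# Matches glibc-style backtrace frames such as "bin/tt(+0x8754)[0x42f754]"
# or "libc.so.6(__libc_start_main+0x5d)[0xb6c173ca]".
FRAME_RE = re.compile(
    r"^(?P<module>\S+)\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def signal_name(number):
    """Translate a signal number (e.g. 11) into its symbolic name."""
    return signal.Signals(number).name


def parse_frames(text):
    """Return (module, symbol, offset) for every line that looks like a frame."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match["module"], match["symbol"], match["offset"]))
    return frames


def addr2line_commands(frames, binary="bin/tt"):
    """Build one addr2line command per symbol-less frame inside `binary`.

    -f prints the function name, -C demangles C++ names, -e selects the file.
    """
    return [
        f"addr2line -f -C -e {binary} {offset}"
        for module, symbol, offset in frames
        if module == binary and not symbol
    ]
```

Feeding the backtrace text to `parse_frames` and printing the result of `addr2line_commands` gives a list of commands to run on the machine that holds the same `bin/tt` build. On Linux, `signal_name(11)` returns `SIGSEGV`, which matches the "Interrupted (11)" line.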
Yeah, I know. The problem is that I have already deleted the data set that caused the issue. I am so sorry (not a good idea to act first and think later :-( ). Anyway, let's put this on ice for the moment; as soon as I hit the same issue again, I will report back.
Thank you guys,
N.
nonix@orangepizero:~/ticktock$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Armbian 24.11.0-trunk.283 bookworm
Release: 12
Codename: bookworm
nonix@orangepizero:~/ticktock$ uname -a
Linux orangepizero 6.6.54-current-sunxi #2 SMP Fri Oct 4 14:30:05 UTC 2024 armv7l GNU/Linux
glibc version: 2.36
@nonix I can't find the same OS as yours for my orangepi-zero-2, so I just picked the default one on armbian.com. Note that it is 64-bit instead of 32-bit, so I am not sure I can repro your scenario.
To confirm, did you mean that
- you ran TT remotely in a PuTTY (SSH) window,
- the PuTTY window closed by itself overnight, which shut TT down ungracefully (since you didn't run TT with nohup, I guess),
- the next morning you couldn't restart TT from the same data folder, and
- after you removed all data, TT could restart?
FYI, here is the OS I am testing on my orangepi-zero-2.
ylin30@orangepizero2:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description: Armbian 24.5.3 bookworm
Release: 12
Codename: bookworm
ylin30@orangepizero2:~$ uname -a
Linux orangepizero2 6.6.31-current-sunxi64 #1 SMP Fri May 17 10:02:40 UTC 2024 aarch64 GNU/Linux
ylin30@orangepizero2:~$
Hi, you have got it 100% right.
- Yes
- Yes
- Yes
- Yes
Thank you kindly for looking into the problem
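Since the scenario confirmed above starts with TT being killed when the SSH window closed, here is a minimal sketch of one way to start a long-running server so that it does not depend on the terminal. It uses Python's `subprocess.Popen` with `start_new_session=True`, which puts the child in its own session. This is an illustration only, not something the TickTockDB project prescribes; `nohup` or a systemd unit would be the more common route. The function name `launch_detached` and the log file name in the usage note are hypothetical.

```python
import os
import subprocess


def launch_detached(cmd, log_path):
    """Start `cmd` in its own session so a closing terminal does not hang it up.

    stdout and stderr are appended to `log_path`; stdin is detached.
    """
    log = open(log_path, "ab")
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # the child calls setsid() before exec
        )
    finally:
        # The child holds its own copy of the descriptor, so the parent
        # can close this one right away.
        log.close()
    return proc
```

With the command line from the original report, a call would look like `launch_detached(["bin/tt", "-c", "conf/tt.conf"], "tt.out")`. This only avoids the hangup; it does not protect the data files against power loss or a forced kill.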
You are running it on 64-bit ARM, while mine is 32-bit (ARMv7). I have an old Orange Pi Zero: http://www.orangepi.org/html/hardWare/computerAndMicrocontrollers/details/Orange-Pi-Zero.html
Update: We finally got a repro (not exactly the same, but similar) after an ungraceful shutdown. @ytyou is actively looking into it.
@nonix What is your open-files limit (#openfiles)? You can run
ulimit -a
and look for "open files" in the output.
nonix@orangepizero:~$ ulimit -aH
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2994
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 65535
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) unlimited
max user processes (-u) 2994
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
nonix@orangepizero:~$ ulimit -a
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 2994
max locked memory (kbytes, -l) 65536
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 2994
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
@nonix Your hard limit for open files is 65535, which looks OK to me. Note that the soft limit shown by plain ulimit -a is 1024. If the limit in effect is too low (e.g., the default of 1024), TT may crash because it runs out of file handles.
We have identified and fixed a bug in Write-Ahead Logging (WAL). The fix is in the branch debug-wal. We added a unit test and have been running stress tests over the past few days. So far we have not been able to reproduce the crash. We hope the fix will be in the next minor release.
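The two `ulimit` listings above differ because one shows hard limits and the other shows soft limits, and it is the soft limit that a process actually runs into. The sketch below reads both values for the current process with Python's standard `resource` module. The function names and the 1024 threshold are assumptions for illustration, taken from the "1024 by default" example in this thread; they are not a documented TickTockDB requirement.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def effective_limit_is_low(soft, threshold=1024):
    """Flag a soft limit at or below `threshold` (an assumed cut-off).

    resource.RLIM_INFINITY means "no limit", so it is never considered low.
    """
    if soft == resource.RLIM_INFINITY:
        return False
    return soft <= threshold
```

Calling `open_file_limits()` in a shell session configured like the one above would be expected to report a soft limit of 1024 and a hard limit of 65535. A process started from that shell inherits the soft value unless it is raised first, for example with `ulimit -n` before launching.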