paritytech/parity-db

Extremely slow booting

jasl opened this issue · 9 comments

jasl commented

Come from paritytech/polkadot-sdk#13

I captured nperf data from my Azure VM (premium SSD); the node runs in a Docker container, and nperf runs inside the container too.

Current status: the node is stuck on booting, and it is not caused by Cumulus. The node finished a fresh sync last week (I stopped it when sync speed dropped to 0 bps).

Here are flame graphs I recorded at boot start:

from_start_2
from_start

flaminggraph.zip

The file system is ext4 without any special configuration, and the node runs with `--state-pruning=archive-canonical --blocks-pruning=archive-canonical`.

This may happen if there was an abnormal shutdown and fsync is slow for whatever reason. The write-ahead log file queue may grow to gigabytes, and re-applying all those logs on startup takes a long time. This has been fixed in 0.4.7: the log queue is now limited to about 256 MB.
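The expected replay cost can be estimated before starting the node by summing the sizes of the queued log files. A minimal sketch, assuming `DB_DIR` points at your node's paritydb `full` directory (adjust the path to your chain):

```shell
# Sketch: estimate how much write-ahead log the node must re-apply on boot.
# Assumption: DB_DIR is the node's paritydb "full" directory.
DB_DIR="${DB_DIR:-chains/khala/paritydb/full}"
# Sum the log file sizes; a multi-gigabyte total predicts a long startup replay.
du -ch "$DB_DIR"/log* 2>/dev/null | tail -n1
```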

jasl commented

boot

Upgraded to the latest parity-db; the slow boot problem still exists (it takes about 8 minutes to start the node, which uses an Azure premium SSD). Here's another flame graph:

boot.svg.zip

UPDATE: after a very slow boot, re-running the node finishes booting nearly instantly. What's the magic?

jasl commented

exit
exit.svg.zip

Here's another flame graph, for the node stuck on exiting.

> Upgraded to the latest parity-db; the slow boot problem still exists (it takes about 8 minutes to start the node, which uses an Azure premium SSD). Here's another flame graph

Are you sure this database was created with parity-db 0.4.7? Previous versions created a long queue of write-ahead log files that takes a long time to process on start. So if you already had a database created by an older version and open it with 0.4.7, it will be slow the first time.

jasl commented

> Upgraded to the latest parity-db; the slow boot problem still exists (it takes about 8 minutes to start the node, which uses an Azure premium SSD). Here's another flame graph

> Are you sure this database was created with parity-db 0.4.7? Previous versions created a long queue of write-ahead log files that takes a long time to process on start. So if you already had a database created by an older version and open it with 0.4.7, it will be slow the first time.

I can confirm I'm using 0.4.7, and I removed the old database before doing the new sync.

Boot time is faster than with the previous version (the previous version took more than half an hour to start; 0.4.7 takes about 7 minutes).

What does `ls -alh log*` show inside the `chains/polkadot/paritydb/full/` dir before the slow start?

jasl commented
```
ls -alh chains/khala/paritydb/full/ | grep log
-rw-r--r-- 1 root root     0 May 14 20:21 log0
-rw-r--r-- 1 root root   65M May 14 20:15 log1
-rw-r--r-- 1 root root   65M May 14 20:15 log2
-rw-r--r-- 1 root root   65M May 14 20:15 log3
-rw-r--r-- 1 root root   65M May 14 20:15 log4
-rw-r--r-- 1 root root   65M May 14 20:15 log5
-rw-r--r-- 1 root root   65M May 14 20:15 log6
-rw-r--r-- 1 root root   65M May 14 20:21 log7
```

```
ls -alh polkadot/chains/ksmcc3/paritydb/full/ | grep log
-rw-r--r-- 1 root root   57M May 14 20:19 log0
-rw-r--r-- 1 root root   75M May 14 20:15 log1
-rw-r--r-- 1 root root   65M May 14 20:15 log2
-rw-r--r-- 1 root root   88M May 14 20:15 log3
-rw-r--r-- 1 root root   65M May 14 20:15 log4
-rw-r--r-- 1 root root   65M May 14 20:16 log5
-rw-r--r-- 1 root root   72M May 14 20:16 log6
```
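As a quick sanity check, the listed sizes can be totalled (values copied from the listings above; the empty khala `log0` is omitted): the khala queue holds roughly 455 MB and the ksmcc3 queue roughly 487 MB of pending logs.

```shell
# Totals of the log file sizes shown above, in MB (khala log0 is 0 and omitted).
printf '%s\n' 65 65 65 65 65 65 65 | awk '{s+=$1} END {print s " MB (khala)"}'
printf '%s\n' 57 75 65 88 65 65 72 | awk '{s+=$1} END {print s " MB (ksmcc3)"}'
```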
jasl commented
```
e4defrag 1.44.1 (24-Mar-2018)
<Fragmented files>                             now/best       size/ext
1. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/index_06_16
                                              2077/1              8 KB
2. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_cc
                                                 2/1              8 KB
3. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_cf
                                                 2/1              8 KB
4. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_06_ce
                                                 2/1              8 KB
5. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_d3
                                                 2/1              8 KB

 Total/best extents				17099/1270
 Average size per extent			5411 KB
 Fragmentation score				1
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/data) does not need defragmentation.
 Done.
```

This looks normal. If it takes 8 minutes to fsync ~500 MB to disk, you might be hitting IOPS limits on the disk or the VM. Check your VM configuration or try different hardware.
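One way to check whether synchronous write latency is the bottleneck is a rough probe with `dd` (GNU coreutils `dd` assumed; run it on the same filesystem as the database). This is a sketch, not a proper benchmark:

```shell
# Write 4 KB blocks, syncing each one to disk (oflag=dsync); on an
# IOPS-throttled cloud disk the reported throughput will be far below
# the disk's sequential speed.
dd if=/dev/zero of=./fsync_probe bs=4k count=200 oflag=dsync 2>&1 | tail -n1
rm -f ./fsync_probe
```

If the reported rate is only a few hundred KB/s, fsync-bound work like replaying a ~500 MB log queue will plausibly take minutes.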