Extremely slow booting
jasl opened this issue · 9 comments
Coming from paritytech/polkadot-sdk#13.
I captured nperf data on my Azure VM (premium SSD); the node runs in a Docker container, and nperf was run inside the container as well.
The current status is that the node is stuck on booting, and it is not caused by Cumulus. The node was freshly synced last week (it was stopped once the sync speed dropped to 0 bps).
Here are the flame graphs I recorded at the start of booting.
The file system is ext4 with no special configuration; the node runs with --state-pruning=archive-canonical --blocks-pruning=archive-canonical.
This may happen if there was an abnormal shutdown and fsync is slow for whatever reason. The log file queue may grow to gigabytes, and it will take a long time to re-apply all logs on startup. It has been fixed in 0.4.7. The log queue is now limited to about 256Mb.
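A quick way to see how much pending write-ahead log the node will have to replay on the next start is to sum the sizes of the logN files in the database directory (the same files listed further down in this thread). Below is a minimal sketch using only the Rust standard library; the path in main is a hypothetical example and should point at the actual chains/&lt;chain&gt;/paritydb/full/ directory.

```rust
use std::{fs, path::Path};

/// Sum the sizes of parity-db write-ahead log files ("log0", "log1", ...)
/// found in a database directory.
fn wal_backlog_bytes(db_dir: &Path) -> std::io::Result<u64> {
    let mut total = 0u64;
    for entry in fs::read_dir(db_dir)? {
        let entry = entry?;
        let name = entry.file_name();
        let name = name.to_string_lossy();
        // Log files are named "log" followed by a number, e.g. "log3".
        if let Some(suffix) = name.strip_prefix("log") {
            if !suffix.is_empty() && suffix.chars().all(|c| c.is_ascii_digit()) {
                total += entry.metadata()?.len();
            }
        }
    }
    Ok(total)
}

fn main() -> std::io::Result<()> {
    // Hypothetical path; point this at the actual chain database directory.
    let dir = Path::new("chains/khala/paritydb/full");
    let bytes = wal_backlog_bytes(dir)?;
    println!("pending write-ahead log: {:.1} MB", bytes as f64 / (1024.0 * 1024.0));
    Ok(())
}
```

A total far above the ~256Mb cap mentioned above would suggest the database was last written by an older version or was shut down abnormally.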
Upgraded to the latest parity-db, and the slow boot problem still exists (it takes about 8 minutes to start the node, which uses an Azure premium SSD). Here's another flame graph.
UPDATE: after a very slow boot, re-running the node finishes booting almost instantly. What's the magic?
Here's another flame graph of the node getting stuck on exit.
Are you sure this database was created with parity-db 0.4.7? Previous versions created a long queue of write-ahead log files that would take a long time to process on start. So if you already have a database created by an older version and open it with 0.4.7, it will be slow the first time.
I can confirm I'm using 0.4.7, and I removed the old database before starting the new sync.
The boot time is faster than with the previous version (the previous version took more than half an hour to start; 0.4.7 takes about 7 minutes).
What does ls -alh log* show inside the chains/polkadot/paritydb/full/ dir before the slow start?
ls -alh chains/khala/paritydb/full/ | grep log
-rw-r--r-- 1 root root 0 May 14 20:21 log0
-rw-r--r-- 1 root root 65M May 14 20:15 log1
-rw-r--r-- 1 root root 65M May 14 20:15 log2
-rw-r--r-- 1 root root 65M May 14 20:15 log3
-rw-r--r-- 1 root root 65M May 14 20:15 log4
-rw-r--r-- 1 root root 65M May 14 20:15 log5
-rw-r--r-- 1 root root 65M May 14 20:15 log6
-rw-r--r-- 1 root root 65M May 14 20:21 log7
ls -alh polkadot/chains/ksmcc3/paritydb/full/ | grep log
-rw-r--r-- 1 root root 57M May 14 20:19 log0
-rw-r--r-- 1 root root 75M May 14 20:15 log1
-rw-r--r-- 1 root root 65M May 14 20:15 log2
-rw-r--r-- 1 root root 88M May 14 20:15 log3
-rw-r--r-- 1 root root 65M May 14 20:15 log4
-rw-r--r-- 1 root root 65M May 14 20:16 log5
-rw-r--r-- 1 root root 72M May 14 20:16 log6
e4defrag 1.44.1 (24-Mar-2018)
&lt;Fragmented files&gt;                                                      now/best  size/ext
1. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/index_06_16     2077/1      8 KB
2. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_cc        2/1      8 KB
3. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_cf        2/1      8 KB
4. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_06_ce        2/1      8 KB
5. /data/khala-node/polkadot/chains/ksmcc3/paritydb/full/table_08_d3        2/1      8 KB

 Total/best extents                     17099/1270
 Average size per extent                5411 KB
 Fragmentation score                    1
 [0-30 no problem: 31-55 a little bit fragmented: 56- needs defrag]
 This directory (/data) does not need defragmentation.
 Done.
This looks normal. If it takes 8 minutes to fsync ~500Mb to disk, you might be hitting the IOPS limit of the disk or VM. Check your VM configuration or try different hardware.
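To check whether a disk or VM IOPS cap is the real bottleneck, it can help to measure synchronous write performance directly, independently of parity-db. The sketch below is a rough benchmark (the file path, block size, and write count are arbitrary assumptions) that writes small blocks and calls fsync after each one, loosely mimicking write-ahead-log commits.

```rust
use std::{fs, fs::OpenOptions, io::Write, time::Instant};

// Rough fsync benchmark: write 4 KiB blocks and call fsync after each one.
// Place the test file on the same volume as the database.
fn main() -> std::io::Result<()> {
    const BLOCK: usize = 4 * 1024;
    const WRITES: usize = 1024; // 4 MiB total, enough for a quick estimate

    let path = "/data/fsync_test.tmp";
    let mut file = OpenOptions::new()
        .create(true)
        .write(true)
        .truncate(true)
        .open(path)?;

    let buf = vec![0u8; BLOCK];
    let start = Instant::now();
    for _ in 0..WRITES {
        file.write_all(&buf)?;
        file.sync_data()?; // force the block to stable storage
    }
    let secs = start.elapsed().as_secs_f64();

    println!(
        "{:.0} fsyncs/s, {:.2} MB/s",
        WRITES as f64 / secs,
        (WRITES * BLOCK) as f64 / secs / (1024.0 * 1024.0)
    );

    fs::remove_file(path)?;
    Ok(())
}
```

A low fsyncs/s figure points at the storage layer (provisioned IOPS of the disk or VM) rather than at parity-db itself.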