openethereum/parity-ethereum

Initial Sync disk IO / write amplification / disk usage

c0deright opened this issue · 25 comments

  • Parity version: 1.7.0
  • Operating system: Ubuntu 16.04 LTS
  • And installed: via deb package from https://parity.io/parity.html
  • cmd line: parity daemon /foo/bar/parity.pid
  • config.toml
[ui]
disable = true

[network]
nat = "any"
discovery = true
no_warp = true
allow_ips = "public"

[rpc]
disable = true

[websockets]
disable = true

[ipc]
disable = true

[dapps]
disable = true

[footprint]
tracing = "off"
db_compaction = "ssd"
pruning = "archive"
cache_size = 55000

[snapshots]
disable_periodic = true

[misc]
log_file = "/foo/bar/parity.txt"

I'm trying to set up a full node with complete history, thus pruning=archive.

Disk IO looks like this on Amazon EC2 instance type c4.8xlarge:

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
xvdf           4252.00         0.36       519.55          0        519

parity is constantly writing to disk at ~300-500 MByte/s, with peaks reaching ~5000 IOPS.

What really bothers me is that parity is wasting so much disk space. There are times when I can see /home growing by 1 GB/s just by watching df -h /home.

Some time after 2 million blocks had been imported, disk usage on /home was ~80 GB from parity alone. When I stop parity, that 80 GB magically shrinks to ~37 GB, only to grow at 1 GB/s again after restarting parity.

Parity even ran out of disk space after filling up a 100GB EBS volume on Amazon AWS and at that time it had only downloaded about 50% of blocks.

My questions are:

  • Why is parity writing such a high volume to disk even though I've set the cache size to 55 GB?
    • Is this some sort of write amplification? It doesn't make sense to me that an application downloading less than 1 MB/s from the internet writes 500 MB/s to disk at all times. (A rough way to measure the ratio is sketched below.)
  • Why is parity temporarily using so much disk space? Even without restarting parity, disk usage sometimes drops by 20 GB or more from one second to the next.
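A rough way to put a number on the write amplification (a minimal sketch, Linux-only; the device and interface names are assumptions and will need adjusting for your setup) is to compare bytes written to the block device with bytes received over the network:

# Rough write-amplification probe: compares bytes written to the block
# device holding the Parity database with bytes received over the network.
# Linux only (reads /proc). DEVICE and IFACE are assumptions.
import time

DEVICE = "xvdf"   # block device holding the Parity database
IFACE = "eth0"    # network interface parity syncs over
SECTOR = 512      # /proc/diskstats counts 512-byte sectors

def sectors_written(dev):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == dev:
                return int(fields[9])   # 10th field: sectors written
    raise ValueError("device not found: " + dev)

def bytes_received(iface):
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                return int(line.split(":")[1].split()[0])   # rx bytes
    raise ValueError("interface not found: " + iface)

w0, r0 = sectors_written(DEVICE), bytes_received(IFACE)
while True:
    time.sleep(60)
    w1, r1 = sectors_written(DEVICE), bytes_received(IFACE)
    written = (w1 - w0) * SECTOR
    received = r1 - r0
    ratio = written / float(received) if received else float("inf")
    print("wrote {0:.0f} MB, received {1:.0f} MB, amplification ~{2:.0f}x".format(
        written / 1e6, received / 1e6, ratio))
    w0, r0 = w1, r1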
5chdn commented

You are not alone with this issue.

Probably related, but not obviously:

I am the user who posted

https://ethereum.stackexchange.com/questions/24158/speed-of-syncing-the-chain-in-parity-using-archive-pruning-mode

The size of my node when I don't run parity is about 235GB. When I launched it overnight with this command

parity --pruning archive --snapshot-peers 40 --cache-size-db 256 --cache-size-blocks 128 --cache-size-queue 256 --cache-size-state 256 --cache-size 4096 --db-compaction hdd

it peaked at 430GB in the morning, and when I closed parity it went back to around 230GB.

For information about my system i use:

  • Parity: version Parity/v1.7.0-beta-5f2cabd-20170727/x86_64-macos/rustc1.18.0

  • MacOS Sierra Version 10.12.2 (16C67) on an iMac 27-inch Mid 2011 with 24GB of DDR3 ram

And here's a pastebin of my parity sync log:

https://pastebin.com/pXZTL7G6

Disk Usage while still Syncing

The node was syncing until late Sunday, when disk usage dropped and then stayed down.

Disk usage right now is 218 GB; while the sync was active, parity used more than twice that amount (488 GB).

Cache settings do not really affect write amplification. They are designed to reduce read amplification and minimize block processing times when the node is up to date. db_compaction is the only option that trades write amplification for space, IIRC. Parity uses RocksDB as the underlying database. Importing a block involves adding a lot of random key-value pairs to the database, and some space is preallocated for faster insertion of new keys. Unused space is freed by the background compaction process. See here for more details:
https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide

We have a long-term plan to move to a custom database backend that would allow for more efficient state I/O.
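To make the compaction behaviour above concrete, here is a toy illustration (not Parity code; it assumes the third-party python-rocksdb bindings and a few hundred MB of scratch space): on-disk size only shrinks once stale data is rewritten by a compaction.

# Toy illustration of RocksDB space usage -- not Parity code.
# Requires the third-party python-rocksdb bindings (pip install python-rocksdb).
import os
import shutil
import rocksdb

def dir_size_mb(path):
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / 1e6

path = "rocksdb-demo"
shutil.rmtree(path, ignore_errors=True)
db = rocksdb.DB(path, rocksdb.Options(create_if_missing=True))

# Write ~200 MB of random key-value pairs, then delete them all.
# The deletes only add tombstones; the old values still sit in the SST files.
for i in range(200000):
    db.put(("key-%d" % i).encode(), os.urandom(1024))
for i in range(200000):
    db.delete(("key-%d" % i).encode())

print("before compaction: %.0f MB" % dir_size_mb(path))
db.compact_range()   # what the background compaction does eventually
print("after compaction:  %.0f MB" % dir_size_mb(path))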

Subscribing to this as well. It's so annoying; parity is constantly eating I/O like a monster, 1000 times more than Bitcoin or any Bitcoin-based coin.

Is there currently no workaround to control/limit the I/O without breaking the syncing process, @arkpar?

@gituser could you post logs?

Disk Usage today
(Times are UTC +0200)

Parity running with

[footprint]
pruning = "archive"

It just went from 250G to 500G in 1 hour, filling the volume.

After resizing the volume to 750G, disk usage dropped back to 250G as soon as I started parity again.

I'm constantly having WTF moments working with parity :)

It seems we have to deploy monit to restart parity as soon as it goes nuts, to prevent it from filling the whole volume in minutes.

5chdn commented

Work on that has already started, but it involves a new database layer and a lot of refactoring. It will not be available before 1.8. #6418

The cause of this spike seems to be a reorg:

2017-09-04 09:30:14  Imported #4236763 fe21…0f40 (165 txs, 6.68 Mgas, 5883.20 ms, 24.42 KiB)
2017-09-04 09:30:26     1/25 peers     34 MiB chain   73 MiB db  0 bytes queue   23 KiB sync  RPC:  0 conn,  0 req/s, 14381 µs
2017-09-04 09:30:38  Imported #4236765 2d3e…f35e (63 txs, 2.86 Mgas, 1266.32 ms, 9.94 KiB)
2017-09-04 09:30:56     1/25 peers     34 MiB chain   73 MiB db  0 bytes queue   23 KiB sync  RPC:  0 conn,  0 req/s, 14381 µs
2017-09-04 09:30:59  Reorg to #4236765 0cca…518a (2d3e…f35e #4236764 2a6f…58cb )
2017-09-04 09:30:59  Imported #4236765 0cca…518a (159 txs, 6.69 Mgas, 3762.77 ms, 30.28 KiB)
2017-09-04 09:31:04  Imported #4236766 2c59…fd49 (184 txs, 6.68 Mgas, 2557.42 ms, 26.92 KiB)
2017-09-04 09:31:23  Imported #4236767 cfd6…0e80 (113 txs, 6.66 Mgas, 8276.83 ms, 20.67 KiB)

The growth starts at around 09:30, when the reorg is logged.

Reorgs happened often before, so it's not clear this has anything to do with the issue. For reference:

2017-09-04 04:09:13  Reorg to #4235967 931d…9fa8 (f0bb…9939 #4235965 7dec…6b4d c8c1…2510)
2017-09-04 04:21:15  Reorg to #4235996 e2b1…5bda (2ed2…bea8 #4235994 b3c6…65b0 1253…f871)
2017-09-04 05:16:45  Reorg to #4236119 c8cb…3545 (aacf…74b8 #4236118 131a…5288 )
2017-09-04 05:27:31  Reorg to #4236146 40bc…1c7d (4b8f…1cf5 #4236144 b689…249b d88e…949e)
2017-09-04 05:41:06  Reorg to #4236172 3cb3…7aa6 (d1ab…78f9 #4236170 8e5a…9b80 e40d…352a)
2017-09-04 05:52:36  Reorg to #4236198 1508…d746 (c450…7ee6 #4236196 c96a…c7ce a5d4…98a6)
2017-09-04 06:24:51  Reorg to #4236272 3633…abc7 (5ed8…b8c4 #4236270 0454…a9d7 6d2c…b25e)
2017-09-04 06:34:31  Reorg to #4236303 39c3…15fe (d88f…c139 #4236302 d470…3f4d )
2017-09-04 06:35:20  Reorg to #4236306 4ce7…ee98 (7d7e…5102 #4236304 6fe2…bbb9 66cb…932b)
2017-09-04 07:04:32  Reorg to #4236389 cc2c…8777 (8c73…575d #4236387 3ae2…d444 5bf5…da71)
2017-09-04 07:10:07  Reorg to #4236408 a094…9bc0 (3940…4ed9 #4236406 9bba…d53a 867b…f650)
2017-09-04 07:10:07  Reorg to #4236408 6ae1…972a (a094…9bc0 867b…f650 #4236406 9bba…d53a 3940…4ed9)
2017-09-04 07:54:43  Reorg to #4236507 cc53…1e5e (5472…7834 #4236506 c949…f930 )
2017-09-04 08:12:32  Reorg to #4236552 1402…1940 (d398…8aed #4236550 8dbf…055d d5f5…1a21)
2017-09-04 08:16:19  Reorg to #4236565 9ab9…866d (d499…8864 d132…bcb1 #4236563 2d7e…36b6 4b7b…4307)
2017-09-04 08:44:54  Reorg to #4236631 58ad…06c8 (0012…8979 #4236630 81ca…ed52 )
2017-09-04 09:06:19  Reorg to #4236703 781f…417f (f858…ab4e #4236702 21e4…8cbc )
2017-09-04 09:09:26  Reorg to #4236713 485b…db86 (43e6…a7ca #4236711 2e33…bff9 e667…582f)
2017-09-04 09:24:57  Reorg to #4236746 b5d2…d179 (4f2e…9503 #4236744 5657…8057 ff00…b65f)
2017-09-04 09:30:59  Reorg to #4236765 0cca…518a (2d3e…f35e #4236764 2a6f…58cb )
2017-09-04 09:40:53  Reorg to #4236788 195e…132b (9cba…ad41 #4236787 fafe…af79 )
2017-09-04 09:48:20  Reorg to #4236808 d6e5…9fd1 (55d2…e9a7 #4236807 c9ed…1dd2 )
2017-09-04 09:55:50  Reorg to #4236822 121d…7faf (a1f1…27d2 #4236820 9c71…a3da dcea…08f3)
2017-09-04 10:33:52  Reorg to #4236913 8596…af57 (24c1…a231 24e2…6124 #4236910 bbca…4a4e 3a65…3403 e9d5…e5d5)
2017-09-04 10:43:20  Reorg to #4236934 8c5e…70ee (e15b…bb99 #4236932 6418…6856 3330…2be1)
2017-09-04 10:52:00  Reorg to #4236959 56ba…53c4 (51c5…8cdb 7c08…8f92 #4236957 ad4e…c246 97c6…4a75)
2017-09-04 11:04:25  Reorg to #4236988 d12b…14e8 (073b…1ba9 #4236986 7d5c…d551 5194…2a10)
2017-09-04 11:07:14  Reorg to #4237000 6de0…f926 (bace…f9fc #4236998 2c6a…baeb 1806…e8e7)
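A quick way to tally the reorg lines above per hour, so they can be lined up against disk-usage samples (a minimal sketch; it assumes the log format shown above and the log_file path from the config in the opening post):

# Count "Reorg to" lines per hour in the parity log.
import re
from collections import Counter

LOG = "/foo/bar/parity.txt"   # log_file from the config above
pattern = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2}\s+Reorg to")

reorgs_per_hour = Counter()
with open(LOG) as f:
    for line in f:
        m = pattern.match(line)
        if m:
            reorgs_per_hour[m.group(1)] += 1

for hour in sorted(reorgs_per_hour):
    print(hour, reorgs_per_hour[hour])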

The thing about archiveDB is that it keeps everything. It will keep the full state of all blocks processed, even those that are eventually reorganized out of the chain. I'm not sure how well RocksDB handles having that much data, but it will definitely put a strain on your storage. I am not sure that even a specialized database (which we are in the process of building) would alleviate this much.

Something more useful might be a semi-pruned mode, where we discard non-canonical states after a certain point, but keep all state of canonical blocks.
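Purely as an illustration of the bookkeeping such a semi-pruned mode implies (not Parity's actual data structures; the 64-block horizon is an arbitrary assumption): canonical state is kept forever, and non-canonical state is dropped once it falls far enough behind the head.

# Illustrative sketch only -- not Parity's implementation.
KEEP_NON_CANONICAL = 64   # arbitrary horizon (assumption)

class SemiPrunedStateStore:
    def __init__(self):
        self.states = {}      # state_root -> state data
        self.by_block = {}    # block number -> {(state_root, is_canonical)}

    def insert(self, number, state_root, state, canonical):
        self.states[state_root] = state
        self.by_block.setdefault(number, set()).add((state_root, canonical))

    def mark_canonical(self, number, state_root):
        # A reorg can change which branch is canonical at a given height.
        roots = {r for r, _ in self.by_block.get(number, set())}
        self.by_block[number] = {(r, r == state_root) for r in roots}

    def prune(self, head_number):
        horizon = head_number - KEEP_NON_CANONICAL
        for number in [n for n in self.by_block if n < horizon]:
            kept = set()
            for state_root, canonical in self.by_block[number]:
                if canonical:
                    kept.add((state_root, canonical))
                else:
                    self.states.pop(state_root, None)   # drop reorged-out state
            self.by_block[number] = kept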

I created a chart of the chain folder size while syncing parity (v1.6.0) in --pruning fast and --pruning archive mode.

A) For the purple line, the zig-zag is due to the regular pruning that occurs, right?
B) But what I don't understand is the spikes in the blue line, in archive mode. It looks like some type of pruning as well - but there shouldn't be any, right?

[Chart: chain folder size over time, --pruning fast vs. --pruning archive]

Higher resolution: https://imgur.com/a/cx9et
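For reference, a chart like this can be produced with a throwaway sampler along these lines (a minimal sketch; the chains path is an assumption -- point it at your own --base-path):

# Append a timestamped size of the chain folder to a CSV once a minute.
import csv
import os
import time

CHAIN_DIR = os.path.expanduser("~/.local/share/io.parity.ethereum/chains")  # assumption
OUT = "chain-size.csv"

def dir_size_gb(path):
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                pass   # files come and go during compaction
    return total / 1e9

while True:
    with open(OUT, "a") as f:
        csv.writer(f).writerow([int(time.time()), round(dir_size_gb(CHAIN_DIR), 2)])
    time.sleep(60)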

This seems to be the result of RocksDB compaction. RocksDB writes a LOT to disk, and from time to time the data is compacted, resulting in massive drops in disk usage.

Just for documentation:

CPU Usage and load average

CPU usage has a very odd pattern, ramping up over the course of ~48 hours and then repeating. I guess this is the result of RocksDB's compaction process running every so often.

I'm not complaining about ~20% CPU usage - this is just for documentation, as I suspect it is linked to the disk IO issue described here.

jlopp commented

After two weeks I've been unable to sync a fresh Parity 1.7.2 archive node; the disk usage is so bad that I had to write a script that restarts the node every half hour to free up space. Now that I've finally got within sight of the chain tip, parity won't start up at all and just spins, maxing out disk I/O without syncing any blocks or logging any activity.

I'm experiencing the exact same thing as @jlopp, even wrote a similar script.

jlopp commented

Just to follow up on this, we were syncing Parity 1.7.2 with a 500 GB disk. Eventually we increased it to a 1 TB disk and were able to complete the sync. So there definitely appears to be a huge inefficiency somewhere that is causing the disk usage to be far higher than it needs to be. I just checked and one of our nodes that is still syncing is using 660GB of disk space, but if I restart parity it drops to 300GB.

Yep. Looks like a more permanent fix has been pushed out to 1.9 - as an interim solution, is there some way that Parity could trigger DB compaction more frequently, instead of having to stop and restart the process?

5chdn commented

We are looking for a more permanent solution to this and have started working on our own database implementation: https://github.com/debris/paritydb/

But 1.8 is about to be released very soon, therefore I modified the milestone.

jlopp commented

Cool; worth noting that I ran into similar issues with Ripple nodes - they also use RocksDB by default. Ripple ended up writing their own DB called NuDB and when we switched to it, the problems were fixed.

In case some weary traveler with finite disk space happens upon this github ticket before v1.9 comes out, here's my simple script to get sync working on a Mac:

import subprocess
import os
import time
import signal

while True:
    print("Running Parity...")
    proc = subprocess.Popen(['parity', '--tracing', 'on', '--pruning', 'archive'])
    print("Parity running with pid {0}".format(proc.pid))
    while True:
        # Check free space on the root volume every 30 seconds.
        time.sleep(30)
        # https://stackoverflow.com/a/787832
        s = os.statvfs('/')
        gigs_left = (s.f_bavail * s.f_frsize) / 1024 / 1024 / 1024
        print('{0} GB left'.format(gigs_left))
        if gigs_left < 90:
            break
    # SIGINT shuts parity down cleanly; the outer loop then restarts it,
    # which (as observed above) frees the temporarily used disk space.
    print("Terminating Parity...")
    os.kill(proc.pid, signal.SIGINT)
    proc.wait()
Has anyone suggested an archive mode that stores only the balances at each block?

I'm working on a fully decentralized accounting/auditing project that has been working fine since summer 2016, but over the last few weeks Parity has been constantly failing because its disk usage grows from 400GB to over 800GB about twice a day. This blows out my 1TB drive.

The recent article about the chain's size argues that archive mode is unneeded (and does not increase security) because one can always rebuild the state by replaying transactions. This is true, and a perfectly legitimate position, but it misses a point. Without some source for a "double-check" that the state rebuilt from transactions is accurate, it's impossible to have any faith in the results. You can end up at the end of the process with the same state, but what happens if you don't? You have a bug, and without an archive of previous states, finding that bug is impossible.

If there were a mode that stored balances at each block, my code (which is building state from transaction history) could double-check that it's correct and quickly identify problems. I know that some addresses don't even carry a balance, so this doesn't work for every address, but it would work for "accounting", where balances are all that really matter.

Upshot: add a feature called --pruning archive-balances that only stores balances per block.
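In the meantime, the closest thing I can do against an archive node is to ask for the balance at a historical block over JSON-RPC (eth_getBalance accepts an explicit block number) and compare it with what my replay computes. A minimal sketch -- it assumes the node's JSON-RPC server is enabled on the default 127.0.0.1:8545, and the address and block range are placeholders:

# Query balances at historical blocks from an archive node via JSON-RPC,
# for cross-checking against locally reconstructed balances.
import json
try:
    from urllib.request import urlopen, Request   # Python 3
except ImportError:
    from urllib2 import urlopen, Request           # Python 2

RPC = "http://127.0.0.1:8545"   # assumes the JSON-RPC server is enabled

def balance_at(address, block_number):
    payload = json.dumps({
        "jsonrpc": "2.0",
        "method": "eth_getBalance",
        "params": [address, hex(block_number)],
        "id": 1,
    }).encode()
    req = Request(RPC, payload, {"Content-Type": "application/json"})
    return int(json.loads(urlopen(req).read().decode())["result"], 16)

address = "0x1234567890123456789012345678901234567890"   # placeholder
for block in range(4000000, 4000010):                     # placeholder range
    print(block, balance_at(address, block))
    # ...compare against the balance the replay derives for this block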

Another possible solution would be to store a checkpoint of the state every X blocks and recompute state from the closest checkpoint, not from genesis.
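Purely to illustrate that checkpoint idea (not Parity code; apply_block() is a hypothetical stand-in for executing one block on top of a copy of the state):

# Illustrative checkpoint-and-replay sketch -- not Parity code.
CHECKPOINT_INTERVAL = 10000   # keep a full state copy every X blocks (assumption)

checkpoints = {}   # block number -> full state at that block

def maybe_checkpoint(number, state):
    if number % CHECKPOINT_INTERVAL == 0:
        checkpoints[number] = dict(state)   # snapshot a copy

def state_at(number, apply_block):
    # Replay from the closest checkpoint at or below `number`
    # instead of replaying from genesis.
    start = (number // CHECKPOINT_INTERVAL) * CHECKPOINT_INTERVAL
    state = dict(checkpoints[start])
    for n in range(start + 1, number + 1):
        state = apply_block(n, state)
    return state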

I asked a question a couple of days ago about the snapshot in Parity: (1) does the snapshot work even if one is not using archive mode, and (2) can I get at the data in the snapshot? There's a continuum from full archive mode to warp mode: storing just balances would be closer to full archive, and giving access to snapshots would be closer to warp mode. Both would work--balances at every block would be easier for my work, but either would be welcome because full archive is a real problem.

5chdn commented

🎉

Why do we care about old state anyway? Do smart contracts look back in time? I thought they can only see the blockchain and receipts; I don't believe they need to see the full history of each account, though perhaps I'm wrong. Nodes run synchronously and check the current state of current variables. My understanding is that by having the full state you can run any and all transactions and therefore collect transaction gas fees, etc.; with a partial database you could not run every transaction that is broadcast, but you could still run a lot. Having said that, I reckon it would be feasible to write a node that selectively dumps large chunks of old unused state, based on some kind of opinionated and largely pessimistic analysis of the chance of a future transaction ever touching it. There must be a fair amount of junk data in there. Plus, the heavy use of int256 means a lot of 0x000000 padding in front of your digits.