ipfs/js-datastore-s3

Change key format for better sharding performance

Closed this issue · 3 comments

Behind the scenes of an S3 bucket, objects are sharded based on their keys when the bucket is under load. Shards are created per prefix, and a top-level prefix is the characters in the key up to the first delimiter ('/'). Each prefix has a rate limit that kicks in after a certain number of requests to that prefix.

Our key format is ${path}/${key}, where ${key} is a base32-encoded multihash for blocks or a plain string for datastore keys. This means that if you specify the (optional) path, S3 will try to put all of your data into one shard, killing performance. If you don't specify a path, S3 will treat all your keys as prefixless and put them in one shard, also killing performance.
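To illustrate (the block keys below are made up, and this is not the library's actual code): with the current format every object key shares the same top-level prefix, so S3 maps them all to the same partition.

```js
// Illustration only, with made-up block keys: every object key produced by
// the current `${path}/${key}` format shares a single top-level prefix.
const path = 'my-path' // the optional path option
const keys = [
  'CIQA4XCGEEARGNSNIG4EOJ4DQGFOSTH3PL63MVCAR53JVYMHQMJUMFY',
  'CIQB4655YD5GLBB7WWEUAHCO6QONU5ICBP6WSC4FIFEMXKNSZMQDBZI'
]

for (const key of keys) {
  const objectKey = `${path}/${key}`
  // S3's top-level prefix is everything up to the first '/'
  console.log(objectKey.split('/')[0], '=>', objectKey) // always 'my-path'
}
```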

We can avoid this by dropping the path parameter, reversing the key (since the first few characters of a base32-encoded multihash are stable) and adding '/' characters at known locations to create random-ish prefixes (though maybe we only want to do this if we're given a key that can be parsed as a base32-encoded multihash?) - see the sketch below. The prefix length used to be the first 6-8 characters, but TBH it's not clear to me what it is now.
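A minimal sketch of that transform - the delimiter positions (2/2/rest) and the example key are assumptions for illustration, not settled design:

```js
// Sketch of the proposed key transform - the slash positions and the
// example key below are made up for illustration.
function toShardedKey (key) {
  // Reverse the key so the high-entropy tail of the base32 multihash comes
  // first, instead of the stable 'CIQ...' head.
  const reversed = key.split('').reverse().join('')
  // Insert '/' at fixed offsets so the leading characters form a
  // pseudo-random top-level prefix.
  return `${reversed.slice(0, 2)}/${reversed.slice(2, 4)}/${reversed.slice(4)}`
}

console.log(toShardedKey('CIQA4XCGEEARGNSNIG4EOJ4DQGFOSTH3PL63MVCAR53JVYMHQMJUMFY'))
// => 'YF/MU/JMQHMYVJ35RACVM36LP3HTSOFGQD4JOE4GINSNGRAEEGCX4AQIC'
```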

This may help prevent the sort of service degradation 3box experienced recently while using this datastore.

Refs:

cc: @oed - you've mentioned S3 sharding in the post-mortem linked to above - would this sort of change have helped prevent the service degradation you encountered?

oed commented

I will let @zachferland weigh in more here, but our initial finding when looking at this was that if you use `datastore-fs`, which is the default in the Node.js runtime, the "sharding" flag is enabled in ipfs-repo. This flag essentially takes the two last characters of the base32 multihash and uses them as folders. When we started using S3 as a backend via datastore-s3, this "sharding" flag got disabled. I think the problem could be solved fairly trivially by simply enabling the sharding flag when using datastore-s3 as well.
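Roughly what that flag does - a sketch assuming the "next-to-last" scheme used by datastore-core's shard.NextToLast (the folder name comes from the characters just before the last one, rather than the very last ones); the example key is made up:

```js
// Sketch of a "next-to-last" shard function, assuming it matches
// datastore-core's shard.NextToLast: the folder name is taken from the
// `len` characters immediately before the last character of the key.
function nextToLastShard (key, len = 2) {
  const name = key.slice(0, -1).slice(-len)
  return `${name}/${key}`
}

// made-up block key for illustration
console.log(nextToLastShard('CIQA4XCGEEARGNSNIG4EOJ4DQGFOSTH3PL63MVCAR53JVYMHQMJUMFY'))
// => 'MF/CIQA4XCGEEARGNSNIG4EOJ4DQGFOSTH3PL63MVCAR53JVYMHQMJUMFY'
```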

It sounds like this datastore needs to have sharding turned on by default, which will require a major release and some sort of update script.

Sharding is turned on by default for blocks in ipfs-repo but not for other stores and not in the browser.

In this datastore it's pretty essential and there doesn't seem to be a good reason not to have sharding turned on.

I'm going to close this issue because the solution I've outlined above isn't the right one - we just need to wrap the datastore in a ShardingDatastore at the repo level.
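For reference, a sketch of what that wrapping could look like, assuming the datastore-core API of the time (ShardingDatastore.createOrOpen and shard.NextToLast); the bucket name and repo path are hypothetical:

```js
const S3 = require('aws-sdk').S3
const S3Datastore = require('datastore-s3')
const { ShardingDatastore, shard } = require('datastore-core')

async function main () {
  // hypothetical bucket name and repo path
  const s3 = new S3({ params: { Bucket: 'my-ipfs-bucket' } })
  const raw = new S3Datastore('.ipfs/datastore', { s3 })

  // Wrap the raw S3 store so every key gets a short prefix folder derived
  // from its tail, spreading objects across many S3 partitions.
  const store = await ShardingDatastore.createOrOpen(raw, new shard.NextToLast(2))

  // ... hand `store` to ipfs-repo as the block store
}

main().catch(console.error)
```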