gluster/glusterd2

Need to avoid setting 'all-bricks-in-cluster' in transaction context for volume {create,expand,replace-brick}

kshlm opened this issue · 7 comments

kshlm commented

Volume create, expand and replace-brick transactions perform brick validation to ensure that the new bricks are not already existing bricks, or part of already existing bricks. To do this, these transactions set a list of all bricks in the cluster in the transaction context.

Setting a list of all bricks into context becomes an expensive operation when working with a large number of volumes (~1000). This has several effects,

  • Leads to hitting the etcd gRPC message size limits (default value of 2MB). This affects both the etcd servers and the etcd clientv3 client used by the GD2 store.
  • Leads to a large etcd database. Setting large values in etcd causes the database to quickly reach its size limits.

We're attempting to solve the above by,

  • increasing the gRPC message limits for the etcd server and the etcd clientv3 client in GD2 (a rough client-side sketch follows this list).
  • increasing the maximum database size for the etcd servers, and enabling periodic auto compaction and snapshots.
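
Something along these lines on the client side; the endpoint and the 16MB limits below are only illustrative, not the values GD2 actually ships with. The server side would be tuned through the corresponding etcd flags (--max-request-bytes, --quota-backend-bytes, --auto-compaction-mode, --auto-compaction-retention).

```go
package main

import (
	"log"
	"time"

	"go.etcd.io/etcd/clientv3"
)

// newStoreClient builds a clientv3 client with raised per-call gRPC message
// limits, so that large values (such as a serialized all-bricks list) don't
// fail the RPC with a "message larger than max" error.
func newStoreClient() (*clientv3.Client, error) {
	return clientv3.New(clientv3.Config{
		Endpoints:          []string{"http://127.0.0.1:2379"},
		DialTimeout:        10 * time.Second,
		MaxCallSendMsgSize: 16 * 1024 * 1024, // illustrative 16MB limit
		MaxCallRecvMsgSize: 16 * 1024 * 1024,
	})
}

func main() {
	c, err := newStoreClient()
	if err != nil {
		log.Fatalf("failed to create etcd client: %v", err)
	}
	defer c.Close()
}
```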

While this helps, we'll surely hit the limits again as we reach larger volume counts.

To give ourselves more headroom, we need to try to avoid setting the list of all bricks. There are two ways we can go about doing this.

  • Have each node involved in the transaction fetch the list of bricks on its own. One issue with this approach is that the list fetched by the peers could be inconsistent when multiple operations are happening.
  • Avoid brick validation altogether when using smart volume creation. We only hit such high volume limits when using smart volume creation with GCS. We are guaranteed that the bricks exist and are unique when using smart volume creation.

I'm currently leaning towards option 2.

@aravindavk @atinmu Do you have any thoughts?

For now the second option looks good. But to support a mix of manually provisioned volumes and smart volumes, we need to validate the brick paths to prevent use by multiple volumes.

@kshlm regarding option 2, if we end up with a bad bug introduced in the brick allocation logic in smart volumes, then this might be dangerous, wouldn't it? Or do you see that this can never happen?

Isn't there a way (I'm not sure) to actually query etcd in batches, like fetching bricks 1-300 in the first batch, 301-600 in the second and so on, and then consolidate?

@atinmu we can use ETCD pagination for querying
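
For reference, a minimal sketch of such a paginated fetch with clientv3, assuming the bricks are stored under a common key prefix; the prefix handling and the 300-key page size are illustrative only.

```go
package store

import (
	"context"

	"go.etcd.io/etcd/clientv3"
)

// pageSize is an illustrative batch size; it should be tuned to stay well
// under the gRPC message limits.
const pageSize = 300

// fetchAllBrickKeys pages through all keys under prefix instead of fetching
// them in one large range request.
func fetchAllBrickKeys(ctx context.Context, c *clientv3.Client, prefix string) ([]string, error) {
	var keys []string
	key := prefix
	end := clientv3.GetPrefixRangeEnd(prefix)

	for {
		resp, err := c.Get(ctx, key,
			clientv3.WithRange(end),
			clientv3.WithLimit(pageSize),
		)
		if err != nil {
			return nil, err
		}
		for _, kv := range resp.Kvs {
			keys = append(keys, string(kv.Key))
		}
		if !resp.More || len(resp.Kvs) == 0 {
			return keys, nil
		}
		// Start the next page just after the last key returned.
		key = string(resp.Kvs[len(resp.Kvs)-1].Key) + "\x00"
	}
}
```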

kshlm commented

For now the second option looks good. But to support a mix of manually provisioned volumes and smart volumes, we need to validate the brick paths to prevent use by multiple volumes.

I'm not proposing that we remove brick validation completely. The validation would be skipped when the request is detected to be a smart vol request. A normal request would include the validation step.
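
A hypothetical sketch of that split; the request fields and helper below are made up for illustration and are not the actual GD2 types.

```go
package volume

import "fmt"

// volCreateReq is an illustrative request type; the real GD2 request and the
// way a smart-vol request is detected may differ.
type volCreateReq struct {
	Name            string
	Bricks          []string
	AutoProvisioned bool // true when the bricksplanner generated the bricks
}

// validateBricks skips the expensive all-bricks-in-cluster check for
// auto-provisioned (smart volume) requests and runs it only for manual ones.
func validateBricks(req *volCreateReq, fetchAllBricks func() ([]string, error)) error {
	if req.AutoProvisioned {
		return nil // planner-generated bricks are assumed fresh and unique
	}

	all, err := fetchAllBricks() // only manual requests pay this cost
	if err != nil {
		return err
	}

	inUse := make(map[string]struct{}, len(all))
	for _, b := range all {
		inUse[b] = struct{}{}
	}
	for _, b := range req.Bricks {
		if _, ok := inUse[b]; ok {
			return fmt.Errorf("brick %s is already used by another volume", b)
		}
	}
	return nil
}
```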

@kshlm regarding option 2, if we end up with a bad bug introduced in the brick allocation logic in smart volumes, then this might be dangerous, wouldn't it? Or do you see that this can never happen?

The brick paths generated by the bricksplanner should be unique, since they are based on the volume name. There may be issues with parallel requests for the same volume, but the locking should help avoid this. We could make this even more robust by using the volume uuid and brick uuid, which should give unique brick paths (a sketch follows below).
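
For illustration, the generated layout could look something like this; the exact directory structure is an assumption, not the current bricksplanner output.

```go
package planner

import "path/filepath"

// brickPath builds a brick directory name from the volume and brick UUIDs so
// that generated paths stay unique even across parallel requests that reuse
// the same volume name.
func brickPath(baseDir, volName, volID, brickID string) string {
	// e.g. /bricks/gv0-<volume-uuid>/brick-<brick-uuid>
	return filepath.Join(baseDir, volName+"-"+volID, "brick-"+brickID)
}
```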

Isn't there a way (I'm not sure) to actually query etcd in batches, like fetching bricks 1-300 in the first batch, 301-600 in the second and so on, and then consolidate?

That only solves part of the problem here. The initial fetch performed by the txn leader before setting 'all-bricks-in-cluster' would be helped by pagination, as @Madhu-1 suggested. The other problem is that once all the bricks are fetched, we're setting a single large value within the txn context. Pagination will not help with that.

@kshlm regarding option 2, if we end up with a bad bug introduced in the brick allocation logic in smart volumes, then this might be dangerous, wouldn't it? Or do you see that this can never happen?

All generated brick paths will contain the volume name, so collisions will not happen. But collisions can happen if manually provisioned volumes are allowed.

Isn't there a way (I'm not sure) to actually query etcd in batches, like fetching bricks 1-300 in the first batch, 301-600 in the second and so on, and then consolidate?

@atinmu we can use ETCD pagination for querying

I think the etcd response size is not the problem. We should make brick validation a single-node transaction step and avoid setting the response again in the context for the next transaction.

#1478 fixes it.

Fixed through #1478