Large WMS or WCS queries don't scale beyond a single gRPC worker node
The load balancer at https://github.com/nci/gsky/blob/master/ows.go#L224 randomly picks a worker node during the initialization of the WMS or WCS pipeline, so the pipeline sticks to that single worker node for its entire lifetime. When tile_indexer returns a large volume of intersected files, https://github.com/nci/gsky/blob/master/processor/tile_grpc.go#L46 becomes a bottleneck because we only have a connection to a single worker node at that point. A large volume of intersected files can come either from a large requested polygon or from aggregation over a long period of time (e.g. cloud removal using the past 3 months of data).
In order to scale beyond a single worker node, we need the load balancing to happen within tile_grpc.go before line 46, so that each file request can be dispatched to a different worker node; see the sketch below.
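A minimal sketch of the idea, assuming a hypothetical `WorkerPool` type and placeholder worker addresses. In the real fix the pool would hold the gRPC client connections already established at startup, and each granule request in tile_grpc.go would call `Pick()` instead of reusing the single connection chosen at pipeline initialization:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// WorkerPool holds one entry per worker node. Here the entries are plain
// address strings; in the actual code they would be gRPC connections.
type WorkerPool struct {
	workers []string
	next    uint64 // round-robin counter
}

// Pick returns the next worker in round-robin order; safe for concurrent use.
func (p *WorkerPool) Pick() string {
	i := atomic.AddUint64(&p.next, 1)
	return p.workers[i%uint64(len(p.workers))]
}

func main() {
	// Placeholder addresses for the two worker nodes in the experiment.
	pool := &WorkerPool{workers: []string{"worker-a:6000", "worker-b:6000"}}

	// Instead of binding the whole pipeline to one worker at initialization,
	// each intersected file is dispatched to the next worker in the pool.
	files := []string{"granule_001.nc", "granule_002.nc", "granule_003.nc", "granule_004.nc"}
	for _, f := range files {
		fmt.Printf("dispatch %s -> %s\n", f, pool.Pick())
	}
}
```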
An extreme example of a large polygon request is -179,-80,180,80, which covers the entire world:
http://gsky ip address/ows/geoglam?SERVICE=WCS&service=WCS&crs=EPSG:4326&format=GeoTIFF&request=GetCoverage&height=500&width=1000&version=1.0.0&bbox=-179,-80,180,80&coverage=global:c6:monthly_frac_cover&time=2018-03-01T00:00:00.000Z
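For reference, a small sketch (with the hypothetical host name `gsky-host` standing in for the real server address) that builds the same GetCoverage query string programmatically; all parameter values are copied from the URL above:

```go
package main

import (
	"fmt"
	"net/url"
)

func main() {
	q := url.Values{}
	q.Set("SERVICE", "WCS")
	q.Set("service", "WCS")
	q.Set("crs", "EPSG:4326")
	q.Set("format", "GeoTIFF")
	q.Set("request", "GetCoverage")
	q.Set("height", "500")
	q.Set("width", "1000")
	q.Set("version", "1.0.0")
	q.Set("bbox", "-179,-80,180,80")
	q.Set("coverage", "global:c6:monthly_frac_cover")
	q.Set("time", "2018-03-01T00:00:00.000Z")

	// Encode() sorts the parameters, but the resulting query is equivalent.
	fmt.Println("http://gsky-host/ows/geoglam?" + q.Encode())
}
```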
The above query results in processing 271 fractional cover files with 3 bands each. The experiment uses two worker nodes, each running identical worker code and an identical number of worker processes. With some quick-and-dirty experimentation, I obtained the following benchmark data:
5 runs with the original code (baseline, in seconds):
7.281
7.235
7.179
7.201
7.212
5 runs with the proposed load balancer within tile_grpc.go before line 46, with the two worker nodes load balanced in round-robin fashion (in seconds):
3.986
3.862
3.845
3.887
3.862
The mean runtime drops from about 7.22 s to about 3.89 s, a speedup of roughly 1.86x, which is close to the linear 2x expected from two worker nodes. The code used for this experiment contains hard-coded worker addresses and other hacks for a quick proof of concept; a proper implementation is required.