nci/gsky

Towards Polygon Drill in Constant Time


The WPS polygon drill computes statistics (e.g. averages) per band per file for a user-specified polygon region and time range. This is an IO-intensive operation: we need to load all the band data for the files in question, and there is no simple way to speed up that IO given the amount of data that has to be read.
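For concreteness, here is a minimal sketch of the kind of per-band reduction every request has to perform today, which is where the IO cost comes from. The function name, the pixel/mask representation, and the no-data handling are all illustrative assumptions, not the actual GSKY code path:

```go
package drill

import "math"

// bandMean is a hypothetical sketch of the per-band reduction the drill
// performs for every request today: it re-reads the band and scans every
// pixel against the polygon mask, which is why the IO cost is heavy.
func bandMean(pixels []float64, insidePolygon []bool, noData float64) (float64, int) {
	sum, count := 0.0, 0
	for i, v := range pixels {
		if !insidePolygon[i] || v == noData || math.IsNaN(v) {
			continue
		}
		sum += v
		count++
	}
	if count == 0 {
		return math.NaN(), 0
	}
	return sum / float64(count), count
}
```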

Having said that, we can exploit the fact that these statistics never change once the data files are written to disk. We can therefore compute/update them once during the crawling process and store them in MAS. At request time we then read the pre-computed statistics directly from MAS for each band and file requested by the user. This strategy replaces computing each band of each file from scratch with a single database lookup, which is O(1) in theory if we ignore the lookup time itself. In practice there is still latency pulling data from MAS, but it is usually a fraction of a second, which is too small to be a concern.
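A rough sketch of the lookup side, assuming a small record keyed by file, band and timestamp. The names (StatsKey, BandStats, StatsStore, Drill) are illustrative, and the in-memory map stands in for the real MAS query rather than the actual schema:

```go
package drill

// StatsKey and BandStats are illustrative, not the actual MAS schema: the
// crawler writes one small record per band per timestamp, and the WPS
// handler reads it back with a single lookup instead of re-reading rasters.
type StatsKey struct {
	FilePath  string
	Band      string
	Timestamp string
}

type BandStats struct {
	Min, Max, Mean, StdDev float64
}

// StatsStore stands in for the MAS-backed lookup; in GSKY this would be a
// query against the metadata database rather than an in-memory map.
type StatsStore map[StatsKey]BandStats

// Drill answers a polygon drill request from pre-computed statistics only:
// one lookup per (file, band, timestamp) and no raster IO at request time.
func (s StatsStore) Drill(keys []StatsKey) []BandStats {
	out := make([]BandStats, 0, len(keys))
	for _, k := range keys {
		if st, ok := s[k]; ok {
			out = append(out, st)
		}
	}
	return out
}
```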

We essentially shift the expensive IO operations to the crawling process. Crawling is always done offline, usually in a distributed HPC environment with much higher scalability than the online GSKY services and no concern for service response latency. We therefore expect this strategy to bring an overall benefit.

bje- commented

I thought the point of GSKY is that there would be no need to pre-compute and store anything about the data, as compute is now cheaper than disk? What are the storage requirements compared to the original data you are computing summary statistics for?

"I thought the point of GSKY is that there would be no need to pre-compute and store anything about the data, as compute is now cheaper than disk?" - this only works on paper. In reality, we need to compute things fast enough within client's patience to wait for the answer. Plus it is always the best to explore options to deliver answers in near real time because if there exists such a way to compute things in near
real time, competitors such as opendatacube will do it.

"What are the storage requirements compared to the original data you are computing summary statistics for?" - The storage requirements for these stats are very small. Apparently MAS stores all the timestamps per file. In this PR, we store additional 4 numbers (min, max, mean, and standard devivation) for each timestamp. For all the datasets I am aware of, I haven't seen any files with more than dozens of timestamps.