ECMWFCode4Earth/challenges_2020

Challenge #14 - Size, precision, speed - pick two

EsperanzaCuartero opened this issue · 11 comments


Stream 1 - Weather related software and applications

IMPORTANT: this challenge is eligible to apply for cloud credits from WEkEO.

Goal

Optimising GRIB and NetCDF data encoding methods that we use operationally for CAMS atmospheric composition data at ECMWF.

The work on this project could help us to reduce both the volume of data we store in our archive and the amount we disseminate to users, while preserving useful information.

[Image: data_encoding_puzzle]

Mentors and skills

  • Mentors: @juanjodd @miha-at-ecmwf
  • Skills required
    • Some knowledge of meteorological data formats (GRIB, NetCDF) and of the libraries used to decode and manipulate them (ecCodes, netCDF, CDO, NCO, ...)
    • Some knowledge about data encoding (data packing, accuracy, compression methods)
    • Knowledge of statistical metrics to understand and quantify errors due to different data encoding methods
    • Knowledge of a software library to compute and present the above metrics
    • Familiarity with a Chemical Transport Model (CTM) to be able to better appreciate non-linear aspects of the problem would be beneficial

Challenge description

Data and software

We plan to use the CAMS global real-time forecast dataset, the ecCodes and NetCDF libraries to test different configurations and estimate data encoding errors, and a software library to compute and present the results (possibly Python/numpy/matplotlib or R).

What is the current problem?

There is a lot of artificial precision in the current data encoding setup, and CAMS data takes a long time to archive and download.

What could be the solution?

We would like to remove artificial precision from the encoded fields without loss of useful information. At the same time we need to be conscious of operational constraints, so that the data encoding and decoding steps do not become prohibitively expensive.

The desired solution would be a combination of data encoding settings and processing steps to achieve this goal.

Ideas for the implementation

Some things to address: a more appropriate bitsPerValue, log packing, various data compression algorithms, bit grooming.
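To make the first of these concrete, a re-encoding experiment could start roughly like the sketch below (a minimal sketch assuming the ecCodes Python bindings; the input file name and the 12-bit choice are placeholders, not recommendations):

```python
# Sketch: repack each GRIB message with a lower bitsPerValue and check the error.
import numpy as np
import eccodes

with open("cams_field.grib", "rb") as fin, open("repacked.grib", "wb") as fout:
    while True:
        gid = eccodes.codes_grib_new_from_file(fin)
        if gid is None:
            break
        original = eccodes.codes_get_values(gid)    # decode at current precision
        eccodes.codes_set(gid, "bitsPerValue", 12)  # candidate packing depth
        eccodes.codes_set_values(gid, original)     # repack with the new setting
        repacked = eccodes.codes_get_values(gid)
        print("max abs error:", np.max(np.abs(repacked - original)))
        eccodes.codes_write(gid, fout)
        eccodes.codes_release(gid)
```

The same loop could be repeated over a range of bitsPerValue settings to map out the size-versus-error trade-off for each field.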

Relevant publications

Thanks for this challenge (great plot!). I've been working with posits previously, and compared them extensively to floats (spoiler: posits have a higher information content than floats). In order to understand the difference between posits and floats in a given application better, I've developed sonums, a maximum entropy number format that learns from data, and although it can be used for calculations too, I was wondering how much this could be applied to CAMS data.

As far as I understand, the current compression of CAMS GRIB data uses 16-bit quantization with equidistant steps (take min/max and divide the range in between into 2^16 bins). This would be a maximum-entropy encoding for uniformly distributed data, I assume. However, since you suggest using log packing, I guess CAMS data is actually not uniformly distributed. A lot of geophysical data that I've looked at is something like log-normal distributed, such that a float-like log encoding is actually sub-optimal, as it's too lossy in the middle and too precise for the tails. Posits' decimal precision has a pyramid shape and therefore approximates a log-normal distribution much better. Sonums, on the other hand, have the advantage that they maximise the entropy regardless of the distribution, but maybe they fall into the slow read & write corner? Technically, the encoding is the expensive step, as it requires training (basically just sorting the array, or a representative subset) and a binary tree search (with 16 steps for 16 bits) for every number. Decoding is much cheaper: just one lookup from a 2^16-element array per number.
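For reference, that equidistant packing is roughly the following in numpy (function names are made up, just to have something concrete to compare alternatives against):

```python
import numpy as np

def linear_pack(x, nbits=16):
    # map [min, max] onto 2^nbits equidistant bins (assumes max > min)
    xmin, xmax = float(x.min()), float(x.max())
    scale = (2**nbits - 1) / (xmax - xmin)
    return np.round((x - xmin) * scale).astype(np.uint16), xmin, scale

def linear_unpack(q, xmin, scale):
    return q.astype(np.float64) / scale + xmin

x = np.random.lognormal(size=100_000)            # stand-in for a CAMS field
q, xmin, scale = linear_pack(x)
err = np.abs(linear_unpack(q, xmin, scale) - x)
print(err.max())                                 # worst-case absolute error ~ 0.5/scale
```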

Anyway, let me know what you think!

Thanks for the message @milankl, I've updated the broken link.

The links that you've sent are fascinating; I've never heard of posits before. As a concept it seems perfectly suited to the problems we are trying to solve. What I'm not sure about is how easy it would be to bring these ideas into our ageing data formats and archives.

In preparation for the challenge we are trying to come up with an empirical test which would measure the loss of useful information due to a particular compression or encoding method. How could we go about using posit-encoded data as one of the inputs?
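One way such a test could start, just to make the idea concrete (the choice of norms below is illustrative, not a decided set): decode the original and the re-encoded field into arrays and compare them.

```python
import numpy as np

def encoding_errors(original, decoded):
    """Compare a re-encoded field against its original (both numpy arrays)."""
    diff = decoded - original
    return {
        "mean_abs": float(np.mean(np.abs(diff))),
        "max_abs": float(np.max(np.abs(diff))),
        "rms": float(np.sqrt(np.mean(diff**2))),
        # relative error; only meaningful for strictly positive fields
        # such as mixing ratios
        "max_rel": float(np.max(np.abs(diff / original))),
    }
```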

I don't think that the netCDF / HDF5 libraries currently support anything other than the standard types, Float32, Float64, Int32, Int64, etc. I don't know enough about GRIB to understand how the "max/min-range with UInt16" method (does it have an official name?) is implemented there. But I guess you can always reinterpret any posit bitstring as a float and simply write that to file. That shouldn't cause any issues with lossless compression afterwards, and even lossy compression might work, as posits basically only encode the exponent differently; chopping off some significant bits is the same for posits and floats. However, that means you need to know that a given file actually stores posits, not floats, and reinterpret them back.

But I guess such an implementation is actually beyond the scope of this project? My first idea would be to define norms for your three axes (volume, speed, information) and then throw some CAMS data into the different number formats and find a sweet spot in that 3D space. Once the information content is optimized with a conversion that is not too slow (whatever the limits are here, I don't know) we could investigate how good other compression techniques are that apply to data chunks (and as far as I know most of them don't actually care what the bits mean, i.e. it doesn't matter whether a number is posit, float, sonum or whatever encoded).
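For the volume and speed axes, even something as crude as the following would give first numbers for one candidate encoding (purely illustrative; zlib stands in for whichever lossless compressor is applied to the packed bytes, and random data is only a stand-in, so real fields should compress far better):

```python
import time
import zlib
import numpy as np

packed = np.random.randint(0, 2**16, size=1_000_000, dtype=np.uint16)  # stand-in

t0 = time.perf_counter()
compressed = zlib.compress(packed.tobytes(), level=6)
t1 = time.perf_counter()
zlib.decompress(compressed)
t2 = time.perf_counter()

print("volume:", len(compressed) / packed.nbytes)          # compressed fraction
print("speed :", t1 - t0, "s encode,", t2 - t1, "s decode")
```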

@miha-at-ecmwf

Thank you very much for your knowledgeable feedback @milankl. I have never heard of posits before either. You rightly guessed that CAMS data is not uniformly distributed. I would like to ask you a question about your previous comment:

A lot of geophysical data that I looked at is something like log-normal distributed, such that a float-like log encoding is actually sub-optimal, as it's too lossy in the middle and too precise for the tails

CAMS data is internally calculated by our "model" using Float32 or Float64. The model writes its output as 24-bit, grid_simple packed GRIB files (Miha and I are involved in the CAMS forecast production chain from this milestone onwards: the model output GRIB files). I believed that log-preprocessing the Float32/Float64 values before converting them to grid_simple (a 24-bit quantization) would improve the precision, but if I have understood your comment properly, this is actually sub-optimal, isn't it?

Yeah, log-preprocessing / packing is a good step and basically what floats do too. If you plot the precision of floats in a log-plot you get basically a flat line (check out Fig. 2a in here). That means if your data is exponentially distributed you get a uniform distribution within a certain range of numbers in such a plot. To minimize the compression error of such data floats are a pretty good idea, as you get a constant precision (Float16 for example has a decimal precision of 3.7, meaning that, worst-case, you get an error at the 3.7th decimal place). I assume that would also minimize your decimal error norm, which is the same as minimizing the L1 error of log-preprocessed data. How much that also minimizes your L2 error, I don't know, but I also don't know whether minimizing the L2 error is actually desirable.
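To illustrate, log packing is just the equidistant packing applied to log(x) (same made-up function style as in my earlier sketch): pack log(x) into equidistant bins and exponentiate on decode.

```python
import numpy as np

def log_pack(x, nbits=16):
    lx = np.log(x)                         # requires strictly positive values
    lmin, lmax = float(lx.min()), float(lx.max())
    scale = (2**nbits - 1) / (lmax - lmin)
    return np.round((lx - lmin) * scale).astype(np.uint16), lmin, scale

def log_unpack(q, lmin, scale):
    return np.exp(q.astype(np.float64) / scale + lmin)

x = np.random.lognormal(size=100_000)      # log-normal stand-in for CAMS data
q, lmin, scale = log_pack(x)
decimal_err = np.abs(np.log10(log_unpack(q, lmin, scale) / x))
print(decimal_err.max())
```

With a log-normal stand-in the maximum decimal error should stay roughly flat across the value range, which is the constant decimal precision described above, in contrast to the equidistant packing where the absolute error is constant instead.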

However, I don't know whether CAMS data is actually log-distributed?! Most variables that I've looked at so far (mostly velocities though) are rather log-normal distributed, sometimes more like a fairly skewed normal distribution when log-plotted. For those distributions (see Fig. 2b in the paper) floats (or log-preprocessing) have too much precision at the tails. In the end you probably want to maximise the information entropy, which is roughly identical to minimizing the decimal error.

And this is where posits come in, as they naturally have a tapered precision towards the tails (and therefore look like a pyramid). The question is basically how well the data distribution (Fig. 2b) matches the distribution of precision (Fig. 2a), and pyramids are just a better approximation to (most?) data distributions compared to rectangles. As far as I understand the simple GRIB packing (evenly distributed between min and max), it corresponds to steep triangles (called Int16 or Q6.10), which are fairly difficult to fit to some data, and you will always end up with more precision for larger numbers. This minimizes your L2 absolute error (L1 probably too?) but I doubt that this is actually what we want.

Sorry, was that clearer?

Thank you for your clear explanation @milankl and for sharing the paper above; I personally found it very interesting. Using your feedback as an excuse, I asked the CAMS model developers about the typical distribution of the model's output variables, and they confirmed that most of them tend to be log-normally distributed.

Also apologies for this late reply; it has been a busy time for me.

Hey @juanjodd, thanks for contacting the CAMS team. Interesting that chemical tracers also follow a log-normal distribution. This indeed sounds like a good case for compression with tapered precision (similar to posits). Apart from error norms, are there any other measures to quantify the retained information content after compression? E.g. should the compressed data be able to reproduce a given set of analyses that were previously performed at high precision?

Join us for our LIVE @ecmwf Summer of Weather Code Ask Me Anything session on 1 April 2020 at 2 pm (CET) (tomorrow).

Get info first-hand from the #ESoWC2020 organisers, mentors and former #ESoWC participants.
➡️Sign up

Only 4 days left to apply to be part of ECMWF Summer of Weather Code 2020.
Application deadline: Wednesday, 22 April 2020 at 23:59 (BST).
Submit your proposal here.

Just to let everyone, and @miha-at-ecmwf and @juanjodd in particular, know: I've just submitted a proposal for this challenge and also uploaded it to the Elefridge repository, in case anyone wants to look at it or has further ideas. Issues always welcome.