kblomdahl/dream-go

Store examples in a more compact format

kblomdahl opened this issue · 1 comment

Features are currently stored in a fairly verbose format that takes up a lot of disk space for any non-trivial dataset. The current format uses 2 bytes per floating-point number:

struct Example {
    features: [f16; 12996],
    value: f16,
    policy: [f16; 362]
}

The total size of such an example is 26718 bytes (12996 + 1 + 362 = 13359 values at 2 bytes each). A few observations can be made about this representation:

  • features contains only 0 or 1 so it could be compacted to a bitset.
  • value is always -1 or +1 so it could be compacted to a single bit.
  • policy contains true floating-point numbers between 0 and 1, but could be quantized to a u8.
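The three observations above can be sketched in plain Python. This is only an illustration under assumptions that the issue does not specify: the bits are packed most significant bit first, the value bit is stored directly after the feature bits, and the policy is scaled by 255 with round-to-nearest. The names `pack_bits`, `unpack_bits`, `encode_example` and `decode_example` are hypothetical.

```python
NUM_FEATURES = 12996  # feature count from the Example struct
NUM_POLICY = 362      # policy length from the Example struct


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, most significant bit first (assumed order)."""
    out = bytearray((len(bits) + 7) // 8)
    for i, bit in enumerate(bits):
        if bit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)


def unpack_bits(data, count):
    """Inverse of pack_bits: read back the first `count` bits."""
    return [(data[i // 8] >> (7 - i % 8)) & 1 for i in range(count)]


def encode_example(features, value, policy):
    """Encode one example as a feature bitset, a value bit, and u8 policy entries."""
    bits = [1 if f else 0 for f in features]
    bits.append(1 if value > 0 else 0)  # assumed: +1 is stored as 1, -1 as 0
    quantized = bytes(min(255, max(0, round(p * 255))) for p in policy)
    return pack_bits(bits) + quantized


def decode_example(record):
    """Decode a record produced by encode_example back into floats."""
    bit_bytes = (NUM_FEATURES + 1 + 7) // 8
    bits = unpack_bits(record[:bit_bytes], NUM_FEATURES + 1)
    features = [float(b) for b in bits[:NUM_FEATURES]]
    value = 1.0 if bits[NUM_FEATURES] else -1.0
    policy = [q / 255 for q in record[bit_bytes:]]
    return features, value, policy


# Round trip on a made-up example.
example_features = [float(i % 3 == 0) for i in range(NUM_FEATURES)]
example_policy = [1.0 / NUM_POLICY] * NUM_POLICY
record = encode_example(example_features, -1.0, example_policy)
decoded_features, decoded_value, decoded_policy = decode_example(record)
print(len(record))
```

The features and the value survive the round trip exactly; only the policy entries change, by at most half a quantization step each.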

Such a structure would have a total size of 1987 bytes (1625 bytes for the feature bits and the value bit, plus 362 bytes for the policy), a 13.45x reduction. That is a substantial saving, but it comes at the cost of extra runtime work to decode each example in TensorFlow. Since TensorFlow is currently GPU bound, this should not be a problem.
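The size figures can be double-checked with a few lines of arithmetic. The sketch below assumes the value bit shares the final, partly used byte of the feature bitset, which is the layout that gives 1987 bytes. The dataset of one million examples is a made-up number, used only to show the scale of the saving.

```python
import struct

NUM_FEATURES = 12996
NUM_POLICY = 362
F16_BYTES = struct.calcsize("<e")  # "e" is the half-precision float format

# Current format: every number is stored as an f16.
old_size = (NUM_FEATURES + 1 + NUM_POLICY) * F16_BYTES

# Proposed format: one bit per feature plus one bit for the value,
# rounded up to whole bytes, followed by one byte per policy entry.
bitset_bytes = (NUM_FEATURES + 1 + 7) // 8
new_size = bitset_bytes + NUM_POLICY

ratio = old_size / new_size

# Hypothetical dataset size, for illustration only.
num_examples = 1_000_000
print(f"old: {old_size} bytes, new: {new_size} bytes, ratio: {ratio:.2f}x")
print(f"{num_examples} examples: {old_size * num_examples / 1e9:.1f} GB "
      f"-> {new_size * num_examples / 1e9:.1f} GB")
```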


Any features we have already computed can be automatically converted to the new format with almost no loss of information. The only information lost is in the policy due to the quantization, but that should be minimal.
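To give a rough idea of how much information the policy loses, here is a small round-trip sketch. It assumes the policy is scaled by 255 and rounded to the nearest integer, which may not be what the conversion script actually does. The helper names `quantize` and `dequantize` are hypothetical, and the policy is random test data.

```python
import random


def quantize(p):
    """Map a probability in [0, 1] to a u8 using round-to-nearest (assumed scheme)."""
    return min(255, max(0, round(p * 255)))


def dequantize(q):
    """Map a u8 back to a probability in [0, 1]."""
    return q / 255


# Random normalized policy, used only as test data.
random.seed(0)
raw = [random.random() for _ in range(362)]
total = sum(raw)
policy = [x / total for x in raw]

restored = [dequantize(quantize(p)) for p in policy]
max_error = max(abs(a - b) for a, b in zip(policy, restored))

# With round-to-nearest the error per entry is at most half a step, 1/510.
print(f"max absolute error: {max_error:.6f} (bound: {1 / 510:.6f})")
print(f"sum after round trip: {sum(restored):.4f}")
```

Under this scheme each entry is off by at most 1/510, roughly 0.002. The restored entries need not sum to exactly 1, so a decoder may want to renormalize them.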

I am probably not going to commit it since it is a one-time thing, but I used the following script to convert all records in the database from the previous format to the suggested one:

big2small.py.gz