Store examples in a more compact format
kblomdahl opened this issue · 1 comments
Examples are currently stored in a fairly verbose format, which takes up a lot of disk space for any non-trivial dataset. The current format uses 2 bytes per floating point number:
    struct Example {
        features: [f16; 12996],
        value: f16,
        policy: [f16; 362],
    }
The total size of such an example is 26718 bytes, i.e. (12996 + 1 + 362) × 2. A few observations can be made about this representation:
- `features` contains only `0` or `1`, so it could be compacted to a bitset.
- `value` is always `-1` or `+1`, so it could be compacted to a single bit.
- `policy` contains true floating point numbers between `0` and `1`, but could be quantized to a `u8`.
Such a structure would have a total size of 1987 bytes, a reduction of 13.45x. That is a substantial saving, but it comes at the cost of extra runtime work to decode each example in TensorFlow. Since TensorFlow is GPU bound at the moment, this should not be a problem.
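To make the 1987-byte figure concrete, here is a minimal sketch of one layout that reaches it. Several things are assumptions that the description above does not pin down: the names `CompactExample`, `pack` and `unpack` are hypothetical, the `value` bit is placed in the spare bits of the feature bitset (12996 + 1 bits round up to 1625 bytes), the policy is quantized linearly with a scale of 255, and `f32` stands in for the half-precision inputs.

```rust
// Sketch only: all names here are hypothetical, and `f32` stands in for
// the half-precision floats of the original format.

const NUM_FEATURES: usize = 12996;
const NUM_POLICY: usize = 362;
// 12996 feature bits + 1 value bit = 12997 bits, which rounds up to 1625 bytes.
const NUM_BIT_BYTES: usize = (NUM_FEATURES + 1 + 7) / 8;

pub struct CompactExample {
    /// Bits 0..12996 hold the features, bit 12996 holds the value (1 means +1).
    pub bits: [u8; NUM_BIT_BYTES],
    /// Policy probabilities, assumed to be quantized linearly to 0..=255.
    pub policy: [u8; NUM_POLICY],
}

pub fn pack(features: &[f32], value: f32, policy: &[f32]) -> CompactExample {
    assert_eq!(features.len(), NUM_FEATURES);
    assert_eq!(policy.len(), NUM_POLICY);

    // Set one bit per non-zero feature.
    let mut bits = [0u8; NUM_BIT_BYTES];
    for (i, &f) in features.iter().enumerate() {
        if f != 0.0 {
            bits[i / 8] |= 1u8 << (i % 8);
        }
    }
    // Store the sign of the value in the first spare bit after the features.
    if value > 0.0 {
        bits[NUM_FEATURES / 8] |= 1u8 << (NUM_FEATURES % 8);
    }

    // Quantize each probability to the nearest of 256 evenly spaced levels.
    let mut quantized = [0u8; NUM_POLICY];
    for (q, &p) in quantized.iter_mut().zip(policy) {
        *q = (p.clamp(0.0, 1.0) * 255.0).round() as u8;
    }

    CompactExample { bits, policy: quantized }
}

pub fn unpack(example: &CompactExample) -> (Vec<f32>, f32, Vec<f32>) {
    let bit = |i: usize| (example.bits[i / 8] >> (i % 8)) & 1;

    let features: Vec<f32> = (0..NUM_FEATURES).map(|i| bit(i) as f32).collect();
    let value: f32 = if bit(NUM_FEATURES) == 1 { 1.0 } else { -1.0 };
    let policy: Vec<f32> = example.policy.iter().map(|&q| q as f32 / 255.0).collect();

    (features, value, policy)
}
```

With this layout `std::mem::size_of::<CompactExample>()` is 1625 + 362 = 1987 bytes, and 26718 / 1987 ≈ 13.45. The features and the value round-trip exactly; only the policy is approximated.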
Any features we have already computed can be automatically converted to the new format with almost no loss of information. The only information lost is in the policy due to the quantization, but that should be minimal.
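As a rough check on how small the policy loss is, the sketch below measures the round-trip error of a `u8` quantization. It assumes a linear scheme with a scale of 255, which is not specified above, and the helper names are hypothetical. Under that assumption the error per entry is bounded by half a quantization step, 1/510 ≈ 0.002.

```rust
// Hypothetical helpers; the linear scheme with a scale of 255 is an assumption.

pub fn quantize(p: f32) -> u8 {
    (p.clamp(0.0, 1.0) * 255.0).round() as u8
}

pub fn dequantize(q: u8) -> f32 {
    q as f32 / 255.0
}

/// Largest round-trip error over an evenly spaced sweep of [0, 1].
pub fn max_round_trip_error(steps: u32) -> f32 {
    (0..=steps)
        .map(|i| {
            let p = i as f32 / steps as f32;
            (dequantize(quantize(p)) - p).abs()
        })
        .fold(0.0, f32::max)
}
```

Because each entry is rounded independently, the decoded policy will generally not sum to exactly 1, so a consumer that relies on that may want to renormalize after decoding.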
I am probably not going to commit it since it is a one-time thing, but I used the following script to convert all records in the database from the previous format to the suggested one: