How tensorflow stores data
yorkie opened this issue · 2 comments
In tensorflow's world, a low-level implementor should split all types into two groups: non-TF_String and TF_String.
This is because tensorflow does not treat TF_String the way traditional languages treat strings: it is a single element of variable length. For example, the string "yorkie is so cute" is one TF_String element, and likewise the single character "x" is also one TF_String element.
What's a tensor
Let's start with the keyword "tensor". A tensor is a data structure, namely a multidimensional array. An ordinary array, the vector in mathematics, is just a special case of the tensor structure.
For example, a tensor could be represented as:
100
[ 1, 2, 3 ]
[ [ 2, 3 ], [ 5, 7 ] ]
[ [ [ 1 ], [ 2 ], [ 3 ], [ 4 ] ] ]
In the above example, every line is a tensor. The first one is called a scalar tensor, and the next ones are a vector, a matrix and an n-tensor. Here n is the number of dimensions of the tensor or array:
- scalar: n = 0
- vector: n = 1
- matrix: n = 2
More formally, let me introduce the concept of the shape array and show how it relates to the parameter n. Every tensor's structure is described by a vector called its shape, and the number n is the length of this vector.
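To make the relation between a tensor, its shape and n concrete, here is a minimal Python sketch. The helper name infer_shape is hypothetical and not part of tensorflow; it assumes the tensor is written as rectangular nested lists, like the four examples above.

```python
def infer_shape(value):
    """Return the shape of a nested list, assuming every level is rectangular."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        if not value:
            break
        value = value[0]  # descend into the first sub-tensor
    return shape


examples = [100, [1, 2, 3], [[2, 3], [5, 7]], [[[1], [2], [3], [4]]]]
for tensor in examples:
    shape = infer_shape(tensor)
    # n is simply the length of the shape vector
    print("shape =", shape, "n =", len(shape))
```

The scalar yields an empty shape, so n = 0, and the last example yields a shape of length 3.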
From the first position to the last, every element of the shape gives the size of the tensor in that dimension. The shape [5] describes a vector which owns 5 elements. The shape [3, 2] describes a matrix which owns 3 sub-vectors, each owning 2 scalar elements, so the total number of elements is 3 * 2 = 6.
As a more complex example, the shape [100, 99, 5, 5] represents a tensor which owns 100 elements, each of which is a sub-tensor of shape [99, 5, 5]. The total number of scalar elements in this tensor is 100 * 99 * 5 * 5.
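The element-count arithmetic can be checked with a few lines of Python. This is only a sketch: num_elements is a hypothetical helper name, and math.prod requires Python 3.8 or newer.

```python
import math


def num_elements(shape):
    """Total number of scalar elements: the product of all dimension sizes."""
    return math.prod(shape)  # the product of an empty shape is 1, i.e. a scalar


print(num_elements([5]))               # 5
print(num_elements([3, 2]))            # 6
print(num_elements([100, 99, 5, 5]))   # 247500
# Dropping the first dimension gives the shape of each sub-tensor.
print(num_elements([100, 99, 5, 5][1:]))  # 2475
```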
Store a tensor
In the last section, we covered what a tensor is and how to represent it in a human-readable way. Next, we will take a look at how to store a tensor in a machine.
{
"type": "int8/int16/int32/float16/float32/string",
"shape": <vector>,
"buffer": <....>
}
The above structure has 3 fields. All the real data is put in the field buffer, which is a fixed-size array in storage. The fields type and shape can be seen as the metadata of the buffer:
- type describes the element size
- shape describes how to encode and decode the buffer
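As an illustration of the three fields, the following Python sketch packs a nested list into a flat buffer and rebuilds it from type and shape. It is not tensorflow's implementation: the helper names (FORMATS, flatten, encode, decode) are hypothetical, and row-major order and little-endian byte order are assumptions made for the example.

```python
import struct

# struct format characters for the fixed-size types named above.
FORMATS = {"int8": "b", "int16": "h", "int32": "i", "float16": "e", "float32": "f"}


def flatten(value):
    """Yield the scalars of a nested list in row-major order."""
    if isinstance(value, list):
        for item in value:
            yield from flatten(item)
    else:
        yield value


def encode(dtype, shape, value):
    """Pack the scalars into one contiguous buffer; type and shape are metadata."""
    flat = list(flatten(value))
    buffer = struct.pack("<%d%s" % (len(flat), FORMATS[dtype]), *flat)
    return {"type": dtype, "shape": shape, "buffer": buffer}


def decode(tensor):
    """Rebuild the nested list from the buffer using type and shape."""
    fmt = FORMATS[tensor["type"]]
    count = len(tensor["buffer"]) // struct.calcsize("<" + fmt)
    flat = list(struct.unpack("<%d%s" % (count, fmt), tensor["buffer"]))
    # Regroup from the innermost dimension outward.
    for size in reversed(tensor["shape"][1:]):
        flat = [flat[i:i + size] for i in range(0, len(flat), size)]
    return flat if tensor["shape"] else flat[0]


stored = encode("int32", [2, 2], [[2, 3], [5, 7]])
print(len(stored["buffer"]))  # 4 elements * 4 bytes each
print(decode(stored))
```

The type decides how many bytes each element takes, and the shape alone is enough to regroup the flat buffer into sub-tensors.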
Oh, string has no fixed size
As written at the beginning of this note, tensorflow treats a string as a variable-length type, and that is a problem for the encoding method described so far.
To represent a tensor composed of strings correctly, the implementor should introduce another array, the offsets indices, to tell the encoder/decoder where every element starts and therefore how large it is. The offsets indices is a uint64 array, and its size is the number of elements. For example, if we have a string tensor:
"foobar", "yorkie is so cute"
The encoder writes the data as before; the only difference is that it also puts the start position of every string into the "offsets indices" vector. Correspondingly, the decoder reads the buffer by consulting the same "offsets indices" vector.
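The idea can be sketched in Python as follows. This is a simplified layout for illustration only and should not be assumed to be byte-compatible with tensorflow: the function names are hypothetical, the offsets are stored as little-endian uint64 values in front of the string bytes, and tensorflow's real format may differ, for instance by length-prefixing each string.

```python
import struct


def encode_strings(strings):
    """Write a uint64 offsets table followed by the concatenated string bytes."""
    chunks = [s.encode("utf-8") for s in strings]
    offsets = []
    position = 0
    for chunk in chunks:
        offsets.append(position)  # start position of this string in the data area
        position += len(chunk)
    table = struct.pack("<%dQ" % len(offsets), *offsets)
    return table + b"".join(chunks)


def decode_strings(buffer, count):
    """Read `count` offsets, then slice each string out of the data area."""
    table_size = 8 * count  # one uint64 per element
    offsets = struct.unpack("<%dQ" % count, buffer[:table_size])
    data = buffer[table_size:]
    # Each string ends where the next one starts; the last ends with the data.
    ends = list(offsets[1:]) + [len(data)]
    return [data[start:end].decode("utf-8") for start, end in zip(offsets, ends)]


encoded = encode_strings(["foobar", "yorkie is so cute"])
print(decode_strings(encoded, 2))
```

The number of elements passed to the decoder would come from the tensor's shape, which is why the offsets table can have exactly one entry per element.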
Here comes a C API story. In tensorflow's C API, there are 4 relevant functions:
TF_StringEncode
TF_StringDecode
TF_EncodeStrings
TF_DecodeStrings
Of these, TF_EncodeStrings and TF_DecodeStrings are not exposed. In my first attempt to implement string tensor encode/decode at yorkie/tensorflow-nodejs@ce922f7, I mistakenly assumed that the full implementation was inside TF_StringEncode and TF_StringDecode, and I got the following error:
Malformed TF_STRING tensor; element 0 out of range
After reading the functions TF_Tensor_DecodeStrings and TF_Tensor_EncodeStrings, I found that the "offsets indices" logic is defined there, and learned that TF_StringEncode/TF_StringDecode serve another purpose. I then re-implemented the "offsets indices" logic in JavaScript in my own implementation.
Summary
In this note, I shared how tensorflow stores data internally, along with a little story about its C API. If you are going to implement a tensorflow client, this might help you build the basis of your library.
I have implemented the encoding and decoding in tensorflow-nodejs. If you are interested in more details, take a look at the following links:
Wow, why is it all in English?
Haha, just practicing my English.