yorkie/me

How TensorFlow stores data

yorkie opened this issue · 2 comments

In TensorFlow's world, a low-level implementor has to split all element types into two groups: TF_STRING and non-TF_STRING.

The reason is that TensorFlow does not treat TF_STRING the way traditional languages treat strings: each TF_STRING value is a single element of variable length. For example, the string "yorkie is so cute" is one TF_STRING element, and the single character "x" is one TF_STRING element as well.

What's a tensor

Let's start with the keyword "tensor". A tensor is a data structure that can be viewed as a multidimensional array. An ordinary array, a vector in mathematical terms, is simply a special case of a tensor.

For example, a tensor could be represented as:

100
[ 1, 2, 3 ]
[ [ 2, 3 ], [ 5, 7 ] ]
[ [ [ 1 ], [ 2 ], [ 3 ], [ 4 ] ] ]

In the above example, every line is a tensor. The first one is called a scalar, followed by a vector, a matrix and an n-tensor (here n = 3). The number n is the number of dimensions of the tensor or array:

  • scalar: n = 0
  • vector: n = 1
  • matrix: n = 2
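As a small illustration of the list above, the number n can be read off as the nesting depth of a value. This is only a sketch in Python: the helper name `rank` is made up for this note, and it assumes regularly nested, non-empty lists.

```python
def rank(value):
    """Return the nesting depth of a regularly nested list (hypothetical helper).

    A bare number has rank 0, a flat list rank 1, and so on.
    """
    depth = 0
    while isinstance(value, list):
        depth += 1
        value = value[0]  # assumes every level is non-empty and regular
    return depth


# The four example tensors from this note.
examples = [100, [1, 2, 3], [[2, 3], [5, 7]], [[[1], [2], [3], [4]]]]
ranks = [rank(e) for e in examples]
print(ranks)  # [0, 1, 2, 3]
```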

More formally, let me introduce the concept of the shape array and show how it relates to the parameter n. The structure of every tensor is described by a vector called its shape, and n is the length of that vector.

From the first position to the last, each element of the shape gives the size of the corresponding dimension. The shape [5] describes a vector of 5 elements; [3, 2] describes a matrix made of 3 sub-vectors, each holding 2 scalar elements, so the total number of elements is 3 * 2 = 6.

As a more complex example, the shape [100, 99, 5, 5] represents a tensor that holds 100 sub-tensors, each of shape [99, 5, 5]. The total number of elements in this tensor is 100 * 99 * 5 * 5 = 247,500.
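The relation between a shape and the element count can be sketched in a few lines of Python. The helper names `infer_shape` and `num_elements` are hypothetical, and the sketch assumes regularly nested, non-empty lists.

```python
from math import prod


def infer_shape(value):
    """Collect the size of each dimension, outermost first (hypothetical helper)."""
    shape = []
    while isinstance(value, list):
        shape.append(len(value))
        value = value[0]  # assumes regular, non-empty nesting
    return shape


def num_elements(shape):
    """Total number of scalar elements; prod([]) is 1, so a scalar holds one element."""
    return prod(shape)


matrix_shape = infer_shape([[2, 3], [5, 7]])
print(matrix_shape)                       # [2, 2]
print(num_elements([3, 2]))               # 6
print(num_elements([100, 99, 5, 5]))      # 247500
```

Note that the length of the inferred shape is exactly the number n discussed earlier.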

Store a tensor

In the last section, we covered what a tensor is and how to represent it in a human-readable way. Next, we will look at how to store a tensor in a machine.

{
  "type": "int8/int16/int32/float16/float32/string",
  "shape": <vector>,
  "buffer": <....>
}

The above structure has 3 fields. All the actual data goes into the buffer field, which is a fixed-size flat array in storage. The type and shape fields can be seen as the metadata of the buffer:

  • type describes the element type, and therefore the size of each element
  • shape describes how to encode values into the buffer and decode them back
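To make the roles of the three fields concrete, here is a minimal Python sketch of encoding and decoding a fixed-size tensor. It is an illustration under assumptions, not TensorFlow's own code: the function names, the `FORMATS` table and the little-endian byte order are all choices made for this example, and every dimension is assumed to be non-zero.

```python
import struct

# Assumed mapping from type name to a struct format character.
FORMATS = {"int8": "b", "int16": "h", "int32": "i", "float16": "e", "float32": "f"}


def flatten(value):
    """Flatten a nested list into a flat list of scalars, row by row."""
    if isinstance(value, list):
        return [x for item in value for x in flatten(item)]
    return [value]


def encode(dtype, shape, value):
    """Pack the scalars into one flat buffer; type and shape travel as metadata."""
    flat = flatten(value)
    count = 1
    for dim in shape:
        count *= dim
    if len(flat) != count:
        raise ValueError("value does not match shape")
    buffer = struct.pack("<%d%s" % (count, FORMATS[dtype]), *flat)
    return {"type": dtype, "shape": shape, "buffer": buffer}


def decode(tensor):
    """Use type to split the buffer into elements and shape to rebuild the nesting."""
    fmt = FORMATS[tensor["type"]]
    count = len(tensor["buffer"]) // struct.calcsize("<" + fmt)
    flat = list(struct.unpack("<%d%s" % (count, fmt), tensor["buffer"]))
    # Regroup from the innermost dimension outwards.
    for dim in reversed(tensor["shape"][1:]):
        flat = [flat[i:i + dim] for i in range(0, len(flat), dim)]
    return flat if tensor["shape"] else flat[0]


encoded = encode("int32", [2, 2], [[2, 3], [5, 7]])
print(len(encoded["buffer"]))  # 16 bytes: 4 elements of 4 bytes each
print(decode(encoded))         # [[2, 3], [5, 7]]
```

The buffer itself carries no structure at all; without type and shape it is just a run of bytes.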

Oh, string has no fixed size

As noted at the beginning, TensorFlow treats a string as a variable-length type, and that is a problem for the encoding method described so far: the element size can no longer be derived from the type.

To represent a tensor of strings correctly, the implementor has to introduce another array, the offsets indices, which tells the encoder/decoder where every element starts. The offsets indices form a uint64 array whose length is the number of elements. For example, suppose we have a string tensor:

"foobar", "yorkie is so cute"

The encoder writes the elements into the buffer as before; the only difference is that it also records the start position of every string in the "offsets indices" vector. Correspondingly, the decoder reads the buffer back by consulting this vector.

There is also a C API story here. TensorFlow's C API has 4 relevant functions:
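The scheme can be sketched in Python as follows. This is a hedged illustration, not the code from tensorflow-nodejs or TensorFlow: the function names are hypothetical, the little-endian byte order is an assumption, and the varint length prefix in front of each string reflects my understanding of the legacy C API layout rather than anything stated in this note. Offsets are taken relative to the start of the data section that follows the offset table.

```python
import struct


def encode_varint(n):
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def decode_varint(buf, pos):
    """Decode a varint at buf[pos]; return (value, position after the varint)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def encode_strings(strings):
    """Layout assumed here: [uint64 offset per element][length-prefixed strings]."""
    offsets, data = [], bytearray()
    for s in strings:
        raw = s.encode("utf-8")
        offsets.append(len(data))  # start position inside the data section
        data += encode_varint(len(raw)) + raw
    table = struct.pack("<%dQ" % len(offsets), *offsets)
    return table + bytes(data)


def decode_strings(buffer, count):
    """Read the offset table first, then each string at its recorded start."""
    table_size = 8 * count
    offsets = struct.unpack("<%dQ" % count, buffer[:table_size])
    data = buffer[table_size:]
    result = []
    for i, start in enumerate(offsets):
        if start >= len(data):
            raise ValueError("element %d out of range" % i)
        length, pos = decode_varint(data, start)
        result.append(data[pos:pos + length].decode("utf-8"))
    return result


buf = encode_strings(["foobar", "yorkie is so cute"])
print(struct.unpack("<2Q", buf[:16]))  # (0, 7)
print(decode_strings(buf, 2))          # ['foobar', 'yorkie is so cute']
```

The number of elements is not stored in the buffer; in this sketch the decoder gets it from the caller, just as a real decoder would get it from the shape.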

  • TF_StringEncode
  • TF_StringDecode
  • TF_EncodeStrings
  • TF_DecodeStrings

Of these, TF_EncodeStrings and TF_DecodeStrings are not exposed. In my first attempt to implement string tensor encoding/decoding at yorkie/tensorflow-nodejs@ce922f7, I mistakenly assumed that the full implementation lived inside TF_StringEncode and TF_StringDecode, and I got the following error:

Malformed TF_STRING tensor; element 0 out of range

After reading the functions TF_Tensor_DecodeStrings and TF_Tensor_EncodeStrings, I found that the "offsets indices" logic is defined there, and learned that TF_StringEncode/TF_StringDecode serve another purpose. I then re-implemented the "offsets indices" logic in JavaScript in my own library.
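One plausible way to run into an error of this kind can be shown with a short Python sketch. Everything here is an assumption made for illustration: the `check_offsets` helper is hypothetical, the error text only imitates the real message, and the single-string encoding is assumed to be a length byte followed by the contents. If such a buffer is handed over as a whole string tensor, its first 8 bytes get read as an offset and point far outside the data.

```python
import struct


def check_offsets(buffer, count):
    """Validate the uint64 offset table at the front of a string tensor buffer."""
    table_size = 8 * count
    if len(buffer) < table_size:
        raise ValueError("Malformed string tensor; offset table is truncated")
    data_size = len(buffer) - table_size
    for i, start in enumerate(struct.unpack_from("<%dQ" % count, buffer)):
        if start >= data_size:
            raise ValueError("Malformed string tensor; element %d out of range" % i)


raw = b"yorkie is so cute"
single = bytes([len(raw)]) + raw  # one encoded string, but no offset table

try:
    check_offsets(single, 1)
    error = None
except ValueError as exc:
    error = str(exc)

print(error)  # Malformed string tensor; element 0 out of range
```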

Summary

In this note, I shared how TensorFlow stores data internally, along with a little story about its C API. If you are going to implement a TensorFlow client, this might help you build the foundation of your library.

I have implemented the encoding and decoding in tensorflow-nodejs. If you are interested in more details, take a look at the following links:

CruxF commented

Wow, why is it all in English?

Haha, just practicing my English.