tarrade/proj_multilingual_text_classification

What is so special about the tfrecord format?

vluechinger opened this issue · 0 comments

Tfrecord is Tensorflow's own binary storage format for data files (including sequence data).
With this format, a better performance of up to 50% can be reached since the files require less space on the disk, less time to copy and can be read more efficiently. This is especially useful for memory intensive data and batch processing.

The data is stored as a sequence of binary strings. To convert any data type, follow these major steps (already implemented in src/...):

  1. convert all entries into either tf.train.BytesList, tf.train.FloatList, or tf.train.Int64List (and strings into bytes), then into tf.train.Feature (for each feature), create dictionary out of them
  2. store each sample in tf.train.Example or tf.train.SequenceExample
  3. serialize it
  4. use a tf.python_io.TFRecordWriter to write it to disk

Find more information in the Tensorflow documentation.