A parser for different tensor file formats

TensorFormats

TensorFormats is a library for the D programming language that reads different tensor file formats through a common interface. These file formats are used to store machine learning models, such as large language models.

Features

  • Read tensors from different file formats using the same interface
  • The file can be mapped into memory with mmap
  • A file can be read in parts, so less memory is needed

Limitations

  • Only reading is supported, not writing
  • No alignment of the tensor data is guaranteed
  • Additional format-specific metadata is not available
  • Quantised formats are not supported yet

Usage

The example program dumptensors.d prints the tensors in a file:

dub tensorformats:dumptensors -- tests/data/tensors/tensor-dims.safetensors

0x00000000 0x00000000 buffer=0 dim0 float_ shape= stride=[]
  single value = 4
0x00000004 0x00000000 buffer=1 dim1 float_ shape=5 stride=[1]
  [0, 1, 2, 3, 4]
0x00000018 0x00000000 buffer=2 dim2 float_ shape=2x4 stride=[4, 1]
  [[0, 1, 2, 3],
   [10, 11, 12, 13]]
0x00000038 0x00000000 buffer=3 dim3 float_ shape=3x2x3 stride=[6, 3, 1]
  [[[0, 1, 2],
    [10, 11, 12]],
   [[100, 101, 102],
    [110, 111, 112]],
   [[200, 201, 202],
    [210, 211, 212]]]
0x00000080 0x00000000 buffer=4 dim4 float_ shape=2x3x2x2 stride=[12, 4, 2, 1]
  [[[[0, 1],
     [10, 11]],
    [[100, 101],
     [110, 111]],
    [[200, 201],
     [210, 211]]],
   [[[1000, 1001],
     [1010, 1011]],
    [[1100, 1101],
     [1110, 1111]],
    [[1200, 1201],
     [1210, 1211]]]]
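
The stride values in this output are consistent with a contiguous row-major layout, in which the last dimension has stride 1 and every other stride is the product of the dimensions to its right. The following is a minimal sketch of that arithmetic, not code from TensorFormats: `contiguousStrides` and `elementCount` are hypothetical helpers, and it assumes that the first column of the dump is a byte offset, that elements are 4-byte floats, and that the tensors are stored back to back.

```d
import std.stdio : writefln;

/// Row-major (C-contiguous) strides, counted in elements, for a shape.
/// Hypothetical helper for illustration, not part of TensorFormats.
size_t[] contiguousStrides(const size_t[] shape)
{
    auto strides = new size_t[shape.length];
    size_t step = 1;
    // Walk from the innermost dimension outwards.
    foreach_reverse (i, dim; shape)
    {
        strides[i] = step;
        step *= dim;
    }
    return strides;
}

/// Number of elements described by a shape; an empty shape is a single value.
size_t elementCount(const size_t[] shape)
{
    size_t count = 1;
    foreach (dim; shape)
        count *= dim;
    return count;
}

void main()
{
    // Shapes of dim0 .. dim4 as printed by dumptensors.
    size_t[] dim0 = [];
    size_t[] dim1 = [5];
    size_t[] dim2 = [2, 4];
    size_t[] dim3 = [3, 2, 3];
    size_t[] dim4 = [2, 3, 2, 2];

    size_t offset = 0; // assumed byte offset of the current tensor
    foreach (shape; [dim0, dim1, dim2, dim3, dim4])
    {
        writefln("0x%08x shape=%s stride=%s", offset, shape, contiguousStrides(shape));
        offset += elementCount(shape) * float.sizeof;
    }
}
```

Under these assumptions the computed strides and offsets agree with the stride lists and the first column shown above. Tensors that are not stored contiguously would need the strides from the file's metadata instead.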

Here is a short example of how tensors can be read from a file:

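import tensorformats.tensorreader, tensorformats.storage;

// `filename` is the path of the tensor file to read.
auto storage = new FileStorage(filename);
// readTensors detects the file format automatically.
TensorReader reader = readTensors(storage);
// Process the file one buffer at a time.
while (reader.readNextBuffer())
{
    // Read the data of the whole current buffer.
    auto dataBuffer = reader.read(reader.bufferSize(), ReadFlags.none);
    // A buffer can contain multiple tensors.
    foreach (tensor; reader.tensorsInBuffer)
    {
        // Use metadata in `tensor` with data in `dataBuffer`
    }
}
storage.close();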

The file is split into buffers, and every buffer can contain multiple tensors. The PyTorch format allows overlapping tensors in the same buffer. The metadata of each tensor, such as the element type, shape and strides shown in the dump above, is needed to interpret the data.
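
To illustrate how strides map a multi-dimensional index onto the flat data, here is a minimal sketch using the `dim2` tensor from the dump above. It does not use the TensorFormats API: `flatOffset` is a hypothetical helper, and the sketch assumes that strides are counted in elements (as the dump suggests) and that the tensor's data is already available as a `float[]` starting at the tensor's first element.

```d
import std.stdio : writeln;

/// Flat element offset of `index` for the given strides (in elements).
/// Hypothetical helper for illustration, not part of TensorFormats.
size_t flatOffset(const size_t[] index, const size_t[] strides)
{
    assert(index.length == strides.length, "index and strides must have the same rank");
    size_t offset = 0;
    foreach (i, idx; index)
        offset += idx * strides[i];
    return offset;
}

void main()
{
    // Data of the `dim2` tensor from the dump: shape 2x4, stride [4, 1].
    float[] data = [0, 1, 2, 3, 10, 11, 12, 13];
    size_t[] strides = [4, 1];

    // Element [1][2] is at flat position 1*4 + 2*1 = 6.
    size_t[] index = [1, 2];
    writeln(data[flatOffset(index, strides)]); // prints 12
}
```

Because the position is derived only from the strides, the same calculation also covers tensors that share or overlap storage within a buffer, as long as each tensor's own start position and strides are taken from its metadata.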

The file format is automatically detected by readTensors. It is also possible to instantiate a reader for one particular file format instead.

License

Boost Software License, Version 1.0. See file LICENSE_1_0.txt.