UDF is a simple, self-describing, structured binary file format for storing arrays and their relationships. It is intended for scientific and/or industrial data.
It is intended as a light-weight alternative to HDF.
A viewer for UDF files can be found here.
The UDF file is centered around the concept of a directed graph of datasets. A dataset is an unordered list of datatables and their properties. A datatable
The format on disk is simple enough to make it feasible to create custom implementations optimized for your project.
Datasets contain a dynamic number of datatables, making it possible to add additional datatables without interfering with existing applications.
Because data is referenced by absolute file offsets, a reader can efficiently read only the parts of the file it needs and skip everything else. This makes it feasible to read file contents over a network connection (e.g. HTTP Range requests, see the viewer samples for a demo).
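As a minimal sketch of offset-based access, the example below seeks directly to one block and reads only that block. The length-prefixed block layout and the names `write_block` and `read_block` are assumptions made for illustration. They are not the UDF on-disk format.

```python
import io
import struct

# Hypothetical block layout, NOT the UDF on-disk format:
# an 8-byte little-endian length followed by the payload bytes.
LEN = struct.Struct("<Q")

def write_block(f, payload: bytes) -> int:
    """Append a length-prefixed block and return its absolute offset."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(LEN.pack(len(payload)))
    f.write(payload)
    return offset

def read_block(f, offset: int) -> bytes:
    """Seek straight to an absolute offset and read only that block."""
    f.seek(offset)
    (size,) = LEN.unpack(f.read(LEN.size))
    return f.read(size)

f = io.BytesIO()  # stands in for a file on disk
first = write_block(f, b"temperature samples")
second = write_block(f, b"pressure samples")

# Only the second block is read; the first one is skipped entirely.
payload = read_block(f, second)
```

Over HTTP the same two reads would map onto two Range requests: one for the length prefix and one for the payload.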
It is trivial to append new data to the end of the file and rewrite offsets to point to this new data. The previous data remains in the file but is inaccessible because nothing references it. The unreferenced datasets can later be cleaned up with a separate garbage collection pass.
Datasets do not have to be written in a specific order. They are referenced by absolute file offsets, which makes generating the data flexible.
On top of that, datasets can be written to the file in parallel with minimal coordination.
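The append-then-repoint pattern can be sketched as follows. The layout used here (a single 8-byte root offset at the start of the file, followed by length-prefixed blocks) and the helper names are hypothetical and chosen only to keep the example short. The real header and dataset layout are defined by the specification.

```python
import io
import struct

# Hypothetical layout for illustration only (NOT the UDF specification):
# bytes 0..8 hold a little-endian u64 "root" offset; blocks follow, each
# made of an 8-byte length prefix plus the payload.
U64 = struct.Struct("<Q")

def append_block(f, payload: bytes) -> int:
    """Write a block at the end of the file and return its offset."""
    f.seek(0, io.SEEK_END)
    offset = f.tell()
    f.write(U64.pack(len(payload)) + payload)
    return offset

def set_root(f, offset: int) -> None:
    """Overwrite the root offset so it points at a different block."""
    f.seek(0)
    f.write(U64.pack(offset))

def read_root(f) -> bytes:
    """Follow the root offset and return the block it references."""
    f.seek(0)
    (offset,) = U64.unpack(f.read(U64.size))
    f.seek(offset)
    (size,) = U64.unpack(f.read(U64.size))
    return f.read(size)

f = io.BytesIO(U64.pack(0))  # file containing only an empty root slot
old = append_block(f, b"version 1")
set_root(f, old)

# Update: append the new data, then repoint the root at it.
new = append_block(f, b"version 2")
set_root(f, new)

current = read_root(f)
# The old block is still in the file, just no longer referenced.
old_still_present = b"version 1" in f.getvalue()
```

Writing the new block before touching the root means a crash between the two steps leaves the old data reachable.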
The file format contains multiple defences to mitigate data loss from corruption. Appending new datasets while keeping old data around for a while is encouraged; the old datasets can then be recovered easily or garbage collected over time. Existing data is covered by checksums, so integrity can be verified for only the datasets of interest.
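To illustrate per-dataset integrity checks, the sketch below stores a checksum next to each payload and verifies a single block in isolation. The choice of CRC-32 and the 4-byte prefix are assumptions. This text does not say which checksum algorithm or placement UDF uses.

```python
import zlib

def seal(payload: bytes) -> bytes:
    """Prefix a payload with its CRC-32 (4 bytes, little-endian).

    CRC-32 is an assumption made for this sketch, not UDF's documented choice.
    """
    return zlib.crc32(payload).to_bytes(4, "little") + payload

def verify(block: bytes) -> bytes:
    """Return the payload if its checksum matches, else raise ValueError."""
    stored = int.from_bytes(block[:4], "little")
    payload = block[4:]
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch")
    return payload

block = seal(b"dataset bytes")
ok = verify(block)

# Flip every bit of the last byte to simulate corruption.
corrupted = block[:-1] + bytes([block[-1] ^ 0xFF])
try:
    verify(corrupted)
    detected = False
except ValueError:
    detected = True
```

Because each block carries its own checksum, a reader only pays the verification cost for the blocks it actually touches.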
The UDF file can be memory-mapped for convenient access to its structures. The structures and data are aligned to allow zero-copy access. Each dataset is self-contained, making it feasible to map only the data you're interested in.
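A rough sketch of zero-copy access through a memory map: the file is mapped read-only and an aligned region is reinterpreted as an array of f64 values without copying the bytes. The file contents (8 bytes of padding followed by four doubles in native byte order) are invented for the example and do not reflect the UDF layout.

```python
import mmap
import os
import struct
import tempfile

# Invented file contents: 8 padding bytes, then four f64 values in native
# byte order, so the array starts at an 8-byte aligned offset.
values = (1.0, 2.5, -3.0, 4.25)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 8)
    tmp.write(struct.pack("=4d", *values))
    path = tmp.name

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Slice and cast views share the mapped memory; no bytes are copied.
        with memoryview(m) as whole, whole[8:40] as raw, raw.cast("d") as view:
            first = view[0]
            total = sum(view)

os.remove(path)
```

The views are released before the map is closed, since a map cannot be closed while buffers exported from it are still alive.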
Extensive system to describe the datatables and their relationships.
Each datatable contains its primitive type (e.g. u8, i32, f64, ...), dimensions and shape, and a type hint to help interpret the data.
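As an illustration of what such a description enables, the sketch below models a datatable descriptor and derives the byte size of its data from the primitive type and shape. The class name `DataTableInfo`, its fields, and the type-hint string are hypothetical. The actual descriptor fields are defined by the specification.

```python
from dataclasses import dataclass
from math import prod

# Byte widths for a few primitive types named in the text.
PRIMITIVE_SIZES = {"u8": 1, "i32": 4, "f64": 8}

@dataclass(frozen=True)
class DataTableInfo:
    """Hypothetical descriptor; field names are assumptions for this sketch."""
    primitive: str           # e.g. "u8", "i32", "f64"
    shape: tuple[int, ...]   # number of elements along each dimension
    type_hint: str           # free-form hint on how to interpret the data

    @property
    def dimensions(self) -> int:
        return len(self.shape)

    def byte_size(self) -> int:
        """Total bytes occupied by the elements (an empty shape is a scalar)."""
        return PRIMITIVE_SIZES[self.primitive] * prod(self.shape)

matrix = DataTableInfo(primitive="f64", shape=(3, 4), type_hint="matrix")
size = matrix.byte_size()
```

With the type and shape known up front, a generic reader can size its buffers and interpret the bytes without any format-specific code.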
Relationships between datatables can take two forms: an index relationship (the datatable is an index or range into another datatable) and a related relationship (the datatable represents columns of the same structure).
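The two relationship kinds can be pictured with plain lists standing in for datatables. In this sketch `points_x` and `points_y` are related (columns of the same structure), while `curve_ranges` holds half-open ranges that index into them. All names and the half-open range convention are assumptions made for illustration.

```python
# Plain lists stand in for datatables; all names here are hypothetical.

# Related relationship: two columns of the same structure (one row per point).
points_x = [0.0, 1.0, 2.0, 3.0, 4.0]
points_y = [0.0, 1.0, 4.0, 9.0, 16.0]

# Index relationship: each entry is a half-open (start, stop) range into the
# point columns, so each curve owns a contiguous run of points.
curve_ranges = [(0, 2), (2, 5)]

def rows(start: int, stop: int) -> list[tuple[float, float]]:
    """Combine the related columns into rows for one index range."""
    return list(zip(points_x[start:stop], points_y[start:stop]))

curves = [rows(start, stop) for start, stop in curve_ranges]
second_curve = curves[1]
```

A generic viewer that knows these relationships can group columns into records and follow ranges without understanding what a "curve" is.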
The datatables come with type information to allow a generic viewer to visualize and display information regardless of the specific metadata.
Detailed specification can be found here.
TODO!