BSF: The Big Simple Format file specification version 0.0.1 <add official release date here when satisfied> This document defines the Basic Simple Format file specification. Aims: define a simple to read/write format for numeric data. Accenting the simple part of the definition the data is laid out in a regular/fixed format that is n-dimensional and uncompressed. (A simple file format should not require readers and writers to be complex enough to support compression) All numbers are stored in high byte first order (Big endian) Example pseudo format: "BSF" : file type marker string <15-bit unsigned number> (Major version) <15-bit unsigned number> (Minor version) <15-bit unsigned number> (Subminor version) <some set of (C?) string key value pairs of metadata> <some affine definitin of the coordinate system?> number of dimensions (unsigned 63 bit type) dimension granularity byte count (all dimensions are unsigned). This field describes the number of 8-bit bytes each following dimension number is specified with. Valid values are 1 to 65536. dimension 0 : (a number within the granularity specified above) ... dimension n-1 : (a number within the granularity specified above) <the type of numeric data entries that follow in subentities INT [numbits between 1 and 65536] UINT [numbits between 1 and 65536] IEEE_FLOAT [numbits == 8? or 16 or 32 or 64 or 128] (some day: POSIT posit descriptors of some kind) > <subentity type: 1 (REAL) or 2 (COMPLEX) or 4 (QUATERNION) or 8 (OCTONION)> <entity dimension count: UINT15 0 = number, 1 = vector, 2 = matrix, 3+ = tensor> <entity dimension granularity bit count: valid values are 1 to 2^64-1 <entity dimension 0 : within entity dimension granularity> ... <entity dimension n-1 : within entity dimension granularity> <all the numeric data follows here sequentially. the order of elements is dimension0/dimension1/dimension2/etc and within then by entity and then by subentity. I will soon make an example that shows the order. the dimension indices increase leftmost first. the entities also go subentity dim 0 to subentity dim n-1.> Example code that would work to read the data portion: // read the data header values for (uint63 i = 0; i < NUMDIMS; i++) { read and store DIM value i read and store DIM calibration equation string } read and store the subentity component type (such as INT12, or FLOAT64, etc) read and store the number of components that make up a sub entity (1, 2, 4, or 8) read and store the entity type dimension count (i.e. 2 says the entities are 2-d matrices of 1, 2, 4, or 8 sub entities each) read and store the ENTITY dim granularity (UINT63) for i = 0 < entity type dim count read and store the ENTITY_DIM // now read the data total entities = each DIM multiplied together for bignum i = 0; i < tot entities; i++ // read an entity for uint63 d = 0; d < entity dim count for the dims of the entity (for instance a 3x4 matrix) for (se = 0; se < 1 or 2 or 4 or 8; se++) { read a subentity component value (like the 1st component of a cmplx num) } make a sub entity from those read values (like a quat<float32>) store the subentity at the right position within the entity (e.g set vector[7] = subentity read) Notes/questions - even though someone could be specifying UINT12's they take up two bytes. Again no form of compression is supported. - all numeric values are to be specified as running from high byte to lo byte (big endian). This simplifies the reading of the numbers into arbitrary precision integers. You can read one byte and then shift left 8 bits and read the next byte etc. It also makes the format more simple to understand. - equation language must be laid out so that one knows how to parse. for example specify what functions are valid (sin? acos? erfc? etc.) but the calibration strings can be ignored by anyone who wants to. - supporting this format requires arbitrary precision ints for sizes and arbitrary precision floats for axis calibration / equation support. - how complex should the subentity codes be. The basic couple I've defined? Or a ton (various color model formats, gaussian integers, others?). Ideally the entities can be identified enough that a program using this can know what mathematical operations are valid for the data. If you supported RGB and ARGB and YUV and CMYK and HSV/B you could reason what the 3 or 4 channels mean. Otherwise you'd just have one dimension that had value of 3 or 4 and you had no idea the data is RGB or ARGB or YUV etc. As I think more I believe that we should keep the type count minimized so no color model types. - should I define a fixed record format so that a record can be made of a set of mixed types? The format would still be a bunch of fixed records but not just of one kind of number. This would allow CMYK or YUV etc. but there is no inferring from a record type that it is a CMYK etc. MUCH LATER IDEAS Let's simplify a lot A BSF file is an n-dimensional set of data records. Each BSF file contains one kind of data record type. The BSF header defines the structure of the data records. A record is described as ... a list of int32 codes binary "1": signed integer. followed by a number of bits. binary "2": unsigned integer. followed by a number of bits. binary "3": ieee 8 bit float binary "4": ieee 16 bit float binary "5": ieee 32 bit float binary "6": ieee 64 bit float binary "7": ieee 128 bit float binary "8": utf16 char array. followed by a number specifying the size of the array. will these chars pack well given that each char can be a different size? binary "-1" : end of record a record definition is a list of consecutive codes And a default value for a record can be defined in the header. encoded matching the definition in the header (for instance someone could set to zeroes or NaN or blank strings etc) (it could be that when you define the record you also define the default value for each field instead of defnining a separate "zero" value record.) a record definition also has a 512 char array name A BSF file specifies it's number and size of dimensions. All numbers in a BSF file are saved in big endian format. All characters in a BSF file are saved as UTF16. Each record in a BSF file is preceded with the coordinates of the point. The coordinate system is that the origin of the data is always at the zero point of every axis with each axis growing away from 0 in a positive direction. Since the file specifies a default values record you can write sparse files by only enumerating points that have non-default values. Everything else in the file will be set to the default record value. There are some reserved record names. You can define your own record name or better yet register them here within this repo. I will document them as possible. Reserved record names: are strings going to factor into this? strX : X is any number and is the number of UTF16 chars in the string intX : X is any number and is the bits contained in the int uintX : X is any number and is the bits contained in the unsigned int ieee16 : ieee float 16 (half precision) ieee32 : ieee float 32 (single precision) ieee64 : ieee float 64 (double precision) ieee128 : ieee float 128 (quad precision) rgb24 : a record of three 8 bit numbers representing an RGB color rgb48 : a record of three 16 bit numbers representing an RGB color rgb72 : a record of three 24 bit numbers representing an RGB color rgb96 : a record of three 32 bit numbers representing an RGB color argb32 : a record of four 8 bit numbers representing an ARGB color argb64 : a record of four 16 bit numbers representing an ARGB color argb96 : a record of four 24 bit numbers representing an ARGB color argb128 : a record of four 32 bit numbers representing an ARGB color yuv cmyk labcie hsv or hsb or hsl or all of them bcd? decimal? etc. Header BSFF <int32> : version of file format <record definition> <default record value> <data records section> The data section records is just <dim 1 value><dim 2 value><...><dim n value><record value> dims go from 0 to max-1. I think 0-based is best for simple file reading into arrays. And continues like this until the end of the file. (One weakness of format: Due to sparseness a file that just ends due to bad communications with disk or across network maybe look complete. I think I need an EOF marker.> eof marker could be -100. -100 is never a valid dimension so how could you use this system to specify a record that was a 4x4 matrix or a 10 element vector or some tensor? a "conglomerate" type that has a backing number type and n-dims. a vector would be 1 dim of numbers. a matrix would be two dims of numbers. a thensor would be n dims of data. note that the backing type could be strings too.