Proposed universal format for Unicode text files
Here I propose a universal format for Unicode text files, specified by its Ada implementation. Some features of the format are:
- ASCII code points are encoded as themselves
- At most three bytes are used to encode a code point
- All code points may occur in a line
- Lines may be any length (though the implementation limits line length to Integer'Last)
I have called this format Universal Text File format, with the acronym UTF. As with GNAT Programming Studio (GPS), this results in an acronym collision. Suggestions for alternative names are welcome.
The Ada implementation that specifies the format, package UTF, is short and straightforward.
Three simple tools are provided:
- To_UTF, to convert an Ada.Text_IO file to UTF
- From_UTF, to convert a UTF file into an Ada.Text_IO file
- Umore, a simple "more" program for UTF files
Most tools for processing text files on the major platforms work with native text files from other platforms, and UTF-8 has been widely adopted for encoding Unicode text files, so it seems unlikely that an alternative will gain much traction. However, having done this, I thought I would share it in case anyone is interested.
Encodings should normally be used only externally to a program. Encoded input data should be decoded immediately upon input, and output data encoded immediately before output.
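The decode-on-input, encode-on-output principle can be sketched in Ada as follows. This is only an illustration under stated assumptions: the interface of package UTF is not shown here, so the sketch uses UTF-8 and the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings as a stand-in codec. The names Boundary_Demo and Reversed are hypothetical, and the input is a hard-coded byte sequence rather than a real file.

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Boundary_Demo is
   --  Assumption: UTF-8 via the standard library stands in for the
   --  proposed format; package UTF itself is not used here.
   package Codec renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  Internal processing sees only decoded code points, never bytes.
   function Reversed (Item : Wide_Wide_String) return Wide_Wide_String is
      Result : Wide_Wide_String (Item'Range);
   begin
      for I in Item'Range loop
         Result (I) := Item (Item'Last - (I - Item'First));
      end loop;
      return Result;
   end Reversed;

   --  Stand-in for bytes read from a file: "h", e-acute (two bytes
   --  in UTF-8), then "llo".
   Encoded_In : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     "h" & Character'Val (16#C3#) & Character'Val (16#A9#) & "llo";

   --  Decode immediately upon input.
   Decoded : constant Wide_Wide_String := Codec.Decode (Encoded_In);

   --  Encode immediately before output.
   Encoded_Out : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     Codec.Encode (Reversed (Decoded));
begin
   Ada.Text_IO.Put_Line (Encoded_Out);
end Boundary_Demo;
```

Reversing the decoded Wide_Wide_String keeps the two-byte character intact, whereas reversing the encoded bytes directly would split it. That is the practical reason for confining the encoding to the program's input and output boundaries.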