Loading text files using byte order mark

Question

Dawoodoz opened this issue 4 years ago · 4 comments

string_load in Source/DFPSR/base/text.cpp is currently just skeleton code that works for the current examples. It should read the BOM bytes at the beginning of a document, map each format correctly to the internal UTF-32 string format, and assume ASCII if no BOM is detected.
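As a rough illustration of the BOM check described above, here is a minimal sketch covering only the UTF-8 and UTF-16 marks. The names `TextEncoding`, `BomResult` and `detectBom` are hypothetical and not taken from text.cpp. When no BOM is found, the caller decides on the fallback (ASCII as proposed here, or Latin-1 as in the later comments).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical names for illustration; not the identifiers used in text.cpp.
enum class TextEncoding { NoBom, Utf8, Utf16BE, Utf16LE };

struct BomResult {
    TextEncoding encoding; // Detected format, or NoBom when nothing matched
    size_t bomLength;      // Number of leading bytes to skip before the content
};

// Compares the first bytes of the buffer against the known byte order marks.
BomResult detectBom(const uint8_t *data, size_t size) {
    // UTF-8: EF BB BF
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return {TextEncoding::Utf8, 3};
    }
    // UTF-16 big-endian: FE FF
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return {TextEncoding::Utf16BE, 2};
    }
    // UTF-16 little-endian: FF FE
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return {TextEncoding::Utf16LE, 2};
    }
    // No recognized BOM: the caller chooses the fallback encoding.
    return {TextEncoding::NoBom, 0};
}
```

The decoder would then start reading at `data + bomLength`.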

While it could probably just link dynamically to some existing solution, the point of this project is that only trivial and well-defined things may be left to interpretation by the compiler. If someone has to port this to another language a few hundred years from now, it's good to have a reference detailing the text interpretation byte by byte for the different formats, including how line endings are handled.

Answer 1 · 2020-08-18T00:50:20.000Z

Currently implementing support for UTF-8.

Answer 2 · 2020-08-19T12:50:36.000Z

Done implementing raw Latin-1 and UTF-8 with BOM. Files without a BOM will be loaded as Latin-1, just like before. Characters above code 127 must be written in UTF-32 literals U"" to avoid ambiguity, so it's a good habit to use U"" often to prevent mistakes. Files are now saved as UTF-8 with BOM and CrLf line breaks, so that they are readable in text editors on most platforms. UTF-16 BE and LE remain to be implemented. The internal representation is UTF-32 with Lf line breaks, so that one character or line break is one element in the string, which makes advanced parsing and text processing a lot easier. Combination signs are not yet implemented, but they can be applied while typing in a textbox component.
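To make the byte-level mapping concrete, here is a minimal sketch of decoding Latin-1 and UTF-8 into a UTF-32 string with Lf line breaks. It is not the code in text.cpp: the function names are hypothetical, `std::u32string` stands in for the library's own string type, the input is assumed to start after any BOM, and carriage returns are simply dropped (an assumption that also removes lone Cr characters).

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Appends one code point, dropping carriage returns so that CrLf becomes Lf.
// Assumption for this sketch: a lone Cr is dropped as well.
static void appendNormalized(std::u32string &out, char32_t c) {
    if (c != U'\r') { out.push_back(c); }
}

// Latin-1: every byte value is already the Unicode code point.
std::u32string decodeLatin1(const uint8_t *data, size_t size) {
    std::u32string out;
    for (size_t i = 0; i < size; i++) { appendNormalized(out, (char32_t)data[i]); }
    return out;
}

// UTF-8: the lead byte tells how many continuation bytes follow,
// and each continuation byte contributes its lower six bits.
std::u32string decodeUtf8(const uint8_t *data, size_t size) {
    std::u32string out;
    size_t i = 0;
    while (i < size) {
        uint8_t lead = data[i++];
        char32_t c;
        size_t extra;
        if (lead < 0x80)                { c = lead;        extra = 0; } // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { c = lead & 0x1F; extra = 1; } // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { c = lead & 0x0F; extra = 2; } // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { c = lead & 0x07; extra = 3; } // 11110xxx
        else                            { c = U'?';        extra = 0; } // invalid lead byte
        for (size_t k = 0; k < extra && i < size; k++, i++) {
            c = (c << 6) | (char32_t)(data[i] & 0x3F);
        }
        appendNormalized(out, c);
    }
    return out;
}
```

This sketch does not verify continuation bytes or reject overlong sequences, so malformed input yields wrong characters instead of an error.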

Answer 3 · 2020-08-19T20:35:02.000Z

Now supports UTF-16 and passes regression tests for many exotic languages. There's just no way to print the characters yet, because the font atlas only covers Latin-1 and there's no programmable font system for contextual characters or right-to-left writing. Remaining character encodings will either be detected and throw an error, or go undetected and have their content interpreted as garbage Latin-1.
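For reference, a minimal sketch of how UTF-16 code units can be mapped to UTF-32, including surrogate pairs for code points above U+FFFF. The function name `decodeUtf16` is hypothetical, `std::u32string` stands in for the library's string type, the input is assumed to start after the BOM, and line-break normalization is left out to keep the example short.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Reads one 16-bit code unit at byte offset 'at' in the given byte order.
static uint32_t readUnit(const uint8_t *data, size_t at, bool bigEndian) {
    uint32_t first = data[at];
    uint32_t second = data[at + 1];
    return bigEndian ? ((first << 8) | second) : ((second << 8) | first);
}

// Decodes UTF-16 bytes into UTF-32. A trailing odd byte is ignored.
std::u32string decodeUtf16(const uint8_t *data, size_t size, bool bigEndian) {
    std::u32string out;
    size_t i = 0;
    while (i + 1 < size) {
        uint32_t unit = readUnit(data, i, bigEndian);
        i += 2;
        // A high surrogate (D800..DBFF) followed by a low surrogate (DC00..DFFF)
        // encodes one code point in the range U+10000..U+10FFFF.
        if (unit >= 0xD800 && unit <= 0xDBFF && i + 1 < size) {
            uint32_t low = readUnit(data, i, bigEndian);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                i += 2;
                unit = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
            }
        }
        // Unpaired surrogates are passed through unchanged in this sketch.
        out.push_back((char32_t)unit);
    }
    return out;
}
```

The `bigEndian` flag would come from which BOM was found: FE FF for big-endian, FF FE for little-endian.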

Answer 4 · 2020-08-19T23:15:16.000Z

The first two bits of each extended (continuation) UTF-8 byte are always 10, which could be used to validate that the file is valid UTF-8. That check is not done yet, but the current loader works for now.
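A minimal sketch of such a check, assuming a hypothetical function `looksLikeUtf8` that is not part of the library: it confirms that every lead byte is followed by the expected number of bytes of the form 10xxxxxx.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true if every multi-byte sequence has a valid lead byte followed by
// the right number of continuation bytes (top two bits equal to 10).
// This sketch does not reject overlong encodings or surrogate code points.
bool looksLikeUtf8(const uint8_t *data, size_t size) {
    size_t i = 0;
    while (i < size) {
        uint8_t lead = data[i++];
        size_t extra;
        if (lead < 0x80)                { extra = 0; } // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) { extra = 1; } // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) { extra = 2; } // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) { extra = 3; } // 11110xxx
        else { return false; } // Stray continuation byte or invalid lead byte
        for (size_t k = 0; k < extra; k++) {
            // The sequence must not be truncated, and each following byte
            // must match the bit pattern 10xxxxxx.
            if (i >= size || (data[i] & 0xC0) != 0x80) { return false; }
            i++;
        }
    }
    return true;
}
```

A check like this could also help tell UTF-8 without a BOM apart from Latin-1, since most Latin-1 text with characters above 127 fails it.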