Support storing trailing \0 byte at the end of string
khng300 opened this issue · 5 comments
Hi, not sure if I'm missing anything, but I recently discovered that cista::generic_string does not store the \0 byte at the end of a long string (or of a string that exactly hits the short_length_limit). As a workaround, I am currently drafting my own string type for this purpose.
Is there any plan to work this out? Or do we need to propose a new type?
(Not really related, just a side topic: what about supporting \0 within the content of a short string?)
Correct. Currently, cista::string/string_view both behave like std::string_view in that they don't store the terminating \0 the way C-style strings do. The reason is that this terminating \0 is usually not something you want serialized into a compact binary buffer. A terminating \0 is not necessary if you know the exact length (which is the case for cista::string). The only reason you might want the terminating \0 would be compatibility with library code written in C. In all other cases, you do not want the overhead of storing or transmitting obviously redundant information (size + \0 terminator).
It is, however, not that hard to trick cista::string into storing your extra \0. One way would be to call the constructor that takes a char const* and a length. There, you can set the length to the length of the string including the terminating \0. You might want to create a wrapper around cista::string that uses this trick in a few more places. But I don't think it's necessary to create a completely new type for this purpose.
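A minimal sketch of the wrapper idea described above. To keep it self-contained it does not include cista: `sized_string` is a hypothetical stand-in for a size-only string with a (char const*, length) constructor, and `c_string` with its `stored_size()` member is an illustrative name, not part of any library.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical stand-in for a size-only string such as cista::string:
// it stores exactly `len` bytes and never appends a terminator itself.
class sized_string {
public:
  sized_string(char const* s, std::size_t len)
      : data_{std::make_unique<char[]>(len)}, size_{len} {
    std::memcpy(data_.get(), s, len);
  }
  char const* data() const { return data_.get(); }
  std::size_t size() const { return size_; }

private:
  std::unique_ptr<char[]> data_;
  std::size_t size_;
};

// Hypothetical wrapper that applies the "length + 1" trick once,
// so callers cannot forget the extra byte.
class c_string {
public:
  // `s` must point to a null-terminated string.
  explicit c_string(char const* s) : impl_{s, std::strlen(s) + 1} {}

  // Safe to hand to C APIs: the terminator is part of the stored bytes.
  char const* c_str() const { return impl_.data(); }

  // Logical length, without the stored '\0'.
  std::size_t size() const { return impl_.size() - 1; }

  // Number of bytes that would actually be stored or serialized.
  std::size_t stored_size() const { return impl_.size(); }

private:
  sized_string impl_;
};
```

With this, `c_string s{"abc"}` reports `s.size() == 3` while keeping 4 bytes, so `s.c_str()` can be passed straight to C code. The cost is the one redundant byte per string mentioned above.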
cista/include/cista/containers/string.h
Line 341 in 0a7a784
Consider the idea of an automatic "null terminator with size" for the small-string optimization (by Andrei Alexandrescu):
https://youtu.be/kPR8h4-qZdk?t=410
With a little bit of "mixing", it could also be embedded in non-small strings (adding an extra 4/8 bytes at the end for the size/null terminator, as above).
This would make it possible to use cista::string.data()/.begin() directly for const char* inputs, whereas right now we have to go through a conversion to std::string_view/std::string.c_str().
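For readers who have not seen the talk, here is a rough sketch of the small-string trick as it is commonly described: the last byte of the inline buffer holds the remaining capacity, so a completely full buffer stores 0 there, and that same byte serves as the null terminator. `inline_string` and its 15-character capacity are assumptions made for illustration; this is not cista code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical fixed-capacity inline string, not part of cista.
class inline_string {
public:
  static constexpr std::size_t capacity = 15;  // usable characters

  inline_string(char const* s, std::size_t len) {
    assert(len <= capacity);
    std::memcpy(buf_, s, len);
    buf_[len] = '\0';  // terminator directly after the text
    // The last byte stores the remaining capacity. When len == capacity,
    // this writes 0 into buf_[capacity], the very byte that already
    // acts as the terminator, so size and '\0' share one byte.
    buf_[capacity] = static_cast<char>(capacity - len);
  }

  // Size is recovered from the remaining-capacity byte.
  std::size_t size() const {
    return capacity - static_cast<std::size_t>(buf_[capacity]);
  }

  // Always null-terminated, so usable for const char* inputs.
  char const* c_str() const { return buf_; }

private:
  char buf_[capacity + 1];
};
```

The whole object occupies 16 bytes and needs no separate size field, while `c_str()` stays valid for any length from 0 to 15. Extending the same idea to heap-allocated strings, as suggested above, would need additional design work that this sketch does not attempt.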
That technique makes sense. Currently, cista::string does not have a capacity (only a size). The idea is that for serialization, the capacity and size would always be the same, so there's no point in having an extra field. If you use the data structure as a replacement for std::string, that's a different story.
Overall, I think it makes sense to write a new generic class that can work as a vector and a string with the "small-vector" or "small-string" optimization. Making this generic has the advantage that cista::vector would not need to allocate memory in case the data fits into its fields. Another advantage would be the ability to change CharT and have a cista::wstring. Doing Andrei's optimization would also be nice.
However, currently I am busy with another project. So don't expect this to happen very soon (this also applies to the other issues you opened which would probably also benefit from this change).
Thank you for your analysis, I appreciate your insights.
...in the meantime, I will try to propose a not-so-generic solution, focusing specifically on a non-heap string, in the next few days.
@khng300 Wow, superb work! I will be conducting comprehensive tests on my end throughout the weekend.