lemire/Code-used-on-Daniel-Lemire-s-blog

Implication that padding zeroes would be copied seems to be wrong in multiple ways

jarikirma opened this issue · 3 comments

const static std::string options[] = {
"https\0\0\0", "http\0\0\0\0", "ftp\0\0\0\0\0", "file\0\0\0\0",
"ws\0\0\0\0\0\0", "wss\0\0\0\0\0", "garbage\0", "fake\0\0\0\0",
"httpr\0\0\0", "filer\0\0\0"};
std::vector<std::string_view> answer;
answer.reserve(length);
for (size_t pos = 0; pos < length; pos++) {
const std::string &picked = options[d(gen)];
answer.emplace_back(std::string_view(picked.data(), strlen(picked.data())));
}

I'm not really a C++ coder, but first of all: strlen explicitly strips length beyond the first '\0', which prevents testing this code with non-zero bytes after one. Secondly, it would seem that even if strlen is replaced with 8 (... which breaks the half of tests which actually relies on getting the length this way), those bytes are not copied - are they just by chance uninitialised zero generated by padding?

I found this out while trying out a relatively simple nul-terminated string to zero-padded uin64_t with a fast path under assumption that only strings which cross page boundaries need byte-accurate reads and that padding can be performed by applying Hacker's Delight algorithm for finding the first zero byte.

Please refer to the C++ documentation on string_view.

The line...

answer.emplace_back(std::string_view(picked.data(), strlen(picked.data())));

is not meant to copy any of the bytes. A string_view is just like a pointer.

I could have just done...

std::vector<const char*> answer; 
...
answer.emplace_back(picked.data());

and it would have been pretty much the same result.

strlen explicitly strips length beyond the first '\0',

The strlen function does not affect the input, it merely returns a number.

Per StackOverflow:

How do you construct a std::string with an embedded null?

So, I'd suggest the following:

  using namespace std::string_literals;
  const static std::string options[] = {
      "https\0\0\0"s,    "http\0\0\0\0"s,  "ftp\0\0\0\0\0"s, "file\0\0\0\0"s,
      "ws\0\0\0\0\0\0"s, "wss\0\0\0\0\0"s, "garbage\0"s,     "fake\0\0\0\0"s,
      "httpr\0\0\0"s,    "filer\0\0\0"s};

This is a std:string specific issue...

You are correct, I will change my code.