hideakitai/MsgPack

Support packing `__FlashStringHelper*`

matthijskooijman opened this issue · 10 comments

On AVR, strings can be put into flash by using (among others) the F() macro. This macro returns a pointer to the string as a __FlashStringHelper* pointer, which is really just a type-casted char*, but with at different type so it can be handled differently. This was originally somewhat of an internal type, intended to allow overloading Print methods (so you can do Serial.println(F("foo")) and have it work, but this type has been used in a large number of libraries as well.

It would be nice if MsgPack could pack this type as well. This should be fairly straightforward, just add an overload of pack() and the packString* methods, which can cast the const __flashStringHelper* to a const char*, use strlen_P() to find the string lenght and use pgm_load_byte() on each successive address to find each character. Having a helper function to do the heavy lifting probably makes sense, to not repeat everything for the different string lengths, or maybe the packString* functions could see some refactoring to reduce duplication there.

One alternative would be to just pass the __FlashStringHelper to str_t, since at least Arduino's String class can just convert it. However, this involves a dynamically allocated buffer, which I'd rather avoid, and just handle this byte-by-byte instead. Looking better at the current code, I see that char* strings are currently also handled through str_t, so involving one extra allocation and string copy, might be good to also improve that.

If you're interested, I can probably find some time to submit a PR for this.

Thanks very much for the proposition. Followings are fixed on this branch. Please check if it works.

  • performance improvement for string
  • suppoort F() macro

Thanks. I haven't tested yet, but did look at the code. I think there's more opportunity for optimization, and I think you misunderstood the workings of the __FlashStringHelper*. I'll discuss the optimization first, and in a proposal for that, include the flash string stuff.

String pack improvements look good, but packing a const char* string will still end up calling strlen twice (once in pack to decide which string type use, then once in e.g. packString5 to get the string length to encode). Since strlen needs to iterate over the string, this is not a really quick operation. I would suggest adding a len argument to the packStringX functions, so pack can pass the already resolved string length to packStringX. The argument default to 0, and inside you can do if (len == 0) len = strlen(value); for backward compatibility (which still runs strlen() twice for zero-length strings, but that's probably acceptable. Alternatively, just make the default (size_t)-1 or something like that.

If you do this, an additional simplification would be to let the const char* pack functions also handle the str_t case. E.g.:

void packString8(const str_t& str) {
    packString8(str.c_str(), str.length());
}

This does not work for the __FlashStringHelper* since that needs additional processing for each character. But maybe additional code sharing can be achieved by introducing e.g. an packStringSize8(size_t) helper that just packs the string size, but not the data, that can be called by both packString8(const char*) and packString8(const __FlashStringHelper*). Having a packStringSize(size_t) helper might be even cleaner, which calls packStringSizeX based on the passed size, and can be called by e.g. pack(const char*) and pack(str_t) directly, removing the need for a size argument to e.g. packString8 as I suggested above. Similarly a packStringData and generic getStringLen could further reduce duplication.

Summarizing, you'd have:

size_t getStringLen(const str_t& str) {
  return str.length();
}

size_t getStringLen(const char* str) {
  return strlen(str);
}

pack(const str_t& str) {
  size_t len = getStringLen(str);
  packStringSize(len);
  packStringData(str, len);
}

pack(const char* str) {
    size_t len = getStringLen(str);
    packStringSize(len);
    packStringData(str, len);
}

packString8(const str_t& str) {
  size_t len = getStringLen(str);
  packStringSize8(len);
  packStringData(str, len);
}

packStringSize8(uint8_t len) {
              packRawByte(Type::STR8);
              packRawByte(len);
}

packStringSize(size_t len) {
              if (len <= (size_t)BitMask::STR5)
                  packStringSize5(str);
              else if (len <= std::numeric_limits<uint8_t>::max())
                  packStringSize8(str);
              else if (len <= std::numeric_limits<uint16_t>::max())
                  packStringSize16(str);
              else if (len <= std::numeric_limits<uint32_t>::max())
                  packStringSize32(str);
}

packStringData(const str_t& str, size_t len) {
              packRawBytes(str.c_str(), len);
}
packStringData(const char* str, size_t len) {
               packRawBytes(str, len);
}

Note that the pack() and packStringX() functions all have identical structures, so maybe some template helper could further reduce duplication there.

To add __FlashStringHelper* support, you'd just have to add:

pack(const __FlashStringHelper* str) {
  size_t len = getStringLen(str);
  packStringSize(len);
  packStringData(str, len);
}

packString8(const __FlashStringHelper* str) {
  size_t len = getStringLen(str);
  packStringSize8(len);
  packStringData(str, len);
}

size_t getStringLen(const __FlashStringHelper* str) {
  return strlen_P((const char*)str);
}
packStringData(const __FlashStringHelper* str, size_t len) {
  char* p = (char*)str;
  while (1) {
    uint8_t c = pgm_read_byte(p++);
    if (c == 0) break;
    packRawByte(c);
  }
}

Note that the first two functions are again identical as before, except for typing. The latter two are the actual processing functions. Also note that this has a loop to process byte by byte, since there is no way to just convert to a char* and use that directly (since that would require copying data from flash to RAM and need a buffer which we want to avoid).

As for your second commit, I'm not sure I understand it really. AFAIU, before this commit, const char* was not actually accepted as a valid type for serialize? The only pointer types allowed were binary pointers, which must be followed by a size_t parameter? And with this change, serialize now does accept const char* and __FlashStringHelper* as things to serialize? But only const char* is supported by pack() in your branch, so I think passing __FlashStringHelper* would cause an error? This will be fixed by the above, of course.

However, there's another ambiguity that this change introduces: What if I want to serialize a const char* string followed by a size_t integer? I suspect this will now (with your branch) result in an ambigious function call, or maybe it will favor the binary version for having less template arguments or something? Before your second commit, this was resolved by not allowing any pointers in the varargs serialize version, but this is now broken.

Thinking about this, I wonder if the serialize(T*, size_t) should exist at all? AFAICS, serialize is a helper function that allows serializing multiple things, but this version only allows a single thing. And if you want to serialize a single thing, why not just call pack() directly?

A related question is why does serialize(arr_size_t, ...) even exist? AFAICS the generic varargs serialize could also handle it and forward it to pack(arr_size_t) which also exists? Same for map_size_t?

Sorry, I've completely misunderstood how F() works... Fixed and tested on commits above.

What if I want to serialize a const char* string followed by a size_t integer? I suspect this will now (with your branch) result in an ambigious function call, or maybe it will favor the binary version for having less template arguments or something? Before your second commit, this was resolved by not allowing any pointers in the varargs serialize version, but this is now broken.

I haven't explained but in the previous version (master branch), only const char[](string literals) is allowed in serialize() without size argument, so I've fixed it to accept const char* in addition to __FlashStringHelper. Every types can be accepted and tested on latest branch above.

Thinking about this, I wonder if the serialize(T*, size_t) should exist at all? AFAICS, serialize is a helper function that allows serializing multiple things, but this version only allows a single thing. And if you want to serialize a single thing, why not just call pack() directly?

Yes, the size of array disappears in some context, so we should use size for pointers or arrays.

Sorry, I've completely misunderstood how F() works... Fixed and tested on commits above.

Yup, looks like it should work. Haven't tested yet, though. It does seem there's still room for more simplification and removing code duplication, as I suggested above (i.e. there's multiple places that encode the same string type/size bytes, and that decide how long a string to encode).

Yes, the size of array disappears in some context, so we should use size for pointers or arrays

My suggestion was to not support arrays as arguments to serialize for this reason, so if you want to pack an array, you always have to call pack directly, since that has no ambiguity between packing one or two items. E.g. consider serialize(const char*, size_t), I think that could be interpreted as an array of characters, rather than a string now (and even if that would not happen due to different overload priorities, who is to say that this is not actually meant as an array of characters, or alternatively a fixed-length, non-zero-terminated string)?

Yup, looks like it should work. Haven't tested yet, though. It does seem there's still room for more simplification and removing code duplication, as I suggested above (i.e. there's multiple places that encode the same string type/size bytes, and that decide how long a string to encode).

Yes, I will try to improve that while keeping current APIs (of course if possible).

My suggestion was to not support arrays as arguments to serialize for this reason, so if you want to pack an array, you always have to call pack directly, since that has no ambiguity between packing one or two items.

It sounds better. I haven't serialize() array or pointer types so far, and it is ambiguous as you said. Let's remove serialize(ptr, size) version and restrict them to use pack(). Maybe it affects to the serializing of:

  • BIN format pack(const uint8_t* bin, const size_t size)
  • ARRAY format pack(const T* arr, const size_t size)

Ok!

One more thing I realized: How to interpret pack(const char*, size_t)? Is that an array of signed 8-bit integers, or a string with fixed length? I think there is value in both, not sure how to disambiguate?

I've pushed improvements above to this branch (it's still previous ArxTypeTraits though).

How to interpret pack(const char*, size_t)?

I just overloaded pack(const char*, size_t) and it's not ambiguous because it's only one path for it (uint8_t* is not same as char* and T* arr is disabled when T is char).

(i.e. there's multiple places that encode the same string type/size bytes, and that decide how long a string to encode).

I think currently there's no duplication (it's not same as what you showed above though), where do you point out? I wonder I may miss it.

I just overloaded pack(const char*, size_t) and it's not ambiguous because it's only one path for it (uint8_t* is not same as char* and T* arr is disabled when T is char).

Ok, sounds good. You do lose the ability to encode an array of chars (rather than a string), but that's probably just fine (if you really need this, I think you can still pack an array size followed by individual chars).

I think currently there's no duplication (it's not same as what you showed above though), where do you point out? I wonder I may miss it.

Here:

packRawByte((uint8_t)Type::STR5 | ((uint8_t)len & (uint8_t)BitMask::STR5));

packRawByte((uint8_t)Type::STR5 | ((uint8_t)len & (uint8_t)BitMask::STR5));

This duplicates information about how to encode e.g. a string5 length, but that's not too much of a big deal I guess.

There was previously also duplication of the string length decision (which I found more significant) but you solved that by introducing the packString template, so I guess that's fine now.

Thanks for the comments, okay I'm going to merge this branch to master later.

Merged! Thank you so much for your great contribution again :)
I'm going to close this issue.