Reading and writing CBOR documents in pieces
sjlongland opened this issue · 17 comments
Hi,
Is it possible to parse and emit a CBOR document in parts? The use case here is where I might have a CBOR file on the order of 2 kB in size, but I don't want to allocate a 2 kB buffer to read the whole thing in one go (I've allocated 4 kB for a stack and don't have a lot of heap space available).
The thinking is that when reading, I might allocate a 256- or 512-byte buffer (on the stack), read the first buffer-full worth of data, then call the parser on that. As I get near the end, I watch for `CborErrorAdvancePastEOF`; when this happens, I obtain a copy of TinyCBOR's read pointer, do a `memmove` to move the unread data to the start of my buffer, read some more in, then tell TinyCBOR to move back to the start of the buffer to resume reading whatever CBOR value was being pointed to at the time.
Likewise with writing: when `CborErrorOutOfMemory` is encountered, I can do a `write` of the buffer, then tell TinyCBOR to continue from the start of the buffer and resume writing the output.
Such a feature would really work well with CoAP block-wise transfers, as the data could be effectively "streamed" instead of having to buffer the lot.
I tried looking around for whether there was a flag I could specify on the decoder, but couldn't see any flags in the documentation.
Regards,
Stuart Longland
Hello Stuart
Thank you for the suggestion. This is definitely a goal for TinyCBOR.
It's possible to do what you ask for parsing, though currently only in the dev branch (and possibly my fork's dev branch). It's not a trivial API. Here's the current implementation:
For reading, you read until you get `CborErrorUnexpectedEOF` (any other error indicates a corrupted CBOR stream). When you have more data, call `cbor_value_reparse`. I don't have an API to adjust the pointers in the structures, so you must either do it manually or instead use `cbor_parser_init_reader`, which will call out to your own callback functions where you supply the data.
I have not implemented this for writing, but I do have a `cbor_encoder_init_writer` with a callback. It's possible this implementation already allows for just re-sending the same content after the output buffer is adjusted.
Either way, dealing with strings is non-trivial. Even the current implementation of the reader requires the entire string chunk to be present in memory to be decoded. You'd need to chunk your strings on writing (no API for this yet) and you'd need to ensure your writer only sent chunks smaller than your buffer.
@thiagomacieira @sjlongland Hey! Any update on this? I have exactly this scenario: streaming a CBOR payload of potentially large size (thousands of bytes) to a hardware crypto wallet, a "Ledger Nano S", which definitely falls under CoAP (Constrained Application Protocol), since communication from the host machine (sending the long CBOR payload) to the Ledger is limited to 255 bytes. And I cannot make use of more than ~1600 bytes in total for my C program (a Ledger Nano S app). That will probably be fine, since I just need to hash the payload and also parse out some sparse data from my CBOR payload.
Made extra tricky in my situation is the fact that some of my CBOR Map types (major type 5) will be larger than the chunk size of 255 bytes; i.e. a single CBOR Map might begin at `chunk_n` and not end until `chunk_n+4`.
@thiagomacieira I realize I'm late to the party (18 months), but do you still have any (potentially stale) branch for doing chunked reading of a stream? (Yeah I did not say, only reading relevant for me...)
@thiagomacieira also, was `cbor_value_reparse` a typo? I cannot find a method with that name anywhere in the source code, nor in the git history (`git log -S "cbor_value_reparse"` turns up empty on the master branch).
Hmm, does PR #67 relate to this? Or is it only for strings (text), and not for top-level CBOR major types such as Array/Map ("object")?
Sorry for the confusing questions; I'm fairly new to CBOR and entirely new to this great project, and I'm no expert at C...
Hello @Sajjon
The dev branch here and in thiagomacieira/tinycbor should have this API working. I'd welcome feedback on its suitability for small systems; the biggest roadblock I have had in publishing the API is making sure it's good for them. In particular, Zephyr has a chunked buffer of 256-byte slices, and it would be really nice if TinyCBOR could just work with those.
`cbor_value_reparse` is not a typo, but this function only exists in the master branch. It's what you must call after you've fed the buffer.
Do note one important detail: you cannot split integers across buffers. So your rebuffering code must deal with up to 8 bytes that could not be interpreted from one chunk and must be copied to the next buffer before processing can resume.
Hi @thiagomacieira, I can't find `cbor_value_reparse` in this repository?
Uh... looks like it's only in my experimental branch. See here: https://github.com/thiagomacieira/tinycbor/tree/dev
Thanks!
@thiagomacieira I have some feedback regarding the suitability of `cbor_parser_init_reader` for small systems. The "simplereader" example parses strings using `cbor_value_dup_byte_string`, which in turn uses `malloc` to deal with unknown-length strings. I prefer to use a combination of `cbor_value_calculate_string_length` and `cbor_value_copy_byte_string` to ensure the string will fit into a statically-allocated buffer.
I have not found a way to use `cbor_value_calculate_string_length` with the incremental parsing API. In the non-incremental scenario, `iterate_string_chunks` modifies the parsing position on a throw-away "tmp" structure, allowing subsequent API calls to start at the beginning of the string again. In the incremental scenario, the user-supplied `transfer_string` has no way to know whether it is temporarily traversing the string or supporting a copying operation. Perhaps `CborParserOperations` could include a function for reversing a given number of bytes, to mimic the effect of discarding "tmp"?
So, an update from my end… I'm starting to have a look at this. The starting point is the reader interface. I've begun by documenting what I understand of the interface; my work is here -- if I did misunderstand something in my hand static-code analysis, please sing out!
My thinking is that to implement the "chunked" read, we start with a buffer of a fixed size. On the first read, we fill the buffer and read values out of it as normal. When the read pointer gets to the half-way point in this fixed buffer (i.e. all data prior has been read), we `memcpy` the second half of the buffer over the first, move the pointer back to the start of the buffer, then copy new data into the second half.
That way, provided a value doesn't span more than two blocks, we should be able to read each value in its entirety. Tricky for networked streams, but for reading files off off-chip storage (e.g. SPIFFS on SPI flash in my case) it should work reasonably well. It looks like to do this, I must implement a `struct CborParserOperations` object with functions that define how to do this, and the `token` I pass to `cbor_parser_init_reader` is in fact a `struct` of my choosing that contains the reader context, which will be round-tripped to the methods of the `struct CborParserOperations` object.
Now, the snag I see with this is the `transfer_string` method; this returns a pointer to the data stored in our buffer, a pointer that may shift backward by `BUFFER_LEN/2` bytes if we decide to read in a new block. I can see this pointer is directly returned by `cbor_value_get_byte_string_chunk`, so knowing when it is safe to move on is the sticking point. The moment I do that `memcpy` and pointer decrement, the pointer returned by `transfer_string` is immediately invalidated.
I wonder if the interface shouldn't have some sort of flag that says `transfer_string` pointers are only considered valid immediately after being called, and may become invalidated by any further `read_bytes`/`advance_bytes`/`transfer_string` operations? Or is that in fact the assumption: that once you receive such a pointer, you should immediately copy the data out before the pointer becomes stale?
(Edit: I should read others' posts more closely… I see @atomsmith more or less asked the same thing.)
With regard to @atomsmith's observations… I can see two places where this problematic "duplication" of `CborValue` is done:
- https://github.com/widesky/tinycbor/blob/fa8e84e58e0d22e93190fe74ca688d7470bd719e/src/cborparser.c#L1175
- https://github.com/widesky/tinycbor/blob/fa8e84e58e0d22e93190fe74ca688d7470bd719e/src/cborparser.c#L1214
Again, where the `next` pointer is set to `&tmp`, we can be safe that it is indeed throw-away… but if `next` is given by the user, then we have a problem similar to that mentioned in my last comment: we don't know how long this pointer is to be considered valid.
By the looks of things, there'll not be more than two `CborValue` objects operating on a buffer at the same time, but that's an educated guess. My thinking is that the `*next = *value` construct should be replaced by a function that does the duplication, which maybe is a further method in `struct CborParserOperations`, perhaps with a separate function that "frees" the duplicate.
Having given this more thought: at the moment our problem is centred around the fact that the reader context (`token`) is a property of the `CborValue`, which may be cloned, but basically references context that is common to all `CborValue`s; fiddle with one, and you mess with the context of all the others.
The parser operations are referenced by the CborParser
, but nowhere is there a place to reference a parser-global context for a given reader. To my thinking, we need two context areas:
- a parser-global context which stores things like a pointer to the buffer storing (part or all) of the CBOR document being iterated over
- a cursor-local context which stores where a specific instance of `CborValue` is pointed at.
In the end, I think for the cost of an extra `sizeof(void*)` bytes (4 bytes on most 32-bit platforms), we move `ops` out of the `union` inside `CborParser` so that it's a separate member, and in its place just put a generic `void *` pointer called `ctx` (context), which may be used however the reader code sees fit. Then in `CborValue`, we can recycle the `token` pointer any way we choose.
I re-worked the unit test case so that the reader context pointed directly at the `QByteArray` being used, and the `token` was simply the unsigned integer byte position, cast to a `void *`. The reader operations are tweaked so they receive a pointer to the `CborValue`, and thus are able to manipulate its `token` pointer arbitrarily. I guess if we were paranoid, I could instead pass pointers to the `CborParser`'s `ctx` pointer and the `CborValue`'s `token`, so that a user-supplied function could modify those without manipulating anything else, but that would require passing two arguments to those operation functions instead of one.
From the writing end… things seem to be straightforward enough, but I'm a bit confused about the `CborEncoderAppendType`… this seems to imply the behaviour should change depending on whether we're appending a fixed-length CBOR object or a byte/text string. This difference in behaviour is not reflected in the default implementation (which just reads/writes to a byte array).
Things seem to be working for both read and write on my devices… reading a file off SPIFFS and "streaming" it through a small buffer. If I avoid using dangerous `tinycbor` functions like `cbor_value_get_byte_string_chunk`, or using the `next` argument provided by some iteration functions, things work well enough with my changes.
Question is, since I based off https://github.com/thiagomacieira/tinycbor/tree/dev do I submit the pull request there, or on this project?
Hello
Sorry for the delay, I was unavailable last week.
> From the writing end… things seem to be straightforward enough, but I'm a bit confused about the `CborEncoderAppendType`… this seems to imply the behaviour should change depending on whether we're appending a fixed-length CBOR object or a byte/text string.
That's what is in the design, but looking at my own code, that enumeration is unused. I don't have notes on why I added it, but for some reason a few years ago I thought it was important to let the handler know whether the data being appended was part of the CBOR structured data or the free-form string content. When you add a string to CBOR, the callback function is called twice: once with `CborEncoderAppendCborData` with the string's byte descriptor and length, and once with `CborEncoderAppendStringData`.
> Question is, since I based off https://github.com/thiagomacieira/tinycbor/tree/dev do I submit the pull request there, or on this project?
What I should do is import everything from there into the dev branch here, create the new "main" branch from it, and delete the "master" branch; so your PRs should go here, not to the fork where I make my own changes. Let me see if I can get a week of dedicated time for TinyCBOR so we can make progress towards the 0.6 release.
Meanwhile, thanks for the PRs in that branch/fork. I can pick them up and try with Qt, to see if it breaks anything or causes performance regressions.
Ok, I've updated the main and dev branches (they're now in sync) with 0.6. The last release of 0.5 is done.
Can I ask you to retarget this to the dev branch, for an upcoming 0.7 release?
Sure, I'll have a look. :-)
New pull request is #208