Question regarding the API
liamsi opened this issue · 4 comments
I'm writing a Go wrapper for this implementation; BTW, the C wrapper makes that really convenient, thanks for that!
I struggle to completely understand the API, though. What I'm trying to do is quite basic: given n original data buffers or symbols, I want to generate n "parity symbols" such that I can lose any n (of the now 2*n total) symbols and still recover the data.
I mainly experimented with the API in benchmarks.cpp. OK, sounds like I would need to call leo_encode with original_count == recovery_count for that, right? leo_encode_work_count returns 2*original_count for these params, and calling leo_encode yields encode_work_data of size 2*original_count, as expected. So far so good. What is surprising, though: I would have expected encode_work_data to contain the original data too; instead it seems to contain 2*original_count recovery or parity symbols. So is there no way to achieve what I described above using the public C API, or am I using recovery_count wrong?
Thanks in advance :-)
Looking through the paper that describes the encoding, and at ReedSolomonEncode, it seems like the code reads mostly like a straightforward implementation of the encoding algorithm given there (source: https://www.citi.sinica.edu.tw/papers/whc/3450-F.pdf).
So with that notation, n = 2k, and I want to hand over k message symbols and get k parity symbols back.
What would be the correct parameters for that? As described above, leo_encode returns 2k symbols instead (which do not clearly contain the original symbols, at least not in unencoded form).
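Put differently, here is how I would expect the paper's notation to map onto the API parameters (my reading, so possibly wrong):

```c
#include <stdio.h>
#include "leopard.h"

// Paper: n = 2k total symbols = k message symbols + k parity symbols.
// My expected mapping onto the C API:
//   original_count = k  (message symbols handed in)
//   recovery_count = k  (parity symbols wanted back)
void show_work_count(void)
{
    const unsigned k = 8;
    // I would have expected k work buffers holding exactly the k parity
    // symbols, but:
    printf("%u\n", leo_encode_work_count(k, k)); // prints 16, i.e. 2*k
}
```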
I think we (or mainly @musalbas) figured it out: if the number k is <= 256, leopard behaves exactly as described above. If k > 256, the recovery bytes will be 2k instead, which seems to be a property of ff16 ...
Well, actually this is not the case (it is independent of ff8 vs. ff16): leo_encode_work_count always multiplies the next power of two of the recovery count by 2:
Line 102 in b58d1ea
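Paraphrasing what that line effectively computes (this is my own rewrite, not the verbatim source; next_pow2 stands in for the library's internal helper):

```c
// My paraphrase of leo_encode_work_count(), as I read the referenced line.
static unsigned next_pow2(unsigned x)
{
    unsigned p = 1;
    while (p < x)
        p <<= 1;
    return p;
}

unsigned encode_work_count(unsigned original_count, unsigned recovery_count)
{
    (void)original_count; // doesn't factor into the result here
    const unsigned m = next_pow2(recovery_count);
    return m * 2; // always doubled, even if recovery_count is a power of two
}
```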
So e.g. in the case of orig_count == recovery_count == some power of two, this means that half the arrays in encode_work won't be used (the 2nd half will always be just buffer_bytes-sized arrays that are all zeroes). It seems superfluous to always double the data here. Is there any particular reason for this? leo_encode also requires this behaviour, although there are edge cases (e.g. work_count < 2*recovery_count && work_count == m) where work_count wouldn't need to be 2*m:
Lines 162 to 166 in b58d1ea
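If I read those lines right, they boil down to a hard requirement like this (next_pow2 as in the sketch above; the exact error constant is from memory and may differ):

```c
// My paraphrase of the check that leo_encode() performs on work_count.
LeopardResult check_work_count(unsigned recovery_count, unsigned work_count)
{
    const unsigned m = next_pow2(recovery_count);
    if (work_count != m * 2)
        return Leopard_InvalidCounts; // rejected even in the edge cases above
    return Leopard_Success;
}
```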
The workaround for my use case was dropping the superfluous chunks (the 2nd half) on encode and decode.
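Concretely, the encode-side workaround looks roughly like this (keep_parity is a made-up helper name; that the real parity sits in the first recovery_count work buffers is based on my observation above, not on anything the library documents):

```c
#include <stdlib.h>

// Keep only the first recovery_count work buffers as the actual parity
// symbols and free the all-zero second half.
static void keep_parity(void** work_data, unsigned work_count,
                        unsigned recovery_count, void** parity_out)
{
    for (unsigned i = 0; i < recovery_count; ++i)
        parity_out[i] = work_data[i]; // first half: real parity data

    for (unsigned i = recovery_count; i < work_count; ++i)
        free(work_data[i]); // second half: unused, all-zero buffers
}
```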
Sorry, just noticed your work here. I hope you were able to get through the issue.