tomerfiliba-org/reedsolomon

Predicting RS encoded bytes size

Cioscos opened this issue · 3 comments

I need for the sake of the function of my software to predict the size of the encoded block of data generated by RSCodec.encode().

This is an extract of my code:

while start_idx < file_size:
      end_idx = min(start_idx + max_data_size - len(header_bytes), file_size)
      data_chunk = file_bytes[start_idx:end_idx]

      # Create header with size information for each chunk
      header_with_size = header_bytes + data_chunk

      # Add padding to make the size equal to max_data_size
      if len(header_with_size) < max_data_size:
          padding_length = max_data_size - len(header_with_size)
          header_with_size += bytes([0] * padding_length)

      # Encode file data with Reed-Solomon error correction
      encoded_data = rs.encode(header_with_size)

      # Append encoded data to the buffer
      data_buffer.extend(encoded_data)

      start_idx = end_idx

As you can see I've a if statement that add some padding to the header_with_size bytearray if its size is littler than max_data_size.
If this if is triggered it makes header_with_size 1024 bytes long. After the rs.encode function, the encoded_data becomes 1074 bytes long.
If I comment out the padding if statement, I had cases where header_with_size's was of 300 bytes and the encoded_data's size was 310 which should be correct because I've initialized the RSCodec in this way "rs = RSCodec(ecc_bytes)" where ecc_bytes is 10.
So I thought that the encoded_data size is always 10 byte longer than the original size but it doesn't seem to be like this since when the bytearray is 1024 bytes long the encoding function adds 50 bytes, and I would like to understand if it's possible to predict this behaviour.

Thank you in advance

I've created this plot to understand the trend of the difference between input and output size when the input grows up.

image

The tests confirm that the implementation has this behavior which is kinda unpredictable if you don't test before.

Thank you for your feedback and your question. This is almost certainly due to the automatic chunking that is done in the RSCodec.encode() method. Although I agree it would be better if it could have a more predictable behavior, the chunking is dynamically done based on n-k (size of the input message minus the ecc bytes).

Instead of trying to predict when the chunking will happen, I would rather suggest that you forego using the RSCodec object which always was intended as an easier facade interface to use the Reed-Solomon algorithm. Instead you can use the low-level interface, so that there is no chunking, you can fully control the whole encoding process.

Anyway, if you can provide me with a self-contained minimalistic reproducible example code, then I can investigate this issue further and try to improve the reproducibility of RSCodec chunking in the future.

Thank you for your feedback and your question. This is almost certainly due to the automatic chunking that is done in the RSCodec.encode() method. Although I agree it would be better if it could have a more predictable behavior, the chunking is dynamically done based on n-k (size of the input message minus the ecc bytes).

Instead of trying to predict when the chunking will happen, I would rather suggest that you forego using the RSCodec object which always was intended as an easier facade interface to use the Reed-Solomon algorithm. Instead you can use the low-level interface, so that there is no chunking, you can fully control the whole encoding process.

Anyway, if you can provide me with a self-contained minimalistic reproducible example code, then I can investigate this issue further and try to improve the reproducibility of RSCodec chunking in the future.

Hi! Thank you very much for your answer!
So I can't send my original code since it works with files so I've created a gist with the code used to generate the plot.

About the "low-level interface" I would like to use it but I'm not enough expert to understand how it work since I don't know how the reed-solomon error correction codes works. Can you hep me with a brief example for my case?