[bug] Creating new parity files: "panic: too many shards"
For larger files (the threshold is somewhere between 9.2 MB and 77 MB) I consistently get this error when I try to create parity. Looking at memory usage, all files (one is 2.7 GB) seem to be loaded in full. The error seems to come right after loading:
[1/1] Loaded data file "bsc.tar.zst" (578352090 bytes)
panic: too many data shards
goroutine 1 [running]:
main.main()
That error happens when the number of data shards is >256, which is a par2 limitation. I suspect it has to do with the default block size not being intelligently picked, but fixed at 2000. Can you try with par2 c -s <n> ..., where n is larger than 2000 (but a multiple of 4), and see if that fixes it?
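For scale, plugging in the file size from your log (a quick back-of-the-envelope check):

file_size = 578352090                  # from the "Loaded data file" line above
block_size = 2000                      # the fixed default
shards = -(-file_size // block_size)   # ceiling division -> 289177 data shards
# That's orders of magnitude over the data-shard limit, whichever exact value applies.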
It also happens when -s and -c are not specified. Maybe a better default would be nice?
I'm cracking my head a second time over how par2cmdline calculates blocksize and blockcount from a redundancy level (percentage of protection). I believe the blockcount is always 2000 if you specify a redundancy:
https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1099
On the other hand, there's a recoveryblockcount being calculated; is that the one I should set for -c?
https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1223
In summary, it is this:

for file in files:
    filesize = sizeof(file)
    sourceblockcount += (filesize + blocksize - 1) / blocksize
recoveryblockcount = (sourceblockcount * redundancy + 50) / 100
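For a single file, my reading is that this collapses to (with Python integer arithmetic):

sourceblockcount = (filesize + blocksize - 1) // blocksize        # ceil(filesize / blocksize)
recoveryblockcount = (sourceblockcount * redundancy + 50) // 100  # redundancy %, rounded to the nearest block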
The remaining question is how to get the blocksize. Assuming blockcount = 2000, I think we're almost always in this block:
https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1151
Unfortunately I'm struggling to follow the code there.
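My best guess at what that block is converging on, written out as a Python sketch (this is my interpretation only, not a literal translation of the C++, so take it with a grain of salt):

def guess_blocksize(filesizes, target_blockcount=2000):
    # Guess: find the smallest multiple-of-4 block size for which the per-file
    # block counts sum to at most target_blockcount.
    # (Assumes len(filesizes) <= target_blockcount, otherwise no block size works.)
    total = sum(filesizes)
    blocksize = max(4, ((total // target_blockcount) // 4 + 1) * 4)
    while sum((size + blocksize - 1) // blocksize for size in filesizes) > target_blockcount:
        blocksize += 4
    return blocksize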
Do you have any idea for a heuristic for -s and -c based on the number of files and the filesize (I always use per-file parity, so the only variable is filesize)?
This seems pretty robust and does what I think it should:
import os  # needed for os.path.getsize

def getblocksizecount(self, filename):
    f_size = os.path.getsize(filename)
    blocksize_min = f_size // 2**15  # size can never be below this (par2 allows at most 2**15 data blocks)
    blocksize_f = (f_size * self.percentage) // 100  # total amount of recovery data we want
    blockcount_max = 2**7 - 1  # some logic to keep blockcount and overhead for small files under control
    if f_size < 1e6:
        blockcount_max = 2**3 - 1
    elif f_size < 4e6:
        blockcount_max = 2**4 - 1
    elif f_size < 20e6:
        blockcount_max = 2**5 - 1
    if blocksize_f > blocksize_min:
        try:
            blockcount = min(blockcount_max, blocksize_f // blocksize_min)
            blocksize = blocksize_f / blockcount
        except ZeroDivisionError:  # blocksize_min is 0 for files under 2**15 bytes
            blockcount = 1
            blocksize = blocksize_min
    else:
        blockcount = 1
        blocksize = 4
    blocksize = (blocksize // 4 + 1) * 4  # make it a multiple of 4
    return int(blocksize), int(blockcount)
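To make the numbers concrete (purely as an illustration, assuming self.percentage == 5, i.e. 5% redundancy), the 578 MB file from the log works out to:

# f_size        = 578352090
# blocksize_min = 578352090 // 2**15            = 17649
# blocksize_f   = 578352090 * 5 // 100          = 28917604
# blockcount    = min(127, 28917604 // 17649)   = 127
# blocksize     = 28917604 / 127 ~ 227698  ->  rounded up to a multiple of 4: 227700
# i.e. roughly -s 227700 -c 127, which keeps the data-shard count
# (578352090 / 227700 ~ 2540) comfortably under the 2**15 cap used above.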
I'll keep this open as a reminder to calculate the parameters a bit more intelligently. The snippet you posted looks plausible; I assume you're gonna calculate that in your external app and pass it in.
(Also, I misspoke above: the shard limit for par2 is 65536, not 256 (which is par1).)
OK, good idea. Indeed, this is what I calculate and pass in. Made a small modification to handle very small files.
The shard limit I found in par2cmdline is 2**15 (~32k), not 65536. I tested this, and gopar showed a threshold there too. Hence the 2**15 in the snippet.
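That would also line up with the file-size threshold from the top of this issue: with the fixed 2000-byte default block size, 2**15 data shards corresponds to about 2000 * 2**15 = 65,536,000 bytes (~65.5 MB), which falls inside the 9.2-77 MB window where the panic starts, whereas a 65536-shard limit would put the threshold at roughly 131 MB.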
Ah, yes you're right! Forgot there was a smaller limit for data shards.
A nicer place for the snippet would be in gopar's own flags, of course, but I didn't do that because I felt that having different logic from par2cmdline for the -r flag could be confusing. On the other hand, maybe that's taking legacy compatibility a bit too far. What's your opinion on that?
Yeah, I don't think there's any real need to implement par2cmdline's computation exactly -- in fact, it seems pretty ad hoc, and if I think about it for a bit I can probably come up with a more systematic way.
The calculation above is only for single files, right? In general, par2 would have to handle multiple files, which might change things a bit.
Correct, this is only for single files, and therefore not ready for inclusion. I think par2cmdline takes the largest file as a basis for a first blockcount estimate, but then there's a loop that converges on something; I'm not sure on what, or what the goal was there.
I only work with single-file parity (that's the whole idea of par2deep).