akalin/gopar

[bug] Creating new parity files: "panic: too many shards"


For larger files (the threshold is somewhere between 9.2 MB and 77 MB) I consistently get this error when I try to create parity. Looking at memory usage, all files (one is 2.7 GB) seem to be loaded in full. The error seems to come right after loading:

[1/1] Loaded data file "bsc.tar.zst" (578352090 bytes)
panic: too many data shards

goroutine 1 [running]:
main.main()

That error occurs when the number of data shards is >256, which is a par2 limitation. I suspect it has to do with the default block size not being picked intelligently, but instead being fixed at 2000. Can you try with par2 c -s <n> ... where n is larger than 2000 (but a multiple of 4) and see if that fixes it?
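To put a number on it, a quick check (in Python) with the file size from the log above shows how far over the limit the fixed block size lands:

filesize = 578352090                              # from the log above
blocksize = 2000                                  # the fixed default
datashards = (filesize + blocksize - 1) // blocksize
print(datashards)                                 # 289177, far above 256 (and above the larger limits mentioned later in the thread)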

It also happens when -s and -c are not specified. Maybe a better default would be nice?

I'm cracking my head a second time over how par2cmdline calculates blocksize and blockcount from a redundancy level (percentage of protection). I believe the blockcount is always 2000 if you specify a redundancy:

https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1099

On the other hand, there's a recoveryblockcount being calculated; is that the one I should set for -c?

https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1223

In summary it is this:

sourceblockcount = 0
for file in files:
    filesize = sizeof(file)
    sourceblockcount += (filesize + blocksize - 1) // blocksize
recoveryblockcount = (sourceblockcount * redundancy + 50) // 100
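For concreteness, plugging in the file from the log above with a hypothetical blocksize of 17652 (a multiple of 4 that keeps this one file under 2**15 blocks) and 5% redundancy:

filesize = 578352090
blocksize = 17652                                                   # hypothetical, multiple of 4
redundancy = 5                                                      # percent
sourceblockcount = (filesize + blocksize - 1) // blocksize          # 32765, just under 2**15
recoveryblockcount = (sourceblockcount * redundancy + 50) // 100    # 1638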

The remaining question is how to get the blocksize. Assuming blockcount = 2000, I think we're almost always in this block:

https://github.com/brenthuisman/libpar2/blob/master/src/commandline.cpp#L1151

Unfortunately I'm struggling to follow the code there.

Do you have any idea for a heuristic for -s and -c based on the number of files and filesize (I always use per-file parity, so the only variable is filesize)?

This seems pretty robust and does what I think it should:

	# method of the par2deep class (uses self.percentage, the redundancy in percent); needs "import os"
	def getblocksizecount(self, filename):
		f_size = os.path.getsize(filename)
		blocksize_min = f_size // 2**15  # par2 allows at most 2**15 data shards, so the size can never be below this
		blocksize_f = (f_size * self.percentage) // 100  # total amount of recovery data requested
		blockcount_max = 2**7 - 1  # default cap; lowered below to keep blockcount and overhead for small files under control
		if f_size < 1e6:
			blockcount_max = 2**3 - 1
		elif f_size < 4e6:
			blockcount_max = 2**4 - 1
		elif f_size < 20e6:
			blockcount_max = 2**5 - 1
		if blocksize_f > blocksize_min:
			try:
				blockcount = min(blockcount_max, blocksize_f // blocksize_min)
				blocksize = blocksize_f / blockcount
			except ZeroDivisionError:  # blocksize_min is 0 for files under 2**15 bytes
				blockcount = 1
				blocksize = blocksize_min
		else:  # requested recovery data fits in a single small block
			blockcount = 1
			blocksize = 4
		blocksize = (blocksize // 4 + 1) * 4  # round up to the next multiple of 4
		return int(blocksize), int(blockcount)
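As a rough sanity check of what this returns for the file from the log, here is the same arithmetic re-derived standalone, assuming a 5% setting (the small-file branches don't apply at this size):

f_size, percentage = 578352090, 5
blocksize_min = f_size // 2**15                            # 17649
blocksize_f = (f_size * percentage) // 100                 # 28917604 bytes of recovery data
blockcount = min(2**7 - 1, blocksize_f // blocksize_min)   # min(127, 1638) -> 127
blocksize = (blocksize_f / blockcount // 4 + 1) * 4        # ~227700, multiple of 4
print(int(blocksize), blockcount)                          # 227700 127, i.e. "-s 227700 -c 127"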

I'll keep this open as a reminder to calculate the parameters a bit more intelligently. The snippet you posted looks plausible; I assume you're gonna calculate that in your external app and pass it in.

(Also, I misspoke above: the shard limit for par2 is 65536, not 256; 256 is the par1 limit.)

OK, good idea. Indeed, this is what I calculate and pass in. Made a small modification to handle very small files.

The shard limit I found in par2cmdline is 2**15 (~32k), not 65536. I tested this, and gopar also showed a threshold there. Hence the 2**15 in the snippet.
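Going by that 2**15 limit, the smallest viable -s for a given file size follows directly (a sketch; the rounding to a multiple of 4 matches the requirement mentioned earlier):

import math

def min_blocksize(f_size, max_shards=2**15):
	# smallest block size keeping ceil(f_size / blocksize) <= max_shards,
	# rounded up to a multiple of 4
	blocksize = math.ceil(f_size / max_shards)
	return ((blocksize + 3) // 4) * 4

print(min_blocksize(578352090))   # 17652 for the file from the log above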

Ah, yes you're right! Forgot there was a smaller limit for data shards.

A nicer place for the snippet would of course be in gopar's own flags, but I didn't do that because I felt that having different logic from par2cmdline for the -r flag could be confusing. On the other hand, maybe that's taking legacy compatibility a bit too far. What's your opinion on that?

Yeah, I don't think there's any real need to implement par2cmdline's computation exactly -- in fact, it seems pretty ad hoc, and if I think about it for a bit I can probably come up with a more systematic way.

The calculation above is only for single files, right? In general, par2 would have to handle multiple files, which might change things a bit.

Correct, this is only for single files, and therefore not ready for inclusion. I think par2cmdline takes the largest file as the basis for a first blockcount estimate, but then there's a loop that converges on something; I'm not sure what, or what the goal was there.

I only work with single file parity (that's the whole idea of par2deep).
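For the multi-file case raised above, one possible generalization of the same approach (a rough sketch only; this is neither par2cmdline's actual algorithm nor part of par2deep): pick the block size from the combined size so the summed source block count stays under 2**15, then reuse the recoveryblockcount formula from the earlier summary:

import math
import os

def getblocksizecount_multi(filenames, percentage, max_shards=2**15):
	sizes = [os.path.getsize(f) for f in filenames]
	total = sum(sizes)
	# leave one block of slack per file, since each file's block count rounds up separately
	blocksize = math.ceil(total / max(1, max_shards - len(sizes)))
	blocksize = max(4, ((blocksize + 3) // 4) * 4)   # multiple of 4, at least 4
	sourceblockcount = sum((s + blocksize - 1) // blocksize for s in sizes)
	recoveryblockcount = max(1, (sourceblockcount * percentage + 50) // 100)
	return blocksize, recoveryblockcount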