documented chunk size is too large
joeyh opened this issue · 2 comments
It's not really a good idea to set the chunk size to 1 GB, because git-annex currently has to buffer a whole chunk of a file in memory. So, that could make git-annex use 1 GB of memory or more.
http://git-annex.branchable.com/chunking/ documents this, and suggests something in the 1 MB range for chunks. That's partly due to memory use and partly because smaller chunks minimize the amount of redundant data transferred when resuming an upload/download of chunks.
If rclone supports resuming partial uploads and downloads of large files, it might make sense to pick a larger chunk size, since the latter concern wouldn't matter. The memory usage would still make 1 GB too large for chunks.
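For concreteness, here's a rough sketch of where that chunk size gets chosen in the first place - the remote name, target, and exact parameter list below are illustrative and should be checked against the git-annex-remote-rclone README:

```sh
# Hypothetical setup of an rclone special remote; the chunk= parameter
# fixes the chunk size users will get, so the documented value matters.
git annex initremote myrclone type=external externaltype=rclone \
    target=mycloud prefix=git-annex chunk=1MiB encryption=shared
```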
Joey,
Thanks for git-annex and for the feedback.
I have actually run into this issue. I had to adjust a VM's memory allocation to accommodate brief periods of 3-5 GB of memory usage (during the periods when git-annex is preparing a chunk for upload). In retrospect, that's probably not a sensible configuration for most users.
There are a few issues here:
- rclone is not yet particularly efficient at copying a single file - rclone/rclone#422 - the authors are aware and are working on a fix.
- Even if rclone only required a single POST request per chunk, the RTTs needed to set up the TCP connection, plus TCP slow start, would mean a lot of time spent at less than optimal throughput.
- During a 'drop', unless the remote repo is set fully trusted, git-annex is going to want to verify the continued presence of each of the chunks. This means a few RTTs per chunk.
I think 1MiB is far too small. For an archive with large files, the number of chunks grows quickly, which becomes a significant performance hit. In my view the happy medium is probably closer to 50MiB or 100MiB - that would be 50x or 100x less per-chunk overhead (at the cost of tens or hundreds of megabytes of RAM).
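To make the request-count side of that concrete, here is a quick back-of-the-envelope comparison (the 10 GiB file size is made up purely for illustration):

```sh
# Chunks needed for a hypothetical 10 GiB file at various chunk sizes;
# each chunk costs at least one upload request plus its connection overhead.
for chunk_mib in 1 50 100 1024; do
    echo "chunk=${chunk_mib}MiB -> $(( 10 * 1024 / chunk_mib )) chunks"
done
# chunk=1MiB    -> 10240 chunks
# chunk=50MiB   ->   204 chunks
# chunk=100MiB  ->   102 chunks
# chunk=1024MiB ->    10 chunks
```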
In an ideal world I think git-annex would use a combination of variable chunk sizes and pack files to (1) hide the size of files, and (2) optimize interaction with remotes. To use a contemporary example, the Panama Papers have been described in the press as a multi-TB data set with tens or hundreds of thousands of individual files. With git-annex's current design, simply by looking at a user's chunk sizes, I think it would be relatively trivial to identify users in possession of this dataset - even if the aggregate dataset size was not a match (i.e. they also had other files in their repo).
With all that said - do you think a 50MiB documented default might be a better choice than 1MiB? Or are there use cases I'm not adequately considering?
(If rclone could be used as a library, HTTP connections could be reused to
avoid TCP slow start. That's what git-annex does for S3 and WebDAV.)
50MiB sounds like a better choice for git-annex-remote-rclone. But it
would be worth mentioning the tradeoffs or linking to
https://git-annex.branchable.com/chunking/
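One point that might be worth including alongside that link: as I understand the chunking documentation, the chunk size of an already-initialized remote can be adjusted later, along these lines (the remote name is a placeholder):

```sh
# The new size only applies to newly stored content; previously uploaded
# chunks keep the size they were stored with.
git annex enableremote myrclone chunk=50MiB
```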
I do think I could probably make git-annex not buffer the chunks in
memory in this case. Opened a todo
https://git-annex.branchable.com/todo/upload_large_chunks_without_buffering_in_memory/
(I've considered adding padding of small chunks to get all chunks the
same size; varying chunk sizes might also obscure total file size some,
but attackers could do many things to correlate related chunks and so
get a good idea of file sizes.)
see shy jo