pangeo-data/rechunker

large number of small file size

apatlpo opened this issue · 18 comments

usage question I guess

I'm working on an HPC system with a GPFS filesystem and I've been told a couple of times by sys admins (ping @guillaumeeb for confirmation) that I should not produce (very) large number of small filesizes (typically <10Mb).
This potentially puts a lower bound on the intermediate file size.

With rechunker, the knobs that control the intermediate file size are the memory per worker and the target chunk size.
The former is limited by the size of your node (say 100GB), it leads to an upper bound for the intermediate fiile size

Does it mean that in some situations you may have no other choices than apply rechunker in multiple passes in order to complete a rechunk that is too ambitious given the constraints above ?

This is one of the downsides to the way rechunker and Zarr currently work. If you can think of a workaround, I'd love to know about it.

What would be great (as far as I'm concerned) is if rechunker was able to detect such situation and directly propose a multipass approach.
If you think this would be a valuable feature (difficult to know if this happens often), we can discuss this some more and I can lead PR
What do people think?

How would you detect this situation? What criterion would you use?

the intermediate chunk size can probably be related to an equivalent file size which needs to be above the minimum file size threshold

I actually did that manually by adjusting the target chunk size iteratively

But don't you care about the total number of intermediate files? AFAIK, that is the thing that stresses the filesystem, not the chunk size per se.

I had the feeling that as long as the files were large enough the filesystem was fine, no matter the number of files.
But maybe I don't understand how filesystems are working well enough (or at all :) ).
@guillaumeeb any thoughts ? (he's the one complaining when I produce small files on CNES cluster ...)

I doubt they would have complained if you created one single 1kb file. So "this situation" is some function of file size and number of files. I'd like to understand that function better before we start brainstorming solutions.

Ok, let me try to explain. We discussed this a bit in https://discourse.pangeo.io/t/best-practices-to-go-from-1000s-of-netcdf-files-to-analyses-on-a-hpc-cluster/588/13?u=geynard.

At CNES, we have a GPFS file system. It can handle 8,5 PiB of data, but has also a limited amount of metadata (inodes) it can record. I'm not sure of the number here, but we've converge to fix a limit on the mean file size of any given space on our storage: 1MiB. This doesn't mean that you can't create smaller files, even lots of them, but just that for each project space, you'll get an upper limit on inodes and thus files and directories number (which can be high if you're allotted space is too). Be aware that some other computing centers are much more conservative about inodes limits.

This GPFS file system has also a given bandwidth, up to 50GiB/s for us (shared between all nodes and users), but more importantly a limit on the number of IOps (Input Output operation per second, so file creation, random writes...) it can perform. This limit depends on the nature of the operation and a lot of other parameters, and is very hard to measure. But this is the one we reach the most (and our FS is really optimized for this) : a single user with a few hundred cores can slow down the entire FS with bad IOs. Two principal examples of bad IOs:

  • Using FS as a database, or having a parallel job iterating on some grid and randomly reading values into a netCDF collection, Byte by byte.
  • Generating a lot (say millions) of small files (<1MiB), the limitation won't be bandwidth, but IOps.

So I hope this clarifies things, and in the end you must be careful about two things:

  • The upper limit of inodes you can create in a given space.
  • the minimum chunk size you have to use in order to avoid overloading a FS with millions of IOps. At CNES, I'll say that targeting at least 10MiB is a good thing, but in some case you could do with smaller chunks, 1MiB being hard limit.

To go deeper, if I have to translate that in pseudo code:

if len(chunks) > some_high_limit
  do bigger chunks
if len(chunks) > 100_000 and sizeof(chunks) < 10MiB
  do bigger chunks
if sizeof(chunks) < 1MiB
  do bigger chunks

Of course, all these limits will change from HPCs to HPCs, but I'll bet no admin would enjoy lots of files of less than a MiB.

Very interesting. Will digest for a bit.

The tradeoffs remind me a lot of this paper: Predicting and Comparing the Performance of Array Management Libraries, which compares HDF5 and Zarr.

Note also that object store behave quite differently as you very well know @rabernat. For those, there is no metadata table, no inodes, the only pain point is write latency. This means you'll only want to have chunks big enough to avoid stacking up write or read calls because of the tens of milliseconds latency of each one.

To be honest, we really designed rechunker with object storage in mind, even though there are probably more users on HPC. Can you look at the algorithm and think of some strategy we could use to mitigate this inode problem? Zarr doesn't support consolidating multiple chunks into one file, although this has been discussed (can't find the github isssue now). An alternative would be to try using TileDB as an intermediate storage format. TileDB should support concurrent writes to the same chunk, allowing us to use bigger chunks. To do that, we would need a PR to rechunker to implement a different array backend.

@apatlpo - it would be good to get the exact size of the input chunks, intermediate chunks, and target chunks for your use case. How many files / what size are we talking about? Is this LLC4320?

To be clear, object storage also has fixed overhead per object. It's less of an issue than in distributed filesystems, but you will still run into performance issues if you store lots of files smaller than 1 MB.

This sounds like a reason to revisit #36, since distributed execution engines like Spark and Dataflow are designed to handle lots spilling outs of small intermediate outputs to disk.

Within Zarr, one way to solve this would be to make an intermediate "storage" object that maps multiple chunks to different parts of the same underlying files.

I'm curious if #36 would help @apatlpo. Can they run Beam on their HPC system?

In theory Beam can run on top of Spark, but probably not quite as smoothly as on Cloud Dataflow. (I haven't tried it)

Can you look at the algorithm and think of some strategy we could use to mitigate this inode problem?

I took a quick look at it, the most simple thing at first I cant think of is raising a warning if the intermediate chunks are smaller in size than a given limit or there are too many of them. The advice would be to do an intermediate chunking, or to use consolidate reads and writes?

In a second time, we could try to find another algorithm to determine the intermediate chunk size better than using the minimum of each axis. E.g. if shape is (4,1) in input and (1,4) in output, see if we could use (2,2) in intermediate if we have enough memory. I think this is what consolidate_chunk() is doing, but maybe not on intermediate chunks? And consolidate_writes as I understand, change the requested target_chunks size?

To be clear, object storage also has fixed overhead per object. It's less of an issue than in distributed filesystems, but you will still run into performance issues if you store lots of files smaller than 1 MB.

Yeah totally! Less impact on the overall service (and so other users), but performance would be affected also!

Spark and Dataflow are designed to handle lots spilling outs of small intermediate outputs to disk.

Spark could use HPC node local storage to spill to disk intermediate chunks, thus not affecting shared FS. Maybe it could also consolidate intermediate chunks on shared FS into bigger files using internal format, not sure.

I'm curious if #36 would help @apatlpo. Can they run Beam on their HPC system?
In theory Beam can run on top of Spark, but probably not quite as smoothly as on Cloud Dataflow. (I haven't tried it)

We can run Spark on CNES HPC system, it's a bit more complicated than Dask (no dask-jobqueue equivalent, so we do it by hand). Never tried Beam though.

@rabernat Yep, original data is llc4320 with the shapes you know by heart I'm sure.
Here is a summary a full transposition of one variable (1 face considered for convenience)

{'time': 8784, 'face': 1, 'i_g': 72, 'j': 24}
Individual chunk size = 15.2 MB
Source data size: 		 8784x4320x4320 	 655.7GB
Source chunk size: 		 1x4320x4320 		 74.6MB
Source number of files: 		 8784
Intermediate chunk size: 	 1x192x4320 		 3.3MB
Intermediate number of files: 		 197640
Target chunk size: 		 8784x24x72 		 60.7MB
Target number of files: 		 10800

The above approach spills about 200 000 intermediate files that are between 1 and 3MB which is a bit lower than what I was told to use on CNES cluster.
max_mem is 30GB.

So I do only a partial transposition instead, which looks like:

{'time': 2196, 'face': 1, 'i_g': 144, 'j': 48}
Individual chunk size = 15.2 MB
Source data size: 		 8784x4320x4320 	 655.7GB
Source chunk size: 		 1x4320x4320 		 74.6MB
Source number of files: 		8784
Intermediate chunk size: 	 1x768x4320 		 13.3MB
Intermediate number of files: 		 49410
Target chunk size: 		 2196x48x144 		 60.7MB
Target number of files: 		10800

i.e. about 50 000 files that are between 5 and 10 MB

So not a blocking point at all in my case as you can guess.
But this potential issue may deserve to be mentioned in the doc.

This is one of the downsides to the way rechunker and Zarr currently work. If you can think of a workaround, I'd love to know about it.

I'm also using Pangeo and rechunker mainly on HPC resources. This may be an ignorant or misguided comment but would making use of the zarr ZipStore option help here? I don't know what penalty you pay in performance zipping/unzipping but just doing an rm -rf my_file.zarr on an HPC based 2TB zarr file in DirectoryStore format with 20,000 chunks isn't exactly fast, even on a BeeGFS parallel file system.

< I better get back to attempting to rechunk this 2TB dataset >
=/