fmihpc/vlsv

writing XML footer is very slow

galfthan opened this issue · 17 comments

In large simulations close() is quite slow; on Hornet it used 45% of the total write time (2.9 GB VLSV file, 200 nodes).

We were misled by our tests on Hornet and Voima, assuming the results would be the same. On Voima (lfs stripe count 1) footer writing takes very little time, while on Hornet (stripe count 10) it takes ages.

I tested with one OST on Hornet, and then footer writing again takes only 0.02 s out of the 80 s spent in IO. So it seems the footer does not like being written to a striped file.

Could this footer be written using MPI writing instead?

I can try to replicate this on Voima and see if I can make it faster.

I can replicate this on Voima.

I set an electric sail simulation to use 130 nodes and 2600 MPI processes: spatial grid size 480x480, velocity block bounding box 600x600x1.

  • initial state: 115 GB in 84.7 s, 1.36 GB/s data rate (stripe=8)
    • VLSV open: 0.13 s
    • VLSV close: 28.1 s
  • restart: 112.3 GB in 148.7 s, 1.39 GB/s
    • VLSV open: 0.16 s
    • VLSV close: 67.5 s

Seems like it's the resizing of the large file that takes most of the time. Footer writing can be sped up by about 50% by using MPI calls instead of the master-only fstream, but it's still not terribly fast.
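
For reference, a minimal sketch of that kind of change, assuming the shared MPI file handle is still open at close() and rank 0 holds the assembled XML footer in a std::string. The names `fileHandle`, `footer` and `footerOffset` are illustrative, not the actual vlsv API:

```cpp
#include <mpi.h>
#include <string>

// Rank 0 appends the XML footer through the already-open MPI file handle
// instead of reopening the file with a master-only fstream.
bool writeFooter(MPI_File& fileHandle, const std::string& footer,
                 MPI_Offset footerOffset, int myRank) {
   int status = MPI_SUCCESS;
   if (myRank == 0) {
      // Independent write; only the master has footer data to write.
      status = MPI_File_write_at(fileHandle, footerOffset,
                                 const_cast<char*>(footer.data()),
                                 static_cast<int>(footer.size()), MPI_BYTE,
                                 MPI_STATUS_IGNORE);
   }
   // The close itself is collective; every rank participates.
   MPI_File_close(&fileHandle);
   return status == MPI_SUCCESS;
}
```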

After the previous change,

  • initial state:
    • VLSV close: 9.9 s
  • restart:
    • VLSV close: 8.8 s

Changes are in the close-speedup branch; I'll try to make this even faster.

Hmm, apparently resizing the VLSV file to the correct size before writing the data doesn't help at all, so the slowness must really be due to striping or something.
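
For the record, the resize experiment was roughly of this shape, a sketch assuming the final file size is known right after open (`fileHandle` and `finalSize` are illustrative names):

```cpp
#include <mpi.h>

// Grow the file to its final size right after opening it, so that later
// writes (including the footer) never have to extend the file.
// MPI_File_set_size is collective: every rank that opened the file calls it.
void presizeFile(MPI_File fileHandle, MPI_Offset finalSize) {
   MPI_File_set_size(fileHandle, finalSize);
   // MPI_File_preallocate(fileHandle, finalSize) would additionally force
   // allocation of the blocks, at the cost of an extra collective pass.
}
```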

Please try this on Hornet.

  • initial state: 115 GB in 83.5 s, 1.38 GB/s data rate, VLSV close 0 s
  • restart: 112.3 GB in 82.2 s, 1.37 GB/s data rate, VLSV close 0 s

I can verify that this solved the issue on Hornet! With 2.9 GB files and a stripe count of 10, the effective data rate was 1.3 GB/s; this number also includes some computation in the datareducers.
Close had no impact on performance and took 5 ms per file.

If the data on output files looks good, I'll merge this to master.

The data looks good. Restart speed on Hornet has also improved a bit; the record is now ~3.9 GB/s with a stripe count of 42. All of that time (>99%) is spent writing the distribution data, so any further optimization would have to take place there.

Cray/Lustre has many of the same collective MPI-IO optimizations in place that ADIOS has, so tuning the MPI hints might be the next step.

I tried a few (related to collective buffering) that were recommended in some Cray centers' documentation, but did not see much improvement. On the other hand, my testing was not very comprehensive, so it does not prove much.
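
Such hints are passed through an MPI_Info object at file open. The exact hints and values tried above are not recorded; the ones below are illustrative standard ROMIO/Cray collective-buffering hints:

```cpp
#include <mpi.h>

// Open a file with collective-buffering hints attached. The hint values
// here are examples only; the right settings depend on the system.
MPI_File openWithHints(MPI_Comm comm, const char* fileName) {
   MPI_Info info;
   MPI_Info_create(&info);
   MPI_Info_set(info, "romio_cb_write", "enable"); // force collective buffering on writes
   MPI_Info_set(info, "cb_nodes", "10");           // number of aggregator nodes, e.g. one per OST

   MPI_File fileHandle;
   MPI_File_open(comm, const_cast<char*>(fileName),
                 MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fileHandle);
   MPI_Info_free(&info);
   return fileHandle;
}
```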

I also did an experiment where I changed the last file I/O in the VLSV write to collective calls and turned on the no_independent_io flag (this is not the exact name, see man intro_mpi). It supposedly helps at least with open performance. I saw no difference with this flag on, so I did not bother pushing that VLSV variant. On much larger core counts there might be some effect...
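
A sketch of what the collective variant looks like; the hint name in the trailing comment is only an assumption for the "no independent I/O" setting mentioned above, not a confirmed spelling:

```cpp
#include <mpi.h>

// Collective counterpart of an independent write-at: every rank in the
// file's communicator must call this, even with byteCount == 0, which
// lets the MPI-IO layer aggregate the accesses.
void writeBlockCollectively(MPI_File fileHandle, MPI_Offset myOffset,
                            const char* buffer, int byteCount) {
   MPI_File_write_at_all(fileHandle, myOffset,
                         const_cast<char*>(buffer), byteCount,
                         MPI_BYTE, MPI_STATUS_IGNORE);
}

// The "no independent I/O" hint would then be set at open time, e.g.:
//   MPI_Info_set(info, "romio_no_indep_rw", "true");  // assumed hint name
// which allows the library to defer opens on non-aggregator ranks.
```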

As mentioned on Flowdock, Vlasiator runs on Hornet fail to restart with a "Cell migration failed" error, so there is probably some catch still...

No, that was related to another issue with the local cell cache not being updated.

This issue has been resolved, closing it.