MPI_Import
==========

A collection of experimental modules that address the Python import
bottleneck on Cray and similar supercomputers.

We currently have three approaches: MPI_Import.py, cached_import.py, and
mpiimport.pyx. The first two, developed in 2011 by Asher Langton, focus on
reducing the number of metadata requests to the IO subsystem. The latest
addition, mpiimport.pyx (2014, by Yu Feng), focuses on reducing the overall
IO burden on the subsystem. Supercomputers are unique; one approach may work
better than another depending on the architecture of the machine and on the
scale of the job.

1) MPI_Import.py
----------------

MPI_Import.py is a replacement for the standard import mechanism. It can
only be used in sections of the code that are synchronous, as every process
depends on rank 0 to handle the lookup portion of a module import. For
details and usage instructions, see the docstring at the beginning of
MPI_Import.py, as well as the following discussion (a brief usage sketch is
also given after section 3 below):

http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html

2) cached_import.py
-------------------

cached_import.py is an attempt to improve on the performance and usability
of MPI_Import through the use of PEP 302 finders and loaders. See the PyData
panel discussion (starting around 1:07:00) and my post to the NumPy list for
some background:

http://marakana.com/s/2012_pydata_workshop_panel_with_guido_van_rossum,1091/index.html
http://mail.scipy.org/pipermail/numpy-discussion/2012-March/061160.html

There are two approaches. In the first (finder, along with mpi4py_finder and
pympi_finder), the importer creates a dictionary of all modules in sys.path
and uses this for any subsequent imports. Some performance details of that
approach are here:

https://groups.google.com/d/msg/mpi4py/h_GDdAUcviw/zx9BB8FKFZIJ

The second approach (simple_finder) avoids the overhead of calling stat on
every potential module file. It maintains a dictionary of the contents of
sys.path, as well as the contents of package subdirectories as they are
visited, and then uses the standard Python probing import method. This
appears to be the best approach so far (an illustrative sketch of the idea
also follows section 3 below).

3) mpiimport.pyx
----------------

mpiimport.pyx is a parallel replacement for the Python import mechanism.
The approach avoids IO on the shared file system as much as possible. This
results in more predictable performance on large shared distributed
computers, as the number of IO operations is drastically reduced.

It replaces the import statement with a meta-path hook that performs a
parallel import. It comes with a minimalistic MPI binding (based on
libmpi.pxd from mpi4py) to avoid pulling in the dependency on mpi4py before
the parallel import is initialized.

The .py and .so files are read on the 0-th rank of MPI_COMM_WORLD, then
broadcast to the computing nodes:

a) .py files: the computing nodes 'exec' them right away;

b) .so files: the computing nodes dump them to the fast memory-based tmp
   file system, then load the .so files from there.

In addition, the expensive site.py is replaced with a three-stage
initialization, in which the examination of .pth files is also limited to
the 0-th rank.

To use the module, compile with

    make

Then copy mpiimport.so and site.py to the Python script directory. Add the
following line to the head of the script:

    import mpiimport; mpiimport.install()

Finally, modify the scripts invoking the Python script, replacing 'python'
with 'python -S'.
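For orientation, MPI_Import.py is typically used by wrapping the synchronous
import section in a context manager, roughly as below. This sketch follows
the usage shown in the discussion thread linked in section 1; the docstring
in MPI_Import.py is the authoritative reference, and the exact name of the
context manager should be checked there.

    # Usage sketch for MPI_Import.py, following the numpy-discussion thread
    # linked in section 1; see the module docstring for the exact interface.
    from MPI_Import import mpi_import

    # Every rank must reach this block together: rank 0 performs the
    # file-system lookups and the results are shared with the other ranks.
    with mpi_import():
        import numpy          # any module imported here goes through the hook
        import numpy.linalg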
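To illustrate the idea behind simple_finder described in section 2, the
sketch below caches a listing of each directory on sys.path and consults the
cache before letting the normal path-based machinery do the actual loading.
It is written against the modern importlib meta-path interface and is not
the code in cached_import.py (which targets the Python 2 import machinery);
the class name and the candidate-suffix handling are simplifications.

    # Illustrative sketch of a directory-listing cache, in the spirit of
    # simple_finder; not the actual code in cached_import.py.
    import os
    import sys
    import importlib.machinery

    class ListingCacheFinder:
        """Meta-path finder that lists each sys.path directory once and
        uses the cached listings to decide where a module could live,
        avoiding repeated stat() probes on the shared file system."""

        def __init__(self):
            self._listings = {}

        def _listing(self, directory):
            if directory not in self._listings:
                try:
                    self._listings[directory] = set(os.listdir(directory or '.'))
                except OSError:
                    self._listings[directory] = set()
            return self._listings[directory]

        def find_spec(self, fullname, path=None, target=None):
            name = fullname.rpartition('.')[2]
            # Simplified: real extension modules carry longer .so suffixes.
            candidates = (name + '.py', name + '.so', name)
            for directory in (path if path is not None else sys.path):
                if any(c in self._listing(directory) for c in candidates):
                    # Let the standard path-based machinery do the real
                    # work, restricted to the one directory that matched.
                    return importlib.machinery.PathFinder.find_spec(
                        fullname, [directory], target)
            return None  # defer to the remaining finders

    # Installed ahead of the default finders:
    # sys.meta_path.insert(0, ListingCacheFinder())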
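To make the broadcast scheme of section 3 concrete, here is a minimal sketch
of the two cases, written with mpi4py for readability. mpiimport.pyx itself
installs a meta-path hook and ships its own minimal MPI binding precisely so
that mpi4py is not needed this early; the helper names and the /tmp
destination below are illustrative only.

    # Minimal sketch of the rank-0 read-and-broadcast scheme of section 3.
    import os
    import sys
    import types
    import importlib.util
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def import_py_broadcast(name, path):
        """Rank 0 reads a .py file from the shared file system; every rank
        then exec's the broadcast source, so only one rank touches Lustre."""
        source = comm.bcast(
            open(path, 'rb').read() if comm.rank == 0 else None, root=0)
        module = types.ModuleType(name)
        module.__file__ = path
        sys.modules[name] = module
        exec(compile(source, path, 'exec'), module.__dict__)
        return module

    def import_so_broadcast(name, path, tmpdir='/tmp'):
        """Rank 0 reads a .so file; every rank dumps the bytes to the
        node-local tmp file system and loads the extension from there.
        (Ranks sharing a node would need to coordinate the write; that
        detail is omitted here.)"""
        data = comm.bcast(
            open(path, 'rb').read() if comm.rank == 0 else None, root=0)
        local = os.path.join(tmpdir, os.path.basename(path))
        with open(local, 'wb') as handle:
            handle.write(data)
        spec = importlib.util.spec_from_file_location(name, local)
        module = importlib.util.module_from_spec(spec)
        sys.modules[name] = module
        spec.loader.exec_module(module)
        return module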
Tests on BlueWaters
-------------------

(test.py uses mpiimport; test0.py doesn't use import hooks; both import
numpy and mpi4py, then quit.)

Highlights:

1) 3 minutes of start-up time for a 16K-core job

2) 50% wall-time reduction for start-up

3) 95% reduction of IO on the shared file system
   [ with mpiimport: lustre io ~= inblocks - outblocks; an equal amount of
     blocks is written to / read from /tmp.
     without mpiimport: lustre io ~= inblocks ]

4) about 60 seconds for pure Python and mpiimport start-up.

+ aprun -n 16384 python -S test.py
python ready to go at 2014-09-03 13:46:55.856611
IO: 0.117565 (97)
LOAD: 5.87299 (14)
LOADDirect: 30.3157 (9)
LOADFile: 0 (0)
COMM: 68.6621 (173)
FIND: 31.3206 (173)
ALL: 122.83 (1) 2723405
Application 5648763 resources: utime ~806734s, stime ~36413s, Rss ~54492,
inblocks ~84057878, outblocks ~72271061

real    3m18.867s
user    0m0.056s
sys     0m0.156s

+ date +%H:%M:%S.%N
13:47:07.387158124

+ aprun -n 16384 python test0.py
python ready to go at 2014-09-03 13:54:13.405897
Application 5648777 resources: utime ~48268s, stime ~70637s, Rss ~48136,
inblocks ~309473401, outblocks ~15009

real    7m17.607s
user    0m0.068s
sys     0m0.120s