
MPI_Import
===========

A collection of experimental modules addressing the Python import
bottleneck on Cray and similar supercomputers.

We currently have three approaches: MPI_Import.py, cached_import.py,
and mpiimport.pyx.

The first two, developed in 2011 by Asher Langton, focus on reducing the
number of metadata requests to the IO subsystem.
The latest addition, mpiimport.pyx, written in 2014 by Yu Feng, focuses on
reducing the overall IO burden on the subsystem.

Supercomputers are unique; one approach may work better than another
depending on the architecture of the machine and on the scale of the job.

1) MPI_Import.py
----------------

MPI_Import.py is a replacement for the standard import mechanism. It
can only be used in sections of the code that are synchronous, as
every process depends on rank 0 to handle the lookup portion of a
module import. For details and usage instructions, see the docstring
at the beginning of MPI_Import.py, as well as the following
discussion:

http://mail.scipy.org/pipermail/numpy-discussion/2012-January/059801.html
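
A minimal usage sketch, assuming the mpi_import context manager
described in the docstring and the thread above:

   # Sketch of the synchronous usage pattern; assumes MPI_Import.py
   # exposes the mpi_import context manager discussed in the linked
   # thread.
   from MPI_Import import mpi_import

   # Every rank must reach this block together: rank 0 performs the
   # file lookup and the result is broadcast to the other ranks.
   with mpi_import():
       import numpy
       import scipy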


2) cached_import.py
-------------------

cached_import.py is an attempt to improve on the performance and
usability of MPI_Import through the use of PEP 302 finders and
loaders. See the PyData panel discussion (starting around 1:07:00) and
my post to the NumPy list for some background:

http://marakana.com/s/2012_pydata_workshop_panel_with_guido_van_rossum,1091/index.html
http://mail.scipy.org/pipermail/numpy-discussion/2012-March/061160.html

There are two approaches. In the first (finder, along with
mpi4py_finder and pympi_finder), the importer builds a dictionary of
all modules found on sys.path and consults it for any subsequent
imports. Some performance details of that approach are here:

https://groups.google.com/d/msg/mpi4py/h_GDdAUcviw/zx9BB8FKFZIJ
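
As a rough illustration of that first approach (a sketch under stated
assumptions, not the actual code in finder.py; package and
extension-module handling is omitted, and imp is the 2012-era API):

   # Illustrative sketch only; the real finder also handles packages,
   # extension modules, and more.
   import imp, os, sys

   class CachedFinder(object):
       """PEP 302 meta-path finder: scan sys.path once up front,
       then answer every lookup from an in-memory dictionary."""
       def __init__(self):
           self._cache = {}
           for d in sys.path:
               if not os.path.isdir(d):
                   continue
               for name in os.listdir(d):
                   base, ext = os.path.splitext(name)
                   if ext == '.py' and base not in self._cache:
                       self._cache[base] = os.path.join(d, name)

       def find_module(self, fullname, path=None):
           return self if fullname in self._cache else None

       def load_module(self, fullname):
           if fullname in sys.modules:
               return sys.modules[fullname]
           return imp.load_source(fullname, self._cache[fullname])

   sys.meta_path.insert(0, CachedFinder())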

The second approach (simple_finder) avoids the overhead of calling
stat on every potential module file: it maintains a dictionary of the
contents of sys.path, as well as of the contents of package
subdirectories as they're visited, and then uses the standard Python
probing import method against those cached listings. This appears to
be the best approach so far.
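
The gain can be pictured like this (a sketch, not simple_finder's
actual code): existence checks are answered from cached directory
listings, so each directory costs one listdir instead of one stat per
candidate filename.

   # Illustrative sketch; simple_finder's real implementation differs.
   import os

   _listings = {}

   def cached_exists(path):
       """Answer 'does this file exist?' from a cached directory
       listing: one os.listdir per directory instead of one stat
       per candidate module filename."""
       d, name = os.path.split(path)
       if d not in _listings:
           try:
               _listings[d] = frozenset(os.listdir(d))
           except OSError:
               _listings[d] = frozenset()
       return name in _listings[d]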

3) mpiimport.pyx
----------------
mpiimport.pyx is a parallel replacement for the Python import mechanism.

The approach avoids IO on the shared file system as much as possible. This
results in more predictable performance on large shared distributed
computers, as the number of IO operations is drastically reduced.

It replaces the import statement with a meta-path hook that performs a
parallel import. It comes with a minimalistic MPI binding (based on
libmpi.pxd from mpi4py) to avoid pulling in mpi4py as a dependency before
the parallel import is initialized.

The .py and .so files are read on rank 0 of MPI_COMM_WORLD, then
broadcast to the computing nodes:
  a) .py files: the computing nodes 'exec' them right away;
  b) .so files: the computing nodes dump them to the fast, memory-based
tmp file system, then load the .so files from there.
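
The broadcast step can be sketched as follows (using mpi4py here purely
for readability; mpiimport.pyx itself uses its own minimal binding, and
the function names below are illustrative):

   # Illustrative sketch using mpi4py for clarity; mpiimport.pyx
   # uses its own minimal MPI binding instead.
   import os
   from mpi4py import MPI

   comm = MPI.COMM_WORLD

   def parallel_read(path):
       """Read a file on rank 0 only, then broadcast its bytes."""
       data = None
       if comm.rank == 0:
           with open(path, 'rb') as f:
               data = f.read()
       return comm.bcast(data, root=0)

   def exec_source(path, namespace):
       # a) .py files: every rank execs the broadcast source.
       source = parallel_read(path)
       exec(compile(source, path, 'exec'), namespace)

   def localize_extension(path, tmpdir='/tmp'):
       # b) .so files: dump the broadcast bytes to the node-local,
       # memory-backed tmp file system; load from the local copy.
       data = parallel_read(path)
       local = os.path.join(tmpdir, os.path.basename(path))
       with open(local, 'wb') as f:
           f.write(data)
       return local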

In addition, the expensive site.py is replaced with a three-stage
initialization, where the examination of .pth files is likewise limited
to rank 0.
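
The .pth handling might look roughly like this (an illustrative sketch,
again using mpi4py for readability; it ignores the 'import' lines real
.pth files may contain, and broadcast_pth_entries is a made-up name):

   # Illustrative sketch only; the real three-stage site.py differs.
   import glob, os, sys
   from mpi4py import MPI

   comm = MPI.COMM_WORLD

   def broadcast_pth_entries(site_dirs):
       """Scan .pth files on rank 0 only; all other ranks receive
       the resulting path entries via a single broadcast."""
       entries = None
       if comm.rank == 0:
           entries = []
           for d in site_dirs:
               for pth in sorted(glob.glob(os.path.join(d, '*.pth'))):
                   with open(pth) as f:
                       entries.extend(
                           line.strip() for line in f
                           if line.strip() and not line.startswith('#'))
       return comm.bcast(entries, root=0)

   # sys.path.extend(broadcast_pth_entries(site_dirs))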

To use the module, compile it with

   make

Then copy mpiimport.so and site.py to the directory of the Python script,
and add the following line at the head of the script:

   import mpiimport; mpiimport.install()

Finally, modify the shell scripts that invoke the Python script, replacing
'python' with 'python -S', so that the standard site initialization is
skipped in favor of the bundled site.py.

Tests on Blue Waters (test.py uses mpiimport; test0.py doesn't use import
hooks; both import numpy and mpi4py, then quit)
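
The script being timed is presumably along these lines (a hypothetical
reconstruction; the actual test.py ships with the repository):

   # Hypothetical reconstruction of test.py: install the hook,
   # import the heavy modules, then quit.
   import mpiimport; mpiimport.install()

   import numpy
   import mpi4py.MPI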

Highlights:

   1) ~3 minutes of start-up time for a 16K-core job
   2) 50% walltime reduction for start-up
   3) 95% reduction of IO on the shared file system
      [ with mpiimport:
         lustre io ~= inblocks - outblocks;
         an equal number of blocks is written to / read from /tmp.
        without mpiimport:
         lustre io ~= inblocks
      ]
   4) about 60 seconds for pure Python and mpiimport start-up.

+ aprun -n 16384 python -S test.py
python ready to go at 2014-09-03 13:46:55.856611
IO: 0.117565 (97)
LOAD: 5.87299 (14)
LOADDirect: 30.3157 (9)
LOADFile: 0 (0)
COMM: 68.6621 (173)
FIND: 31.3206 (173)
ALL: 122.83 (1)
2723405
Application 5648763 resources: utime ~806734s, stime ~36413s, Rss ~54492,
inblocks ~84057878, outblocks ~72271061

real    3m18.867s
user    0m0.056s
sys     0m0.156s
+ date +%H:%M:%S.%N
13:47:07.387158124
+ aprun -n 16384 python test0.py
python ready to go at 2014-09-03 13:54:13.405897
Application 5648777 resources: utime ~48268s, stime ~70637s, Rss ~48136,
inblocks ~309473401, outblocks ~15009

real    7m17.607s
user    0m0.068s
sys     0m0.120s