****************************************************************************** GPI-2 http://www.gpi-site.com Version: 1.3.0 Copyright (C) 2013-2016 Fraunhofer ITWM ****************************************************************************** 1. INTRODUCTION =============== GPI-2 is the second generation of GPI (www.gpi-site.com). GPI-2 implements the GASPI specification (www.gaspi.de), an API specification which originates from the ideas and concepts of GPI. GPI-2 is an API for asynchronous communication. It provides a flexible, scalable and fault tolerant interface for parallel applications. 2. INSTALLATION =============== Requirements: ------------- The current version of GPI-2 has the following requirements. Software: - libibverbs v1.1.6 (Verbs library from OFED) if running on Infiniband. - ssh server running on compute nodes (requiring no password). - gawk (GNU Awk) and sed utilities. Hardware: - Infiniband/RoCE device or Ethernet device. The easiest way to install GPI-2 is by using the install.sh script. The default settings install GPI-2 under /opt/GPI2/. This location can be easily modified by passing the location with the -p option to the install script. For example, ./install.sh -p /prog/GPI2 installs GPI-2 under /prog/GPI2 instead of under the default location. Infiniband support (default) ---------------------------- GPI-2 requires the libibverbs from the OFED stack. You can pass the path of your OFED installation to the install script using the option (--with-infiniband) for that, in case the install script is not able to find it: ./install.sh --with-infiniband<=full_path_to_ofed> You can see the install script usage: ./install.sh -h Ethernet support ---------------- To install GPI-2 on a system without Infiniband, you can use the option --with-ethernet. With this option standard TCP sockets are used. Such support is primarily targetted at the development of GPI-2 applications without the need to access a system with Infiniband, with less focus on performance. MPI Interoperability -------------------- If you plan to use GPI-2 with MPI to, for instance, start an incremental port of a large application or to use some libraries that require MPI, you can enable MPI interoperability with the option: ./install.sh --with-mpi<=path_to_mpi_installation> If you don't provide a path to your MPI installation and mpirun is in your PATH, the installation script will take that MPI installation. This will enable you to use GPI-2 with MPI where the only constraint is that MPI_Init() must be invoked before gaspi_proc_init(). For this MPI+GPI2 mixed mode, it is assumed that you start the application with mpirun (or mpiexec, etc.). Note that this option will build GPI-2 with MPI, requiring you to link the MPI library to your GPI-2 application (even if you are not using MPI). If you are interested in using GPI-2 only, do not build GPI-2 with this option. GPU/CUDA support ------------------- If you plan to use GPI-2 with CUDA and to allow direct data transfer between GPUs, you enable GPU support with ./install.sh --with-cuda<=path_to_cuda_installation> If you don't provide a path to your CUDA installation and nvcc is in your PATH, the installation script will take that CUDA installation. GPI-2 with CUDA support actually requires GPUDirect RDMA support, so it will be checked if the required kernel-module is installed Due to the limitations of GPUDirect RDMA, the maximal GPU memory segment size is limited to 250MB. To enable GPU support in your application, call gaspi_init_GPUs (in include/GASPI_GPU.h). The function gaspi_number_of_GPUs will return the number of supported GPUs on the actual NUMA-node while the function gaspi_GPU_ids returns the device Ids of the supported GPUs. 3. BUILDING GPI-2 ================= You can build GPI-2 on your own. There are the following make targets: all - Build everything gpi - Build the GPI-2 library (including debug version) fortran - Build Fortran 90 bindings mic - Build the GPI-2 library for the MIC tests - Build provided tests docs - Generate documentation (requires doxygen) clean - Clean-up 4. BUILDING GPI-2 APPLICATIONS ============================== GPI-2 provides two libraries: libGPI2.a and libGPI2-dbg.a. The libGPI2.a aims at high-performance and is to be used in production whereas the libGPI2-dbg.a provides a debug version, with extra parameter checking and debug messages and is to be used to debug and during development. 5. RUNNING GPI-2 APPLICATIONS ============================= The gaspi_run utility is used to start and run GPI-2 applications. A machine file with the hostnames of nodes where the application will run, must be provided. For example, to start 1 process per node (on 4 nodes), the machine file looks like: node01 node02 node03 node04 Similarly, to start 2 processes per node (on 4 nodes): node01 node01 node02 node02 node03 node03 node04 node04 The gaspi_run utility is invoked as follows: gaspi_run -m <machinefile> [OPTIONS] <path GASPI program> IMPORTANT: The path to the program must exist on all nodes where the program should be started. The gaspi_run utility has the following further options [OPTIONS]: -b <binary file> Use a different binary for first node (master). The master (first entry in the machine file) is started with a different application than the rest of the nodes (workers). -N Enable NUMA for processes on same node. With this option it is only possible to start the same number of processes as NUMA nodes present on the system. The processes running on same node will be set with affinity to the proper NUMA node. -n <procs> Start as many <procs> from machine file. This option is used to start less processes than those listed in the machine file. -d Run with GDB (debugger) on master node. With this option, GDB is started in the master node, to allow debugging the application. -p Ping hosts before starting the binary to make sure they are available. -h Show help. 5. THE GASPI_LOGGER =================== The gaspi_logger utility is used to view and separate the output from all nodes when the function gaspi_printf is called. The gaspi_logger is started, on another session, on the master node. The output of the application, when using gaspi_printf, will be redirected to the gaspi_logger. Other I/O routines (e.g. printf) will not. A further separation of output (useful for debugging) can be achieved by using the routine gaspi_printf_to which sends the output to the gaspi_logger started on a particular node. For example, gaspi_printf_to(1, "Hello 1\n"); will display the string "Hello 1" in the gaspi_logger started on rank 1. 6. GPI-2 WITH CO-PROCESSOR (INTEL XEON PHI) ========================================== GPI-2 can be used with the Intel Xeon Phi (MIC) where the MIC is used as one node (running possibly more than one GPI-2 process). To use GPI-2 on the MIC, you need to compile it using the Intel compiler with the -mmic option. The Makefile includes a build target mic for that end but this build target is not used by the installation script. After successful compilation, GPI-2 for the MIC can be found under lib64/mic. If you are having problems with the compilation, make sure the OFED_PATH and OFED_MIC_PATH in src/make.inc is setup correctly. Assuming that the MIC(s) is set up properly, GPI-2 requires: - ssh connectivity (requiring no password) - grouping of local host and MIC(s) by using the tool micctrl micctrl --initdefaults micctrl --addbridge=br0 --type=internal --ip=172.31.1.254 micctrl --network=static --bridge=br0 --ip=172.31.1.1 where for instance, mic0: 172.31.1.1, mic1: 172.31.1.2 etc. The MIC must be visible to the whole cluster network. For instance, if your cluster network is 192.168.1.0 on eth0 and you want to map mic0 to 192.168.1.100 do: iptables -t nat -A PREROUTING -d 192.168.1.100 -i eth0 -j DNAT --to-destination 172.31.1.1 iptables -t nat -A POSTROUTING -s 172.31.1.1 -o eth0 -j SNAT --to-source 192.168.1.100 After setup, the MIC hostname can simply be added to the machinefile and be used as another host. 7. GPU/CUDA SUPPORT =================== To enable GPU support in your application, call gaspi_init_GPUs. This function will check how many supported GPUs are available. The function gaspi_number_of_GPUs will return the number of supported GPUs on the actual NUMA-node, the function gaspi_GPU_ids returns the device Ids of the supported GPUs. Segment Creation ----------------- To create a GPU memory segment, use the flag GASPI_MEM_GPU. The segment will be allocated on the current active GPU. Note that the active GPU may change by calling gaspi_init_GPUs and therefore should be set new before creating a GPU memory segment. Segment pointer ---------------- The segment pointer of a GPU memory segment is a GPU device pointer. It can be used in CUDA-related functions (eg. cudaMemcpy) and in CUDA-kernels. However, a direct access to host memory results in a segmentation fault. Host memory segment -------------------- Of course, GPU memory segments can be used together with host memory segments. If GPU-support is enabled, a new host memory segments is also registered for the GPUs. Therefore, a segment pointer of a local host memory segments can be used with asynchronous CUDA copy operations for asynchronous local data transfers. Special functions -------------------- Due to some issues with the PCIe-Express bus, the bandwidth of direct GPU-to-GPU transfers is limited for larger messages sizes. Therefore, gaspi_gpu_write and gaspi_gpu_write_notify are optimized for remote write operations with a GPU memory segment as source. These functions use one-sided copies in host memory. Therefore, gaspi_gpu_write is one-sided, but not fully asynchronous with respect to the host, since the host thread is required to control the data flow. Small data packages are still transfered directly. However, it can still be overlapped with computation on the GPU. gaspi_write and gaspi_write_notify still can be used if a full asynchronous, but slower data transfer is preferred. Limitations ------------- Since the maximal size pinnable GPU memory is limited to less than 256MB on most GPUs, the maximal size of a GPU memory segment is also limited to this size. However, this issue can be removed by using a firmware patch for the current K20 and K40 GPUs and will be removed in future versions. To support direct data transfers, the Infiniband device and the GPU must be located on the same NUMA socket. Otherwise, direct data transfers are not supported. GPI-2 checks this at the initialization. 8. TROUBLESHOOTING AND KNOWN ISSUES =================================== If you're having troubles when building GPI-2 with support for Infiniband, make sure you have the OFED stack correctly installed and running. You can change the OFED path in the definitions file make.inc in the source directory (src) to the fix path in your system. When installing GPI-2 with MPI mixed-mode support (using the option --with-mpi<=path_to_mpi_installation>) and the installation is failing when trying to build the tests due to missing libraries, you can provide extra paths through available variables and be able to find the missing paths. These are: GPI2_EXTRA_CFLAGS, GPI2_EXTRA_LIBS_PATH, GPI2_EXTRA_LIBS. For example: GPI2_EXTRA_LIBS=-lmpl ./install.sh --with-mpi -p /opt/GPI2 Environment variables --------------------- You might have some trouble when your application requires some dynamically set environment setting (e.g. the LD_LIBRARY_PATH), for instance, through the module system of your jobs batch system. Currently, neither the gaspi_run or the GPI-2 library take care of such environment settings. To this situation there are 2 workarounds: i) you set the required environment variables in your shell initialization file (e.g. ~/.bashrc). ii) you create an executable shell script which sets the required environment variables and then starts the application. Then you can use gaspi_run to start the application, providing the shell script as the application to execute. gaspi_run -m machinefile ./my_wrapper_script.sh where my_wrapper_script.sh contains: #!/bin/sh LD_LIBRARY_PATH=$LD_LIBRARY_PATH:<path_to_my_lib> <path_to_my_application>/my_application <my_app_args> exit $? If you're running in MPI mixed-mode, starting your application with mpirun/mpiexec, this should not be an issue. 9. UP COMING FEATURES ===================== GPI-2 is on-going work and more features are still to come. Here are some that are in our roadmap: - support to add spare nodes (fault tolerance) - better debugging possibilities 10. MORE INFORMATION ==================== For more information, check the GPI-2 website ( www.gpi-site.com ) and don't forget to subscribe to the GPI-2 mailing list. You subscribe it at https://listserv.itwm.fraunhofer.de/mailman/listinfo/gpi2-users