PFIO: a High-Performance Client-Server I/O Layer

Introduction

GEOS-5 related applications (such as GEOSgcm, GEOSctm, GEOSldas, GCHP, etc.) produce a large number of output files, organized in several file collections that are created at different time frequencies. As the model resolution increases, the amount of data produced grows significantly, straining the file system, especially if a single processor is in charge of writing out all the files.

PFIO, a subcomponent of the MAPL package, was designed to facilitate the production of model output files (organized in collections) in a distributed computing environment. PFIO creates output files asynchronously, therefore allowing the model to proceed with calculations without waiting for the I/O to be completed. This reduces the overall model integration time.

In the context of GEOS-5, the available nodes (cores) are split into two groups:

  • The computing nodes that are reserved for model calculations
  • The I/O nodes that are grouped to form the PFIO Server

All the file collections generated by the MAPL HISTORY (MAPL_History) subcomponent are routed through the PFIO Server, which distributes the output files to the I/O nodes (based on the user's configuration set at run time). One of the features of PFIO is that it can also be set to run in the standard Message Passing Interface (MPI) root-processor configuration. This can be important if the model is integrated at low resolution and/or generates only a few file collections.

In this document, we explain how and when to configure the PFIO Server to run on separate resources. We also provide general recommendations on how to properly configure the PFIO Server in order to get the best possible performance. It is important to note that it is up to users to run their application multiple times to determine the optimal PFIO Server configuration.

MpiServer Class

  • The clients send their data to the oserver.
  • All the processors in the oserver coordinate to create a different shared-memory window for each collection.
  • The processors use one-sided MPI_PUT calls to place the data into the shared memory (see the sketch after this list).
  • Different collections are written by different processors. Those writing processors are distributed among the nodes as evenly as possible.
  • All the other processors have to wait for the writing processors to finish their jobs before responding to the clients’ next round of requests.
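
The following is a minimal, self-contained sketch of the MPI primitives involved. It is not PFIO source code; the buffer sizes and layout are made up for illustration. One rank allocates a shared-memory window and the other ranks on the node deposit their data into it with one-sided MPI_Put calls.

    ! Minimal sketch (not PFIO code): rank 0 exposes a shared-memory window and
    ! the other ranks deposit data into it with one-sided MPI_Put.
    ! All ranks must run on a single node for MPI_Win_allocate_shared to succeed.
    program shared_window_put
       use mpi_f08
       use iso_c_binding, only: c_ptr, c_f_pointer
       implicit none
       integer, parameter :: nvals = 10            ! values contributed per rank
       integer :: ierr, rank, npes, disp_unit
       integer(kind=MPI_ADDRESS_KIND) :: winsize, target_disp
       type(MPI_Win) :: win
       type(c_ptr)   :: baseptr
       real(kind=8), pointer :: shared(:)
       real(kind=8) :: payload(nvals)

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
       call MPI_Comm_size(MPI_COMM_WORLD, npes, ierr)

       disp_unit = 8                               ! displacements counted in real(8) elements
       winsize   = 0
       if (rank == 0) winsize = int(nvals*npes, MPI_ADDRESS_KIND) * 8

       ! Rank 0 provides all the memory; every rank shares the same window.
       call MPI_Win_allocate_shared(winsize, disp_unit, MPI_INFO_NULL, &
                                    MPI_COMM_WORLD, baseptr, win, ierr)

       payload = real(rank, kind=8)

       call MPI_Win_fence(0, win, ierr)            ! open the access epoch
       target_disp = int(rank, MPI_ADDRESS_KIND) * nvals
       call MPI_Put(payload, nvals, MPI_DOUBLE_PRECISION, 0, target_disp, &
                    nvals, MPI_DOUBLE_PRECISION, win, ierr)
       call MPI_Win_fence(0, win, ierr)            ! close the epoch: data is now visible

       if (rank == 0) then                         ! the "writing" process reads the window
          call c_f_pointer(baseptr, shared, [nvals*npes])
          print *, 'first value deposited by each rank:', shared(1::nvals)
       end if

       call MPI_Win_free(win, ierr)
       call MPI_Finalize(ierr)
    end program shared_window_put

In PFIO a window holds an entire collection and the designated writing processor then writes its contents to the output file; the sketch only illustrates the data movement into shared memory.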

[Figure: MultiServer]

MultiGroupServer Class

  • The oserver is divided into a frontend and a backend.
  • When the frontend receives the data, its root process asks the backend’s root (or head) for an idle process for each collection. It then broadcasts this information to the other frontend processes.
  • When the frontend processors forward (MPI_SEND) the data to the backend (different collections to different backend processors), they return to the clients without waiting for the actual writing (a minimal sketch of this hand-off follows this list).
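
Below is a minimal sketch of that hand-off, again illustrative only and not PFIO code: one rank plays a frontend process that forwards a field with a non-blocking MPI_Isend and is then free to do other work, while a second rank plays the backend writer.

    ! Minimal sketch (not PFIO code): rank 0 acts as a frontend that forwards a
    ! field with a non-blocking send; rank 1 acts as a backend writer.
    ! Run with at least two MPI ranks.
    program frontend_backend_forward
       use mpi_f08
       implicit none
       integer :: rank, ierr
       real(kind=8) :: field(1000)
       type(MPI_Request) :: req

       call MPI_Init(ierr)
       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

       if (rank == 0) then                          ! "frontend"
          field = 1.0d0
          call MPI_Isend(field, size(field), MPI_DOUBLE_PRECISION, 1, 0, &
                         MPI_COMM_WORLD, req, ierr)
          ! ... free to service the next client request here ...
          call MPI_Wait(req, MPI_STATUS_IGNORE, ierr)   ! complete before reusing the buffer
       else if (rank == 1) then                     ! "backend"
          call MPI_Recv(field, size(field), MPI_DOUBLE_PRECISION, 0, 0, &
                        MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
          ! ... the slow step: write 'field' to the output file here ...
       end if

       call MPI_Finalize(ierr)
    end program frontend_backend_forward

In PFIO the backend processes perform the actual file writing, so the frontend (and therefore the model) never blocks on the file system.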

[Figure: MultiGroup]

[Figure: dist]

Command Line

There are two options to submit an executable through the mpirun or mpiexec command.

Regular MPI Command

If the regular mpirun or mpiexec command is used:

    mpirun -np npes Executable

the MpiServer is used as the oserver. The client processes overlap with the oserver processes, and the client and oserver work sequentially. When the client sends data, it actually makes a copy; then the oserver takes over the work, i.e., shuffling the data and writing it to disk. After the MpiServer is done, the client moves on.
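
For instance, the standalone demo program introduced later in this document could be launched this way on a single 28-core node; all 28 processes then act as both the model and the MpiServer:

    mpiexec -np 28 pfio_MAPL_demo.x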

Command with IOserver Options

n1 processes for the model and n2 processes for the MpiServer

    mpirun -np npes Executable --npes_model n1 --npes_output_server n2
  • Note that $npes$ is not necessarily equal to $n1+n2$.
  • The client (model) will use the minimum number of nodes that contain $n1$ cores.
    • For example, if each node has $n$ cores and $n2 \le n$, then $npes = \lceil \frac{n1}{n} \rceil \times n + n$ (the oserver occupies one full node).
  • If --isolate_nodes is set to false (by default, it is true), the oserver and client can co-exist on the same node, and $npes = n1 + n2$.
  • --npes_output_server n2 can be replaced by --nodes_output_server n2. In that case $n2$ counts whole nodes and $npes = \lceil \frac{n1}{n} \rceil \times n + n2 \times n$ (see the worked example below).
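
As a hypothetical illustration (the process counts are made up, and we assume 28-core nodes as in the demo example later in this document), 100 model processes with two dedicated output nodes could be requested as:

    mpirun -np 168 Executable --npes_model 100 --nodes_output_server 2

The client occupies $\lceil \frac{100}{28} \rceil \times 28 = 112$ cores (4 full nodes) and the oserver $2 \times 28 = 56$ cores, so $npes = 112 + 56 = 168$.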

n1 processes for the model and n2 processes for the MultiGroupServer

    mpirun -np npes Executable --npes_model n1 --npes_output_server n2 --oserver_type multigroup --npes_backend_pernode n3
  • On each oserver node, $n3$ processes are used as the backend and the remaining $n - n3$ as the frontend.
  • For example, if each node has $n$ cores and the oserver occupies $n2$ nodes (e.g. via --nodes_output_server n2), then $npes = \lceil \frac{n1}{n} \rceil \times n + n2 \times n$.
  • In that case the frontend has $n2 \times (n-n3)$ processes and the backend has $n2 \times n3$ processes (see the worked example below).
  • If $n2$ is instead given as a number of processes (--npes_output_server n2), the oserver occupies $\lceil \frac{n2}{n} \rceil$ nodes, so the frontend has $\lceil \frac{n2}{n} \rceil \times (n-n3)$ processes and the backend has $\lceil \frac{n2}{n} \rceil \times n3$ processes.
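
For instance (hypothetical numbers, 28-core nodes), 112 model processes with two MultiGroupServer nodes and 5 backend processes per node:

    mpirun -np 168 Executable --npes_model 112 --nodes_output_server 2 --oserver_type multigroup --npes_backend_pernode 5

gives $npes = \lceil \frac{112}{28} \rceil \times 28 + 2 \times 28 = 168$, with $2 \times (28-5) = 46$ frontend and $2 \times 5 = 10$ backend processes.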

Passing a vector of oservers

    mpirun -np npes Executable --npes_model n1 --npes_output_server n2 n3 n4
  • The command creates an $n2$-node, an $n3$-node, and an $n4$-node MpiServer.
  • The oservers are independent. The client takes turns sending data to the different oservers (see the worked example below).
  • If each node has $n$ processors, then $npes = \lceil \frac{n1}{n} \rceil \times n + (n2+n3+n4) \times n$.
  • Advantage: since the oservers are independent, the client has the choice to send the data to an idle oserver.
  • Disadvantage: finding an idle oserver is not easy.
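
As a hypothetical illustration (28-core nodes, $n1 = 112$):

    mpirun -np 224 Executable --npes_model 112 --npes_output_server 1 1 2

creates three independent MpiServers of 1, 1, and 2 nodes, with $npes = 112 + (1+1+2) \times 28 = 224$.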

Passing a vector of oservers and the MultiGroupServer

    mpirun -np npes Executable --npes_model n1 --npes_output_server n2 n3 n4 --oserver_type multigroup --npes_backend_pernode n5
  • The command creates an $n2$-node, an $n3$-node, and an $n4$-node MultiGroupServer.
  • The oservers are independent. The client takes turns sending data to the different oservers.
  • If each node has $n$ processors, then $npes = \lceil \frac{n1}{n} \rceil \times n + (n2+n3+n4) \times n$.
  • The oservers have $n2 \times n5$, $n3 \times n5$, and $n4 \times n5$ backend processes, respectively (see the worked example below).
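
With the same hypothetical layout (28-core nodes, $n1 = 112$) and 5 backend processes per node:

    mpirun -np 224 Executable --npes_model 112 --npes_output_server 1 1 2 --oserver_type multigroup --npes_backend_pernode 5

the three MultiGroupServers have $1 \times 5$, $1 \times 5$, and $2 \times 5$ backend processes, respectively, and $npes$ is still $112 + (1+1+2) \times 28 = 224$.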

MpiServer using one-sided MPI_PUT and shared memory

    mpirun -np npes Executable --npes_model n1 --npes_output_server n2 --one_node_output true
  • The option --one_node_output true makes it easy to create $n2$ oservers, each of which is a one-node oserver (see the example below).
  • It is equivalent to --nodes_output_server 1 1 1 1 1 ... with n2 “1”s.
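
For example (hypothetical layout, 28-core nodes):

    mpirun -np 196 Executable --npes_model 112 --npes_output_server 3 --one_node_output true

creates three independent one-node MpiServers, exactly as --nodes_output_server 1 1 1 would, with $npes = 112 + 3 \times 28 = 196$.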

Additional Options

--fast_oclient true

  • After the client sends history data to the oserver, by default it waits and makes sure all the data has been sent, even though it uses non-blocking isend calls. If this option is set to true, the client copies the data before the non-blocking isend and returns immediately. It waits for completion and cleans up the copies the next time it re-uses the oserver (see the example below).
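
For example, the flag is simply appended to any of the launch commands above (hypothetical layout, 28-core nodes):

    mpirun -np 168 Executable --npes_model 112 --nodes_output_server 2 --fast_oclient true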

Example

The file pfio_MAPL_demo.F90 is a standalone program that demonstrates the use of PFIO. It writes several time records of 2D and 3D arrays. Compiling the program generates the executable pfio_MAPL_demo.x. If we reserve 2 Haswell nodes (28 cores each), run the model on 28 cores, and use 1 MultiGroupServer node with 5 backend processes, then the execution command is:

    mpiexec -np 56 pfio_MAPL_demo.x --npes_model 28 --oserver_type multigroup --nodes_output_server 1 --npes_backend_pernode 5
  • The frontend has $28-5=23$ processes and the backend has $5$ processes.

Performance Analysis

We create a collection that contains:

  • one 2D variable (IMxJM)
  • one 3D variable (IMxJMxKM)

Three (3) 'daily' files are written out and each of them contains six (6) time records. We measure the time to perform the IO operations. Note that no calculations are involved here; we only do the array initialization.

PFIO has a profiling tool which is exercised by passing the command line option: --with_io_profiler true

    mpiexec -np 56 $MAPLBIN/pfio_MAPL_demo.x --npes_model 28 --oserver_type multigroup --nodes_output_server 1 --npes_backend_pernode 5 --with_io_profiler true

It returns the following timing statistics:

  • Inclusive: all time spent between start and stop of a given timer.
  • Exclusive: all time spent between start and stop of a given timer _except_ time spent in any other timers.
  • o_server_front: timers reported by the oserver frontend.
  • --wait_message: time the frontend spends waiting for data from the application.
  • --add_Histcollection: time spent adding history collections.
  • --receive_data: total time the frontend spends receiving data from the application.
  • ----collection_1: time spent receiving collection_1.
  • --forward_data: total time the frontend spends forwarding data to the backend.
  • ----collection_1: time spent forwarding collection_1.
  • --clean up: time spent finalizing the oserver.

IM=360 JM=181 KM=72

with 5 Backend PEs/node

    =============
    Name                 Inclusive % Incl Exclusive % Excl Max Excl  Min Excl  Max PE Min PE
    i_server_client       0.324201 100.00  0.324201 100.00  0.520954  0.245613  0016   0023

    Final profile
    =============
    Name                 Inclusive % Incl Exclusive % Excl Max Excl  Min Excl  Max PE Min PE
    o_server_front        0.357244 100.00  0.053738  15.04  0.881602  0.013470  0000   0002
    --wait_message        0.047207  13.21  0.047207  13.21  0.052244  0.040038  0011   0013
    --add_Histcollection  0.003346   0.94  0.003346   0.94  0.005641  0.000294  0002   0007
    --receive_data        0.194778  54.52  0.000496   0.14  0.000696  0.000367  0013   0019
    ----collection_1      0.194282  54.38  0.194282  54.38  0.421234  0.113870  0013   0021
    --forward_data        0.057849  16.19  0.017939   5.02  0.051281  0.000058  0020   0018
    ----collection_1      0.039910  11.17  0.039910  11.17  0.048129  0.030721  0018   0019
    --clean up            0.000325   0.09  0.000325   0.09  0.000529  0.000244  0009   0017

In the table below, we report the Inclusive time for the two main IO components as the number of backend PEs per node varies:

Number of Backend PEs/node    i_server_client    o_server_front
1
2                             1.186932           1.813097
3                             0.291334           1.216281
4                             0.259511           0.296956
5                             0.324201           0.357244

IM=720 JM=361 KM=72

with 5 Backend PEs/node

    =============
    Name                 Inclusive % Incl Exclusive % Excl Max Excl  Min Excl  Max PE Min PE
    i_server_client       1.050624 100.00  1.050624 100.00  1.515223  0.822786  0015   0025

    Final profile
    =============
    Name                 Inclusive % Incl Exclusive % Excl Max Excl  Min Excl  Max PE Min PE
    o_server_front        1.250806 100.00  0.128693  10.29  2.737311  0.008478  0000   0012
    --wait_message        0.108261   8.66  0.108261   8.66  0.130712  0.081595  0008   0022
    --add_Histcollection  0.003061   0.24  0.003061   0.24  0.004589  0.001020  0004   0002
    --receive_data        0.789012  63.08  0.000642   0.05  0.000909  0.000484  0013   0019
    ----collection_1      0.788370  63.03  0.788370  63.03  1.568300  0.406615  0013   0021
    --forward_data        0.221412  17.70  0.102570   8.20  0.378546  0.000081  0021   0018
    ----collection_1      0.118842   9.50  0.118842   9.50  0.145169  0.090811  0013   0021
    --clean up            0.000367   0.03  0.000367   0.03  0.000552  0.000256  0004   0012

In the table below, we report the Inclusive time for the two main IO components as the number of backend PEs per node varies:

Number of Backend PEs/node    i_server_client    o_server_front
1
2                             3.378511           5.795466
3                             0.977153           6.262224
4                             1.009190           1.203735
5                             1.050624           1.250806

Implementation of PFIO in LIS

PFIO was added to the LIS code as a library. A new file, LIS_PFIO_HistoryMod.F90, was created to include the PFIO-related calls that produce the LIS HISTORY. PFIO offers the option (see the above section) to run the code using either the standard mpirun command or an IO server (reserved for producing HISTORY only). Because the LIS code only has one collection (one HISTORY file), only one output server node (with any number of backend cores) is needed to use PFIO. Basically, we will have the commands (assuming that we have 28 cores per node):

   set num_cores_per_node = 28
   set           tot_npes = 224
   # one full node (28 cores) is set aside for the PFIO output server
   @ comp_npes = ${tot_npes} - ${num_cores_per_node}
   mpiexec -n $tot_npes $EXE --npes_model $comp_npes --oserver_type multigroup --nodes_output_server 1 --npes_backend_pernode 2

Therefore, there is only one command line configuration when PFIO is used in LIS.

For PFIO to be effective in LIS, at least two requirements must be met:

  • The process of producing the HISTORY files is significantly more expensive than the calculations.
  • The time needed to fully create (write to disk) a HISTORY file is less than the model integration time between two consecutive outputs (see the illustration below).
    • If not, the output node might be continually oversubscribed.
    • By design, the PFIO frontend processors (FPs) forward the data to the backend and return to the clients (compute processors) without waiting for the actual writing of the data. It is therefore possible for the clients to send new requests to the FPs while the FPs are still sending data to the backend.
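
As a purely illustrative example with made-up numbers: if compressing and writing one HISTORY file takes about 200 seconds on the output node while the model reaches the next output time after only 150 seconds, requests accumulate on the output node faster than they can be drained and the run eventually stalls on I/O; if the write takes only 100 seconds, the output node stays ahead of the model and the I/O cost is effectively hidden.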

The LIS code does not have a profiler. In all the experiments we have done, we measure the total elapsed times. We plan to integrate a profiling tool in LIS that will allow us to better capture the time it takes to execute various components of the code.

Test Case

Model Configuration

  • 5901x2801 grid points
  • one-day integration with output produced every 3 hours (8 files with one record each)
  • The fields to be written out are:
    • 80 2D fields
    • 4 3D fields (with 4 levels)

Without any data compression, each output file produced here requires 6.43 GB. Our goal here is not only to reduce the time spent on IO but also to decrease the file size by applying data compression.

In all the experiments we did, the LIS/PFIO code produced bitwise-identical results to the original version of the code.

Results with 112 Compute Cores

We ran the original version of the LIS code (ORG) and the one with the PFIO implementation (PFIO) using a $14 \times 8$ processor decomposition. As the data compression level varies, we recorded the average output file size (over the 8 files) and the total time to complete the integration.

Note that the PFIO configuration uses one additional node (reserved for output) with respect to the original version of the code.

Deflation Level   Average File Size (GB)   Total Time (s)
                  ORG      PFIO            ORG     PFIO
0                 6.43     6.43            734     1023
1                 1.76     1.71            1484    1213
3                 1.74     1.65            1928    1403
5                 1.75     1.61            2121    1670
7                 1.74     1.59            3388    2376
9                 1.73     1.58            3948    8297

[Figure: fig_stats]

Results when the Number of Compute Cores Varies

We use a data compression level of 1 and let the number of compute cores vary. We record the total time for a one-day integration, with a HISTORY file created every three hours.

# Compute Cores       ORG (s)   PFIO (s)   Gain/Loss
$112=14 \times 8$     1484      1213       +18%
$140=14 \times 10$    1340      1198       +11%
$168=14 \times 12$    1292      1272       +1.5%
$196=14 \times 14$    1267      1374       -8.4%

As the number of compute cores increases, PFIO becomes less attractive. This is most likely because the output is not completed before the model finishes its calculations, therefore creating data congestion on the output server node.

Test Case 1

  • 5901x2801 grid points
  • Four-day integration with output produced every 1.5 hours (73 files with one record each)
  • The fields to be written out are:
    • 10 2D fields
    • 2 3D fields (with 4 levels)
  • Each file without data compression has a size of 412 MB.

# compute cores   84        112       140       168      196      224      504      1008
Run Method        817.88    614.58    491.55    410.19   350.77   307.14   137.03   68.35
Output            395.95    334.99    299.88    290.55   268.05   252.47   364.71   216.24
Overall           1495.24   1224.56   1062.53   968.48   889.67   825.61   687.49   555.93

Table 1.1 LIS: timing statistics (in seconds) as the number of processors varies.

# compute cores   # IO Nodes   Run Method   Output    Overall
224               1            312.93       56.82     604.91
                  2            309.48       33.12     565.33
504               1            137.37       942.47    1289.03
                  2            140.17       193.81    553.24
                  3            139.95       78.22     425.78
644               3            114.77       134.53    447.00
                  4
784               3            89.97        284.14    574.70
                  4
1008              4            69.97        217.53    495.42
                  5            70.01        135.00    397.69

Table 1.2 LIS/PFIO: timing statistics (in seconds) as the number of compute processors and the number of IO nodes vary. In each case, we use 2 backend cores per IO node and set the number of virtual output collections to be two times the number of IO nodes.

Test Case 2

  • 5901x2801 grid points
  • one-day integration with output produced every 1.5 hours (24 files with one record each)
  • The fields to be written out are:
    • 80 2D fields
    • 4 3D fields (with 4 levels)
  • Each file without compression has a size of 6.43 GB.

# compute cores   84        112       140       168       196       224       1008
Run Method        373.71    280.62    224.90    186.62    160.74    140.67    31.24
Output            1209.94   1229.54   1194.51   1228.61   1436.07   1420.89   1331.73
Overall           2269.26   2171.23   2069.12   2059.28   2235.15   2188.14   2050.97

Table 2.1 LIS: timing statistics (in seconds) as the number of processors varies.

# compute cores   # IO Nodes   Run Method   Output     Overall
224               2            143.93       1021.35    1963.34
                  3            144.25       535.86     1422.43
504               2            64.10        2106.05    4799.04
                  3            64.07        1631.45    2358.15
1008              6            31.81        1341.00    2270.67
                  8            31.81        603.19     1899.65

Table 2.2 LIS/PFIO: timing statistics (in seconds) as the number of compute processors and the number of IO nodes vary. In each case, we use 2 backend cores per IO node and set the number of virtual output collections to be two times the number of IO nodes.