This is the regression test suite for PLFS. Description: The regression suite will pull in needed sources into its environment and run a series of tests on that configuration. It is designed to be self- reliant; it should not call on anything outside its directory structure to accomplish anything when running tests. This way the regression environment is easily considered by looking at what is currently in the regression directory structure and log files. A regression run has four phases: - build phase: all dependencies are retrieved and built, if neccessary. - configuration and directory check: check required configuration files of all packages and that the plfs-related directories can be created. - test submittal: tests are submitted. If no tests are submitted (either by configuration or by errors), this phase becomes the final phase for a regression run. If specified, a summary email will be sent. - test checking: if at least one test is run, this last stage is entered. Each test that was run will have it's results checked and logged. If specified, a summary email will be sent. Information on writing tests for the regression suite is located in: tests/README. Dependencies: - What ever compiler that is desired should be setup either through login scripts or in the regression suite's config file (see below). - The following packages/programs are required: - Python >= 2.5 - PLFS: can be either in the form of source or an already built version. PLFS source is used in at least one test so is required even when providing an already built version of PLFS. When given source, the regression suite will build PLFS. - MPI (for running parallel tests): can be either in the form of a source tarball or an already patched for PLFS MPI installation. If the tarball is given, MPI will be built. NOTE: if MPI is to be built, make sure the correct version of autotools (m4, autoconf, automake, libtool) are already in your PATH and LD_LIBRARY_PATH. Otherwise, the building of MPI may fail. NOTE: at this time, only Open MPI tarballs are supported. - fs_test (for running specific IO patterns and many tests use it): can be in the form of source or an already compiled version. It is available in the fs-test project (https://github.com/fs-test). If source is given, fs_test will be compiled against the regression suite's PLFS and MPI. - experiment_management (for creating test commands tailored to the system being run on): available in the fs-test project (https://github.com/fs-test). This package does not require complilation. - job scheduler (for submitting and tracking parallel tests): the regression suite relies on the command 'checkjob' to determine if submitted jobs have completed or not. Additionally, the following line is expected to be in checkjob's output: Status <status> The regression suite will look for the following values for <status>: Completed or Removed. If the job scheduler on the system does not provide checkjob with that format, the regression suite will be unable to determine when jobs complete and will error out. It is known that the MOAB job scheduler provides this functionality. - A working ~/.plfsrc file consistent with the version of plfs that is to be tested. It is not necessary that all of the directories for the PLFS mount points and backends exist. The regression suite will ensure that these directories exist and will attempt to create them if they are missing. The backends should be located on a file system space that will be shared between all nodes (compute nodes as well as the node that will be running run_plfs_regression.sh). Make sure that the user has write permissions for all paths. Using mount_points and backends that the user does not have permissions to create but does have write permissions on is permissible only if all of the required directories are already created. In this way, a system configuration of PLFS can be tested. Tests within the regression suite will use these mount points, experiment_management input (see below) and run-time information to construct file targets that will be used during the test. Without the applicable experiment_management setting, targets will be placed directly into the mount point. Note about the check for directories and plfs rc files: The checks for plfs related directores and rc files happen after the build phase. If build phase was successfull but any of the checks fail, it is not necessary to rerun the build phase. Instead, use run_plfs_regression.sh's '--nobuild' option to skip the build phase. The checks will still be done before the test submittal phase. Fuse tests that run on compute nodes will make sure the mount_point specified in the plfsrc is present. This way it is possible to use a location such as /var/tmp, but users will not have to log in to each compute node individually to make sure the mount point exists. This checking for the mount_point's existance is actually done on any fuse test regardless of where the test runs if the test was written correctly. - In regards to experiment_management, an rc file must be present and define experiment_management's runcommand, ppn, and outdir as individual tests will not pass these on the command line when invoking experiment_management. In addition, msub should be defined to include "-j oe" as many of the tests expect to look for only one output file. Note: outdir should be a relative path for now. It is expected that this directory will be placed in the test's directory, and absolute paths don't guarantee this. Note about the check for experiment_management rc files: Like the check for plfs rc files, the check for experiment_management rc files is done after the build stage. If the build phase was successfull, but the check on experiment_management rc files fails, it is not necessary to rerun the build phase; use '--nobuild' to skip the build stage. The test of the rc files will be done before the test submittal stage. - A config file in the Regression directory. A sample is provided as config.sample. Copy this file as config (in the same directory) and edit config accordingly. All information related to specifying where to find dependencies will be entered in that file. It is heavily commented in an attempt to explain all the different options available. - MY_MPI_HOST environment variable set as in the experiment_management framework. This variable needs to be set in such a way that it is defined on compute nodes. Set it in log-in scripts or using the env_customize.sh script (covered in the Quick Start section). - Define MPI_CC environment variable as the command to compile parallel programs. For example, set it to mpicc (generic mpi implementation) or cc (cray implementation). - If email capabilities are needed/wanted, /bin/mail needs to be available. - If Open MPI is to be built by the regression suite, a platform file must be used when compiling Open MPI to make it aware of the locations of the PLFS library. A sample platform file is provided as openmpi_platform_file.sample. It contains the minimum settings necessary to make Open MPI aware of the PLFS library. It can be copied to openmpi_platform_file and modified, but the REPLACE_PLFS_* characters must remain so that the regression suite can accurately generate a usable platform file. The regression suite will first check for the existance of openmpi_platform_file. If it is not present, the regression suite will check for openmpi_platform_file.sample. This way, if a custom platform file is not needed, users can do nothing and the sample will be used. - If putting test target files directly into the PLFS mount point is not desired, specify rs_mnt_append_path in experiment_management's rc file. Assign it a path relative to the mount point. Tests will append this path to the path of the mount point in constructing paths for test file targets: /<mount_point>/<rs_mnt_append_path>/<test file target> The regression suite will make sure the directory specified by rs_mnt_append_path exists on all backends which will ensure that it exists as /<mount_point>/<rs_mnt_append_path>. Usage: ./run_plfs_regression.sh [OPTIONS ...] Running run_plfs_regression.sh without options will start a regression run with a configuration consistent with what is set up in the config file and default values. Please see config.sample for descptions of all of the options. Use run_plfs_regression.sh's '-h|--help' switch to list the options that can be used and their effect on the parameters in the config file. Parameters set on the command line over-ride those set in the config file. Quick Start/Sanity Check: 1) Start run_plfs_regression.sh with the --notests command line switch after setting up the config file. Deal with any issues in building. Logs are located in the logs directory. 2) Run a simple test: tests/write_read_no_error or tests/adio_write_read. Both of these tests are simple write and read tests. Change directory into either of the above and run ./reg_test.py. A single job should be submitted. When the job has finished, run ./check_results.py. Deal with any issues in getting the test to run; this may require doing step 1 again. If there are environment issues, consider creating tests/utils/env_customize.sh to deal with them. This file will be sourced by tests/ustils/rs_env_init.sh, which is the main script that makes sure the environment is set up to use the regression suite's executables and libraries. /tests/utils/env_customize.sh will be sourced by run_plfs_regression.sh and by tests running on compute nodes, so it may be necessary to put conditionals in env_customize.sh to make sure the proper hosts are all getting the right environment. Env_customize.sh must be written in bash or sh. As an example, a library is found by default on a machine's frontend, but not on the compute nodes. Env_customize.sh could modify LD_LIBRARY_PATH based on compute nodes names such that the library will be found when it comes time to run binaries on the compute nodes. 3) If the above worked, it should be ok to run the whole regression suite (e.g. run_plfs_regression.sh --nobuild --testtypes=2,3) Files and directories in the regression suite: - There should only ever be two entry points into the regression suite: run_plfs_regression.sh and check_tests.py. As such, there are only two exit points for the regression suite: from those same files. However, it should be noted that the regression scripts have been built with the idea of modularity in mind. That is, it should be possible to run each script separately if need be. - config.sample: sample config file for configuration. A similar file should be in the Regression directory named config. - run_plfs_regression.sh: The main regression script. Will build the needed source code or link to already existing binaries and libraries. If there are any issues with these steps, run_plfs_regression.sh will exit the regression run and send an email notification to addresses specified in the config file. If all needed packages are present, this script will call submit_tests.py to submit tests and then pass control to check_tests.py. - submit_tests.py: python script that will submit tests that are located in the tests directory. It will use a control file to know which tests to submit. Submit_tests.py file is where we specify how many different types of tests can be run. To add more types of tests, edit this file in the main and parse_args methods. - check_tests.py: python script that waits for tests to finish and then report on results. It is designed to be run on it's own. That is, run_plfs_regression.sh passes control to check_tests.py and then exits. Check_tests.py expects to finish up the regression on its own. It can also be started at any time to re-check tests. It can be instructed to wait for jobs to complete or immediately check the results of tests. At this time, check_tests.py requires a file that lists what tests were submitted. See below for more information. - restart_check_tests.sh: Script that can be run through cron to find out if check_tests.py should be restarted. Since check_tests.py is suppose to wait for tests to complete, but there is no process to report to (run_plfs_regression.sh has already exited), it is possible that the check_tests.py process dies before the whole regression run is complete. This script is to make sure check_tests.py finishes at some point and sends a report. - plfs_build.sh: build plfs and install it into the installation directory (given in config file). - openmpi_build.sh: build openmpi and install it. This script does not check for the correct version of autotools. Open MPI's autogen.sh checks for this. - fs_test_build.sh: builds fs_test source. - src_cp.sh: copy source from one location into anther. This script can be used to copy any set of code into the regression environment (plfs, fs_test, etc). - src: Directory that will contain all source needed by the regression suite. All code will be copied into this directory in it's own directory. The directory names will be hardcoded into run_plfs_regression.sh so that it can always expect to find code in certain places. - inst: Directory that will be used to install packages to. Each package will install to its own directory within inst. This is where tests should expect to find all binaries and libraries that the regression suite provides. NOTE: if already existing executables and libraries are given to the regression suite, then this directory or its sub-directories will contain symbolic links to the locations given in the configuration. This is to ensure that tests always know where to find dependencies. - logs: Directory that will be used to hold log files. - tests: Directory that will contain individual test directories. Each directory is considered a test. Lists of tests to run are kept in text files in the tests directory. - A lock file will be created by run_plfs_regression.sh and can be removed by run_plfs_regression.sh and check_results.py. It will also house job ids for submitted jobs that can be checked by check_results.py. The actual file name is set in the config file. NOTE: submit_tests.py will also create this file if it doesn't exist, but this functionality is there only to allow submit_tests.py to be run on its own outside of the other regression scripts. A message is printed out if submit_tests.py creates the file so that the behaviour can be tracked. - Check_tests.py requires the use of a file that contains a python dictionary detailing what tests were submitted by submit_tests.py. The name of this file can be specified in the config file. This dictionary will have information on tests that were successfully submitted as well as tests that were not successfully submitted. All of this information is necessary in order for check_tests.py to create a suitable summary. Notes - The regression suite will rely on the MPI implementation's wrapper scripts (such as mpicc, mpif90, etc.). This way the regression suite doesn't have to worry about MPI library names and getting the linking right. As at least one MPI implementation doesn't make sure its own MPI libraries are properly linked into the executable (no use of rpath), the regression suite will set LD_LIBRARY_PATH to include a path to the MPI libraries. - For fs_test compilation, the regression suite will over-write MPI_LD and MPI_INC to have the proper flags to compile fs_test against the regression suite's plfs and mpi. This change is not permanent to the user's environment, just the environment set up by the regression suite. These environment variables should be available if individual tests need to compile supporting programs. - It is possible for the regression suite to get itself into a state where it cannot run. This is usually due to not being able to work with the needed jobs id file and the python dictionary file of what jobs to check. These are specified in the config file as id_file and dict_file, respectively. When these files can not be found when needed or are not usable, the regression suite will create a DO_NOT_RUN_REGRESSION. The presence of this file will keep future instances of the regression suite from running. At the time that this file is created, and error should be printed out as to what went wrong with what file. If email is enabled, an email with the same information should be sent. It is expected that user intervention is required to get the regression suite back into working order. Issues/To do: - Check_tests.py for now requires a file that contains a python dictionary of the tests that submit_tests.py submitted. This was done so that information that submit_tests.py finds in submitting tests can be passed to check_tests.py (basically submitted vs. unable to submit at this time). However, this means that it is not quite simple to just rerun check_tests.py after a regression run is completed since the dictionary file is removed. Is this important? Maybe could have submit_tests.py leave that additional information in a file inside the test's directory, but the test does nothing with it (it would be a regression suite file, not a test file). This way, check_tests.py could go through the same control file that submit_tests.py went through to figure out what tests to check, and look for that file and act accordingly. However, what if the control file changes? This is probably the original reason why the dictionary file method is used. Maybe add an option to check_tests.py to not remove the dictionary file? Or better yet, to not remove it at all from check_tests.py and have submit_tests.py rewrite it each regression instance. Will this bring about confusion about what regression instance spawned the file? - This regression suite doesn't run on the Cray XE6 machines (possibly any Cray). There are two reasons for this: 1) The regression suite relies on a ssh approach when mounting the needed PLFS mount points; ssh to the Cray compute nodes is not possible at this time. All interaction with the compute nodes has to happen through aprun. 2) Given that we have to use aprun to interact with the compute nodes, it would be nice if we could mount PLFS using aprun. However, when aprun exits, all user-generated processes are killed which includes the PLFS daemon. This brings about the situation where fuse thinks PLFS is still running but it really isn't. In order to make the mount point usable again, it has to be unmounted through aprun. Basically, anything that has to be done to a PLFS mount point that the regression suite mounts has to be done within a single aprun command. Given the above reasons, there are no plans to implement the regression suite on the Cray machines. In order to do this, tests would have to be completely rewritten to remove all parallel run commands (aprun, mpirun, etc.) from the scripts that get submitted and that script would be run via a single parallel run command. Given that a lot of the generated scripts are in a shell language and shells have no concept of MPI, it would be impossible to implement some tests in this fashion unless a different language is used. For example, write_read_error relies on a single rank running a program that will change a single byte in a PLFS file. If the generated script were run on all nodes, every rank would try to modify the file since the generated script is in bash. The only way to solve this would be to change the language of the generated script to one that is MPI-aware.