/plfs-regression

LANL no longer develops PLFS. Feel free to fork and develop as you wish.

Primary LanguagePython

This is the regression test suite for PLFS.

Description:
The regression suite will pull in needed sources into its environment and
run a series of tests on that configuration. It is designed to be self-
reliant; it should not call on anything outside its directory structure to
accomplish anything when running tests. This way the regression
environment is easily considered by looking at what is currently in the 
regression directory structure and log files.

A regression run has four phases:
- build phase: all dependencies are retrieved and built, if neccessary.
- configuration and directory check: check required configuration files
  of all packages and that the plfs-related directories can be created.
- test submittal: tests are submitted. If no tests are submitted (either by
  configuration or by errors), this phase becomes the final phase for a
  regression run. If specified, a summary email will be sent.
- test checking: if at least one test is run, this last stage is entered.
  Each test that was run will have it's results checked and logged. If
  specified, a summary email will be sent.

Information on writing tests for the regression suite is located in:
tests/README.

Dependencies:
- What ever compiler that is desired should be setup either through login
  scripts or in the regression suite's config file (see below).
- The following packages/programs are required:
  - Python >= 2.5
  - PLFS: can be either in the form of source or an already built version.
    PLFS source is used in at least one test so is required even when
    providing an already built version of PLFS. When given source, the
    regression suite will build PLFS.
  - MPI (for running parallel tests): can be either in the form of a source
    tarball or an already patched for PLFS MPI installation. If the tarball is
    given, MPI will be built.
    NOTE: if MPI is to be built, make sure the correct version of autotools
    (m4, autoconf, automake, libtool) are already in your PATH and
    LD_LIBRARY_PATH. Otherwise, the building of MPI may fail.
    NOTE: at this time, only Open MPI tarballs are supported.
  - fs_test (for running specific IO patterns and many tests use it): can be
    in the form of source or an already compiled version. It is available in
    the fs-test project (https://github.com/fs-test). If source is given,
    fs_test will be compiled against the regression suite's PLFS and MPI.
  - experiment_management (for creating test commands tailored to the system
    being run on): available in the fs-test project
    (https://github.com/fs-test). This package does not require complilation.
  - job scheduler (for submitting and tracking parallel tests): the regression
    suite relies on the command 'checkjob' to determine if submitted jobs have
    completed or not. Additionally, the following line is expected to be in
    checkjob's output:
      Status <status>
    The regression suite will look for the following values for <status>:
    Completed or Removed. If the job scheduler on the system does not provide
    checkjob with that format, the regression suite will be unable to determine
    when jobs complete and will error out. It is known that the MOAB job
    scheduler provides this functionality.
- A working ~/.plfsrc file consistent with the version of plfs that is to be
  tested. It is not necessary that all of the directories for the PLFS mount
  points and backends exist. The regression suite will ensure that these
  directories exist and will attempt to create them if they are missing. The
  backends should be located on a file system space that will be shared
  between all nodes (compute nodes as well as the node that will be running
  run_plfs_regression.sh). Make sure that the user has write permissions for
  all paths. Using mount_points and backends that the user does not have
  permissions to create but does have write permissions on is permissible only
  if all of the required directories are already created. In this way, a
  system configuration of PLFS can be tested. Tests within the regression suite
  will use these mount points, experiment_management input (see below) and
  run-time information to construct file targets that will be used during the
  test. Without the applicable experiment_management setting, targets will be
  placed directly into the mount point.
  Note about the check for directories and plfs rc files:
  The checks for plfs related directores and rc files happen after the build
  phase. If build phase was successfull but any of the checks fail, it is not
  necessary to rerun the build phase. Instead, use run_plfs_regression.sh's
  '--nobuild' option to skip the build phase. The checks will still be done
  before the test submittal phase.
  Fuse tests that run on compute nodes will make sure the mount_point
  specified in the plfsrc is present. This way it is possible to use a
  location such as /var/tmp, but users will not have to log in to each compute
  node individually to make sure the mount point exists. This checking for the
  mount_point's existance is actually done on any fuse test regardless of
  where the test runs if the test was written correctly.
- In regards to experiment_management, an rc file must be present and define
  experiment_management's runcommand, ppn, and outdir as individual tests will
  not pass these on the command line when invoking experiment_management. In
  addition, msub should be defined to include "-j oe" as many of the tests
  expect to look for only one output file.
  Note: outdir should be a relative path for now. It is expected that this
  directory will be placed in the test's directory, and absolute paths don't
  guarantee this.
  Note about the check for experiment_management rc files:
  Like the check for plfs rc files, the check for experiment_management rc
  files is done after the build stage. If the build phase was successfull,
  but the check on experiment_management rc files fails, it is not necessary
  to rerun the build phase; use '--nobuild' to skip the build stage. The test of
  the rc files will be done before the test submittal stage.
- A config file in the Regression directory. A sample is provided as
  config.sample. Copy this file as config (in the same directory) and edit
  config accordingly. All information related to specifying where to find
  dependencies will be entered in that file. It is heavily commented in an
  attempt to explain all the different options available.
- MY_MPI_HOST environment variable set as in the experiment_management
  framework. This variable needs to be set in such a way that it is defined
  on compute nodes. Set it in log-in scripts or using the env_customize.sh
  script (covered in the Quick Start section).
- Define MPI_CC environment variable as the command to compile parallel
  programs. For example, set it to mpicc (generic mpi implementation) or cc
  (cray implementation).
- If email capabilities are needed/wanted, /bin/mail needs to be available.
- If Open MPI is to be built by the regression suite, a platform file must
  be used when compiling Open MPI to make it aware of the locations of the
  PLFS library. A sample platform file is provided as
  openmpi_platform_file.sample. It contains the minimum settings necessary
  to make Open MPI aware of the PLFS library. It can be copied to
  openmpi_platform_file and modified, but the REPLACE_PLFS_* characters must
  remain so that the regression suite can accurately generate a usable
  platform file. The regression suite will first check for the existance of
  openmpi_platform_file. If it is not present, the regression suite will
  check for openmpi_platform_file.sample. This way, if a custom platform file
  is not needed, users can do nothing and the sample will be used.
- If putting test target files directly into the PLFS mount point is not
  desired, specify rs_mnt_append_path in experiment_management's rc file.
  Assign it a path relative to the mount point. Tests will append this path to
  the path of the mount point in constructing paths for test file targets:
  /<mount_point>/<rs_mnt_append_path>/<test file target>
  The regression suite will make sure the directory specified by
  rs_mnt_append_path exists on all backends which will ensure that it exists
  as /<mount_point>/<rs_mnt_append_path>.

Usage:
./run_plfs_regression.sh [OPTIONS ...]

Running run_plfs_regression.sh without options will start a regression run
with a configuration consistent with what is set up in the config file and
default values. Please see config.sample for descptions of all of the options.
Use run_plfs_regression.sh's '-h|--help' switch to list the options that can be
used and their effect on the parameters in the config file. Parameters set on
the command line over-ride those set in the config file.

Quick Start/Sanity Check:
1) Start run_plfs_regression.sh with the --notests command line switch after
   setting up the config file. Deal with any issues in building. Logs are
   located in the logs directory.
2) Run a simple test: tests/write_read_no_error or tests/adio_write_read.
   Both of these tests are simple write and read tests.
   Change directory into either of the above and run ./reg_test.py. A single
   job should be submitted. When the job has finished, run ./check_results.py.
   Deal with any issues in getting the test to run; this may require doing
   step 1 again.
   If there are environment issues, consider creating
   tests/utils/env_customize.sh to deal with them. This file will be sourced
   by tests/ustils/rs_env_init.sh, which is the main script that makes sure
   the environment is set up to use the regression suite's executables and
   libraries. /tests/utils/env_customize.sh will be sourced by
   run_plfs_regression.sh and by tests running on compute nodes, so it may be
   necessary to put conditionals in env_customize.sh to make sure the proper
   hosts are all getting the right environment. Env_customize.sh must be
   written in bash or sh.
   As an example, a library is found by default on a machine's frontend, but
   not on the compute nodes. Env_customize.sh could modify LD_LIBRARY_PATH
   based on compute nodes names such that the library will be found when it
   comes time to run binaries on the compute nodes.
3) If the above worked, it should be ok to run the whole regression suite
   (e.g. run_plfs_regression.sh --nobuild --testtypes=2,3)

Files and directories in the regression suite:
- There should only ever be two entry points into the regression suite:
  run_plfs_regression.sh and check_tests.py. As such, there are only two exit
  points for the regression suite: from those same files. However, it should be
  noted that the regression scripts have been built with the idea of modularity
  in mind. That is, it should be possible to run each script separately if need
  be.
- config.sample: sample config file for configuration. A similar file should be
  in the Regression directory named config.
- run_plfs_regression.sh: The main regression script. Will build the
  needed source code or link to already existing binaries and libraries. If
  there are any issues with these steps, run_plfs_regression.sh will exit the
  regression run and send an email notification to addresses specified in the
  config file. If all needed packages are present, this script will call
  submit_tests.py to submit tests and then pass control to check_tests.py.
- submit_tests.py: python script that will submit tests that are located in the
  tests directory. It will use a control file to know which tests to submit.
  Submit_tests.py file is where we specify how many different types of tests can
  be run. To add more types of tests, edit this file in the main and parse_args
  methods.
- check_tests.py: python script that waits for tests to finish and then report
  on results. It is designed to be run on it's own. That is,
  run_plfs_regression.sh passes control to check_tests.py and then exits.
  Check_tests.py expects to finish up the regression on its own. It can also
  be started at any time to re-check tests. It can be instructed to wait for
  jobs to complete or immediately check the results of tests. At this time,
  check_tests.py requires a file that lists what tests were submitted. See
  below for more information.
- restart_check_tests.sh: Script that can be run through cron to find out if
  check_tests.py should be restarted. Since check_tests.py is suppose to wait
  for tests to complete, but there is no process to report to 
  (run_plfs_regression.sh has already exited), it is possible that the
  check_tests.py process dies before the whole regression run is complete.
  This script is to make sure check_tests.py finishes at some point and sends
  a report.
- plfs_build.sh: build plfs and install it into the installation directory (given
  in config file).
- openmpi_build.sh: build openmpi and install it. This script does not check
  for the correct version of autotools. Open MPI's autogen.sh checks for this.
- fs_test_build.sh: builds fs_test source.
- src_cp.sh: copy source from one location into anther. This script can be used
  to copy any set of code into the regression environment (plfs, fs_test, etc).
- src: Directory that will contain all source needed by the regression suite.
  All code will be copied into this directory in it's own directory. The
  directory names will be hardcoded into run_plfs_regression.sh so that it can
  always expect to find code in certain places.
- inst: Directory that will be used to install packages to. Each package will
  install to its own directory within inst. This is where tests should expect
  to find all binaries and libraries that the regression suite provides.
  NOTE: if already existing executables and libraries are given to the
  regression suite, then this directory or its sub-directories will contain
  symbolic links to the locations given in the configuration. This is to
  ensure that tests always know where to find dependencies.
- logs: Directory that will be used to hold log files.
- tests: Directory that will contain individual test directories. Each
  directory is considered a test. Lists of tests to run are kept in text
  files in the tests directory.
- A lock file will be created by run_plfs_regression.sh and can be removed by
  run_plfs_regression.sh and check_results.py. It will also house job ids for
  submitted jobs that can be checked by check_results.py. The actual file name
  is set in the config file.
  NOTE: submit_tests.py will also create this file if it doesn't exist, but
  this functionality is there only to allow submit_tests.py to be run on its
  own outside of the other regression scripts. A message is printed out if
  submit_tests.py creates the file so that the behaviour can be tracked.
- Check_tests.py requires the use of a file that contains a python dictionary
  detailing what tests were submitted by submit_tests.py. The name of this
  file can be specified in the config file. This dictionary will have
  information on tests that were successfully submitted as well as tests that
  were not successfully submitted. All of this information is necessary in
  order for check_tests.py to create a suitable summary.

Notes
- The regression suite will rely on the MPI implementation's wrapper scripts
  (such as mpicc, mpif90, etc.). This way the regression suite doesn't have to
  worry about MPI library names and getting the linking right. As at least one
  MPI implementation doesn't make sure its own MPI libraries are properly
  linked into the executable (no use of rpath), the regression suite will
  set LD_LIBRARY_PATH to include a path to the MPI libraries.
- For fs_test compilation, the regression suite will over-write MPI_LD and
  MPI_INC to have the proper flags to compile fs_test against the regression
  suite's plfs and mpi. This change is not permanent to the user's
  environment, just the environment set up by the regression suite. These
  environment variables should be available if individual tests need to
  compile supporting programs.
- It is possible for the regression suite to get itself into a state where it
  cannot run. This is usually due to not being able to work with the needed
  jobs id file and the python dictionary file of what jobs to check. These
  are specified in the config file as id_file and dict_file, respectively.
  When these files can not be found when needed or are not usable, the
  regression suite will create a DO_NOT_RUN_REGRESSION. The presence of this
  file will keep future instances of the regression suite from running. At the
  time that this file is created, and error should be printed out as to what
  went wrong with what file. If email is enabled, an email with the same
  information should be sent. It is expected that user intervention is
  required to get the regression suite back into working order.

Issues/To do:
- Check_tests.py for now requires a file that contains a python dictionary of
  the tests that submit_tests.py submitted. This was done so that information
  that submit_tests.py finds in submitting tests can be passed to
  check_tests.py (basically submitted vs. unable to submit at this time).
  However, this means that it is not quite simple to just rerun
  check_tests.py after a regression run is completed since the dictionary file
  is removed. Is this important? Maybe could have submit_tests.py leave
  that additional information in a file inside the test's directory, but the
  test does nothing with it (it would be a regression suite file, not a test
  file). This way, check_tests.py could go through the same control file that
  submit_tests.py went through to figure out what tests to check, and look
  for that file and act accordingly. However, what if the control file
  changes? This is probably the original reason why the dictionary file method
  is used.
  Maybe add an option to check_tests.py to not remove the dictionary file? Or
  better yet, to not remove it at all from check_tests.py and have
  submit_tests.py rewrite it each regression instance. Will this bring about
  confusion about what regression instance spawned the file?

- This regression suite doesn't run on the Cray XE6 machines (possibly any
  Cray). There are two reasons for this:
  1) The regression suite relies on a ssh approach when mounting the needed
     PLFS mount points; ssh to the Cray compute nodes is not possible at this
     time. All interaction with the compute nodes has to happen through aprun.
  2) Given that we have to use aprun to interact with the compute nodes, it
     would be nice if we could mount PLFS using aprun. However, when aprun
     exits, all user-generated processes are killed which includes the PLFS
     daemon. This brings about the situation where fuse thinks PLFS is still
     running but it really isn't. In order to make the mount point usable
     again, it has to be unmounted through aprun. Basically, anything that has
     to be done to a PLFS mount point that the regression suite mounts has to
     be done within a single aprun command.
  Given the above reasons, there are no plans to implement the regression
  suite on the Cray machines. In order to do this, tests would have to be
  completely rewritten to remove all parallel run commands (aprun, mpirun,
  etc.) from the scripts that get submitted and that script would be run
  via a single parallel run command. Given that a lot of the generated
  scripts are in a shell language and shells have no concept of MPI, it
  would be impossible to implement some tests in this fashion unless a
  different language is used. For example, write_read_error relies on a single
  rank running a program that will change a single byte in a PLFS file. If the
  generated script were run on all nodes, every rank would try to modify the
  file since the generated script is in bash. The only way to solve this would
  be to change the language of the generated script to one that is MPI-aware.