Lustre load monitor with batch scheduler integration

Primary language: C.  License: GNU General Public License v2.0 (GPL-2.0).

                               *Lltop*

Lltop[0] is a command line utility which gathers I/O statistics from
Lustre[1] filesystem servers, along with job assignment data from
cluster batch schedulers, to give a job-by-job accounting of
filesystem load.  Under typical usage, lltop is invoked with the name
of a filesystem, runs for a configurable interval (say, 10 seconds),
and outputs a table summarizing I/O and RPC loads indexed by job
identifier; for example:

  $ lltop work
  JOB      WR_MB    RD_MB    REQS      OWNER  WORKDIR
  12101    15925    67630  133694   jfourier  /work/jfourier/fftw_run
  10322     2254     1027    2504     claude  /work/claude/viscous-flow-08
  13007      756    21024   10007     ludwig  /work/ludwig/boltzeq.mvapich2
  ...

Normally, lltop is run in response to observations of excessive load
on file servers or degraded filesystem performance, and is used to
assist system administrators in identifying jobs (and users) with
problematic I/O patterns.  A potential secondary use is to determine
the I/O profiles of applications running at scale.  lltop is designed
to run as a point-and-shoot diagnostic utility, and is not a
replacement for continuous monitoring tools such as LMT[2] or
Collectl[3].

                              *Overview*

Lltop has two executable components: lltop itself and lltop-serv.
lltop is usually run directly and given the name of a filesystem to
query.  From the filesystem name, it derives a list of servers (MDSs
and OSSs), and for each it forks and execs ssh to run a copy of
lltop-serv on the server.

On the server, lltop-serv scrapes the per-client stats files

  /proc/fs/lustre/{mds,obdfilter}/<target>/exports/<client>/stats

to determine each client's load in terms of bytes written, bytes read,
and requests processed.  It actually makes two passes through the
stats files[4], sleeping for a configurable interval between, and
returns the differences.  The output of lltop-serv consists of lines
[5] of the form

  <ipv4-addr>@<lnet-net-name> <wr_B> <rd_B> <reqs>

where

  <ipv4-addr>@<lnet-net-name> is the client address according to Lustre,
  for example 192.0.32.10@tcp,
  <wr_B> and <rd_B> are the number of bytes written and read,
  <reqs> is the number of requests other than pings[6].
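
As an illustration of the line format above, here is a minimal sketch
of a parser for one line of lltop-serv output.  The struct client_load
type and the function parse_serv_line() are hypothetical names made up
for this example; they are not taken from the lltop source, and the
buffer sizes are arbitrary.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical record type; not taken from the lltop source. */
struct client_load {
    char addr[64];   /* client address, without the @<lnet-net-name> */
    char net[32];    /* LNET network name, e.g. "tcp" */
    uint64_t wr_B;   /* bytes written during the interval */
    uint64_t rd_B;   /* bytes read during the interval */
    uint64_t reqs;   /* requests processed during the interval */
};

/* Parse one line of the form
 *   <ipv4-addr>@<lnet-net-name> <wr_B> <rd_B> <reqs>
 * Returns 0 on success, -1 if the line does not have that layout. */
int parse_serv_line(const char *line, struct client_load *cl)
{
    int n = sscanf(line,
                   "%63[^@ \t\n]@%31s %" SCNu64 " %" SCNu64 " %" SCNu64,
                   cl->addr, cl->net, &cl->wr_B, &cl->rd_B, &cl->reqs);
    return n == 5 ? 0 : -1;
}
```

Given the line "192.0.32.10@tcp 1048576 2097152 37", the sketch fills
addr with "192.0.32.10" and net with "tcp", and stores the three
counters as unsigned 64-bit integers.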

Lltop reads this output and translates client addresses to hostnames,
and hostnames to jobids[7, 8], to account for each client's load against
its current job.  If lltop cannot find a job assignment for a given
client, then it considers the client to be the sole member of a job
whose jobid is the client's hostname.  Similarly, if lltop cannot find
a hostname for a given client IP address, it uses the address as the
client's name and current jobid.  This allows us to handle load
generated by login or admin nodes in the same way.
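
The fallback rules above can be sketched roughly as follows.  This is
only an illustration: the lookup_fn callback type, resolve_job(), and
the demo_* stand-in lookups (with their made-up hostnames) are
assumptions for this example, and the real hooks in hooks.c may have
different signatures.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup callback: returns 0 and fills buf on success,
 * non-zero on failure.  lltop's real hook signatures may differ. */
typedef int (*lookup_fn)(const char *key, char *buf, size_t len);

/* Account a client address against a job:
 *   no hostname found -> the address is used as both name and jobid;
 *   no job found      -> the hostname is used as the jobid. */
void resolve_job(const char *addr, lookup_fn get_host, lookup_fn get_job,
                 char *host, size_t host_len, char *job, size_t job_len)
{
    if (get_host(addr, host, host_len) != 0) {
        snprintf(host, host_len, "%s", addr);
        snprintf(job, job_len, "%s", addr);
        return;
    }
    if (get_job(host, job, job_len) != 0)
        snprintf(job, job_len, "%s", host);
}

/* Stand-in lookups with made-up data, for illustration only. */
int demo_get_host(const char *addr, char *buf, size_t len)
{
    if (strcmp(addr, "192.0.32.10") == 0)
        return snprintf(buf, len, "c101-001") < 0;
    if (strcmp(addr, "192.0.32.11") == 0)
        return snprintf(buf, len, "login1") < 0;
    return -1;
}

int demo_get_job(const char *host, char *buf, size_t len)
{
    if (strcmp(host, "c101-001") == 0)
        return snprintf(buf, len, "12101") < 0;
    return -1;
}
```

With these stand-ins, a compute node resolves to its jobid, the login
node (which has no job) is charged to a job named after its hostname,
and an address with no hostname is charged to a job named after the
address itself.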

                         *Configuring lltop*

To get lltop to work on your site you probably need to override some
of the default configuration.  Most of this can be accomplished
through command line options, but the source is organized so that the
same effects (and more) can be achieved by modifying the functions in
hooks.c.  Here are the main things you may need to do, along with some
suggestions.

1. Tell lltop on which servers it should run lltop-serv.  You have
three options:

  a. Modify the function get_serv_list() in hooks.c, so that lltop may
  be invoked with the filesystem name as an argument.

  b. Use the -l (--server-list) option to specify a list of servers
  directly:

    lltop -l mds1.example.com oss{01..27}.example.com

  c. Provided that FILESYSTEM is mounted on the current host, use some
  crazy pipeline, like:

    sed 's/@.*$//' /proc/fs/lustre/{mdc,osc}/FILESYSTEM-*/*_conn_uuid | sort | uniq | xargs lltop -l

2. Tell lltop how to translate Lustre client addresses (usually dotted
quads with the @<lnet-net-name> stripped off) to hostnames.  How well
does reverse DNS work at your site?  If the answer is "Uhhh, not real
well.", or if you have some weird LNET with a weird address format
like qswlnd, whatever that is, then keep reading, otherwise skip to 3.
The default address to host lookup uses getnameinfo(), which should
work fine given a correct site config.  If not, here are three
possibilities:

  a. Using getnameinfo_get_host() as a template, add the function
  my_site_get_host() to hooks.c and tell lltop to use it.

  b. Use the -g (--get-host) option to specify an external command
  which should take the address as its only argument and print a
  hostname.  If it succeeds, your external command should return 0,
  otherwise lltop will treat the dotted quad as if it is the client's
  hostname.

  c. Fix /etc/hosts, /etc/nsswitch.conf, /etc/resolv.conf,..., so
  that getnameinfo() works on the host where you run lltop.

3. Tell lltop how to look up the current job for a host.  Lltop was
originally written for TACC Ranger which uses SGE for batch
scheduling.  Under that setup the JOBID of the current job on HOST is
determined from the existence of a file

  /share/sge6.2/execd_spool/HOST/active_jobs/JOBID.*

This is the default method in lltop.  Otherwise:

  a. If you run SGE but you need to override the execd_spool path then
  do so by modifying hooks.c or passing --execd-spool=PATH.

  b. Using execd_spool_get_job() as a template, add the function
  my_site_get_job() to hooks.c and tell lltop to use it.

  c. Use the -j (--get-job) option to specify an external command to
  do job lookup.  It should function like the external host lookup
  command described above.

  d. Use the -m (--job-map) option to specify an external command
  which produces a "job map."  This is useful if you use something
  like qhost for job lookup, since using 'qhost -j -h <host>' to get
  the current job of a single host takes about the same time as calling
  'qhost -j' to get the current job of all nodes at once.  See the
  attached script qhost_job_map.

                          *Installing lltop*

Run make, put lltop somewhere in your path on an admin node, and put
lltop-serv somewhere in your path on the Lustre servers.  Also see the
included script tacc_lltop which we use to add job owner and workdir
to the output of lltop.

                            *Getting Help*

$ lltop --help
Usage: lltop [OPTION]... FILESYSTEM
  or:  lltop [OPTION]... -l SERVER...
Report load by job for Lustre FILESYSTEM or SERVER(s).

Mandatory arguments to long options are mandatory for short options too.
  -f, --fqdn               use fully qualified domain names for clients
  -g, --get-host=COMMAND   use COMMAND for reverse DNS lookups
  -h, --help               display this help and exit
  -i, --interval=NUMBER    report load over NUMBER seconds
  -j, --get-job=COMMAND    use COMMAND for job lookup
  -l, --server-list        report load on servers given as arguments
  -m, --job-map=COMMAND    use COMMAND to get job map
  -n, --limit=NUMBER       limit output to NUMBER jobs
      --no-header          do not display header
      --lltop-serv=PATH    use lltop-serv at PATH on servers
      --remote-shell=PATH  use remote shell at PATH to execute lltop-serv
      --execd-spool=PATH   use execd_spool directory PATH for job lookup

lltop GitHub repository: <https://github.com/jhammond/lltop>

Otherwise, please send me any comments, questions, or improvements.  I am
especially interested in receiving/including any code/scripts to do
job lookup for batch schedulers other than SGE.  Please, put lltop in
the subject line.

John L. Hammond
TACC, The University of Texas at Austin
<jhammond@tacc.utexas.edu>

--

0. lltop is a recursive anagram of lltop.

1. According to the headers, Lustre is a trademark of Sun
Microsystems.

2. Lustre Monitoring Tool: http://code.google.com/p/lmt/

3. Collectl: http://collectl.sourceforge.net/

4. Note that lltop-serv does not clear the stats files.  In fact,
clearing stats files while lltop-serv is running may cause it to
misreport or underreport usage.  Client evictions can also affect the
accuracy of the data returned, but lltop-serv does use some simple
heuristics to mitigate their effects.  However, it should be remembered
that lltop is not an exact tool and should be used with judgement.

5. As an optimization, if a client fails to generate any load during
the interval, then lltop-serv omits that client from its output.

6. Lltop-serv does not count pings because doing so tends to distort
the statistics for large jobs.

7. Lltop keeps a cache of address to jobid mappings so that the
hostname and jobid lookups are done at most once per client.

8. If your site runs multiple concurrent jobs on single hosts then it
may be hard to adapt lltop.  I welcome suggestions on how to handle
this case.