*Lltop*

Lltop[0] is a command line utility which gathers I/O statistics from
Lustre[1] filesystem servers, along with job assignment data from
cluster batch schedulers, to give a job-by-job accounting of
filesystem load.  Under typical usage, lltop is invoked with the name
of a filesystem, runs for a configurable interval (say, 10 seconds),
and outputs a table summarizing I/O and RPC loads indexed by job
identifier.  For example:

  $ lltop work
  JOB    WR_MB  RD_MB   REQS    OWNER     WORKDIR
  12101  15925  67630   133694  jfourier  /work/jfourier/fftw_run
  10322   2254   1027     2504  claude    /work/claude/viscous-flow-08
  13007    756  21024    10007  ludwig    /work/ludwig/boltzeq.mvapich2
  ...

Normally, lltop is run in response to observations of excessive load
on file servers or degraded filesystem performance, and is used to
assist system administrators in identifying jobs (and users) with
problematic I/O patterns.  A potential secondary use is to determine
the I/O profiles of applications running at scale.  Lltop is designed
to run as a point-and-shoot diagnostic utility, and is not a
replacement for continuous monitoring tools such as LMT[2] or
Collectl[3].

*Overview*

Lltop has two executable components: lltop itself, and lltop-serv.
Lltop is usually run directly and given the name of a filesystem to
query.  From the filesystem name, it derives a list of servers (MDSs
and OSSs), and for each it forks and execs ssh to run a copy of
lltop-serv on the server.

On the server, lltop-serv scrapes the per-client stats files

  /proc/fs/lustre/{mds,obdfilter}/<target>/exports/<client>/stats

to determine each client's load in terms of bytes written, bytes
read, and requests processed.  It actually makes two passes through
the stats files[4], sleeping for a configurable interval in between,
and returns the differences.  (A rough shell rendering of a single
pass appears at the end of this section.)  The output of lltop-serv
consists of lines[5] of the form

  <ipv4-addr>@<lnet-net-name> <wr_B> <rd_B> <reqs>

where <ipv4-addr>@<lnet-net-name> is the client address according to
Lustre (for example 192.0.32.10@tcp), <wr_B> and <rd_B> are the
numbers of bytes written and read, and <reqs> is the number of
requests other than pings[6].

Lltop reads this output and translates client addresses to hostnames,
and hostnames to jobids[7, 8], to account for each client's load
against its current job.  If lltop cannot find a job assignment for a
given client, then it considers the client to be the sole member of a
job whose jobid is the client's hostname.  Similarly, if lltop cannot
find a hostname for a given client IP address, it uses the address as
both the client's hostname and its current jobid.  This allows load
generated by login or admin nodes to be handled in the same way as
load from compute nodes.  (This lookup cascade is also sketched in
shell below.)
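To make the scraping concrete, here is a rough shell rendering of a
single lltop-serv pass over one OSS.  This is only an illustrative
sketch, not how lltop-serv is implemented (it is a C program): it
assumes the /proc layout named above, and it assumes the cumulative
byte count is the seventh ("sum") field of the write_bytes line,
which may vary with your Lustre version:

  #!/bin/sh
  # For each per-client export on this OSS, print the client NID and
  # the cumulative number of bytes written so far.
  for stats in /proc/fs/lustre/obdfilter/*/exports/*/stats; do
      client=$(basename "$(dirname "$stats")")
      awk -v c="$client" '$1 == "write_bytes" { print c, $7 }' "$stats"
  done

Lltop-serv takes two such samples, separated by the interval, and
reports the per-client differences.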
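The lookup cascade, with its fallbacks, amounts to roughly the
following, where getent stands in for the getnameinfo() call that
lltop actually makes, and get_job_for_host stands in for whichever
job lookup method is configured (see the next section):

  addr=${nid%@*}                        # strip the @<lnet-net-name> suffix
  host=$(getent hosts "$addr" | awk '{ print $2; exit }')
  [ -n "$host" ] || host=$addr          # no hostname: use the address itself
  job=$(get_job_for_host "$host")
  [ -n "$job" ] || job=$host            # no job: the host is its own job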
*Configuring lltop*

To get lltop to work on your site you will probably need to override
some of the default configuration.  Most of this can be accomplished
through command line options, but the source is organized so that the
same effects (and more) can be achieved by modifying the functions in
hooks.c.  Here are the main things you may need to do, along with
some suggestions.

1. Tell lltop on which servers it should run lltop-serv.  You have
   three options:

   a. Modify the function get_serv_list() in hooks.c, so that lltop
      may be invoked with the filesystem name as an argument.

   b. Use the -l (--server-list) option to specify a list of servers
      directly:

        lltop -l mds1.example.com oss{01..27}.example.com

   c. Provided that FILESYSTEM is mounted on the current host, use
      some crazy pipeline, like:

        sed 's/@.*$//' /proc/fs/lustre/{mdc,osc}/FILESYSTEM-*/*_conn_uuid \
          | sort | uniq | xargs lltop -l

2. Tell lltop how to translate Lustre client addresses (usually
   dotted quads with the @<lnet-net-name> stripped off) to hostnames.
   How well does reverse DNS work at your site?  If the answer is
   "Uhhh, not real well," or if you have some weird LNET with a weird
   address format like qswlnd, whatever that is, then keep reading;
   otherwise skip to 3.

   The default address-to-host lookup uses getnameinfo(), which
   should work fine given a correct site configuration.  If not, here
   are three possibilities:

   a. Using getnameinfo_get_host() as a template, add the function
      my_site_get_host() to hooks.c and tell lltop to use it.

   b. Use the -g (--get-host) option to specify an external command
      which should take the address as its only argument and print a
      hostname.  If it succeeds, your external command should return
      0; otherwise lltop will treat the dotted quad as if it were the
      client's hostname.  (A sketch of such a command appears after
      this list.)

   c. Fix /etc/hosts, /etc/nsswitch.conf, /etc/resolv.conf, ..., so
      that getnameinfo() works on the host where you run lltop.

3. Tell lltop how to look up the current job for a host.  Lltop was
   originally written for TACC Ranger, which uses SGE for batch
   scheduling.  Under that setup the JOBID of the current job on HOST
   is determined from the existence of a file

     /share/sge6.2/execd_spool/HOST/active_jobs/JOBID.*

   This is the default method in lltop.  Otherwise:

   a. If you run SGE but you need to override the execd_spool path,
      then do so by modifying hooks.c or by passing
      --execd-spool=PATH.

   b. Using execd_spool_get_job() as a template, add the function
      my_site_get_job() to hooks.c and tell lltop to use it.

   c. Use the -j (--get-job) option to specify an external command to
      do job lookup.  It should function like the external host
      lookup command described above.  (A sketch for a non-SGE
      scheduler appears after this list.)

   d. Use the -m (--job-map) option to specify an external command
      which produces a "job map."  This is useful if you use
      something like qhost for job lookup, since calling
      'qhost -j -h <host>' to get the current job of a single host
      takes about the same time as calling 'qhost -j' to get the
      current jobs of all nodes at once.  See the attached script
      qhost_job_map.
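As an example of 2b, an external host lookup command can be as simple
as a wrapper around getent, which consults /etc/hosts and DNS exactly
as configured in nsswitch.conf.  The script name here is
hypothetical; call it whatever you like:

  #!/bin/sh
  # my-get-host: print the hostname for the address given as $1,
  # or exit nonzero so that lltop falls back to using the address.
  host=$(getent hosts "$1" | awk '{ print $2; exit }')
  [ -n "$host" ] || exit 1
  echo "$host"

Then run, for example:

  lltop --get-host=./my-get-host work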
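Similarly for 3c, if your site runs SLURM rather than SGE, a minimal
external job lookup command might look like the sketch below.  It
leans on squeue's --nodelist and --format options and prints only the
first jobid found on the host; if nothing is running there, it exits
nonzero and lltop falls back to using the hostname as the jobid:

  #!/bin/sh
  # my-get-job: print the jobid of the job running on host $1.
  # (Hypothetical script; prints at most one jobid per host, since
  # lltop does not handle multiple concurrent jobs per host -- see
  # note 8.)
  job=$(squeue --noheader --format=%i --nodelist="$1" 2>/dev/null | head -n 1)
  [ -n "$job" ] || exit 1
  echo "$job"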
*Installing lltop*

Run make, put lltop somewhere in your path on an admin node, and put
lltop-serv somewhere in your path on the Lustre servers.  Also see
the included script tacc_lltop, which we use to add job owner and
workdir to the output of lltop.

*Getting Help*

  $ lltop --help
  Usage: lltop [OPTION]... FILESYSTEM
    or:  lltop [OPTION]... -l SERVER...
  Report load by job for Lustre FILESYSTEM or SERVER(s).

  Mandatory arguments to long options are mandatory for short options too.
    -f, --fqdn                 use fully qualified domain names for clients
    -g, --get-host=COMMAND     use COMMAND for reverse DNS lookups
    -h, --help                 display this help and exit
    -i, --interval=NUMBER      report load over NUMBER seconds
    -j, --get-job=COMMAND      use COMMAND for job lookup
    -l, --server-list          report load on servers given as arguments
    -m, --job-map=COMMAND      use COMMAND to get job map
    -n, --limit=NUMBER         limit output to NUMBER jobs
        --no-header            do not display header
        --lltop-serv=PATH      use lltop-serv at PATH on servers
        --remote-shell=PATH    use remote shell at PATH to execute lltop-serv
        --execd-spool=PATH     use execd_spool directory PATH for job lookup

Lltop GitHub repository: <https://github.com/jhammond/lltop>

Otherwise, please send me any comments, questions, or improvements.
I am especially interested in receiving/including any code/scripts to
do job lookup for batch schedulers other than SGE.  Please put lltop
in the subject line.

John L. Hammond
TACC, The University of Texas at Austin
<jhammond@tacc.utexas.edu>

--

0. lltop is a recursive anagram of lltop.

1. According to the headers, Lustre is a trademark of Sun
   Microsystems.

2. Lustre Monitoring Tool: http://code.google.com/p/lmt/

3. Collectl: http://collectl.sourceforge.net/

4. Note that lltop-serv does not clear the stats files.  In fact,
   clearing the stats files while lltop-serv is running may cause it
   to misreport or underreport usage.  Client evictions can also
   affect the accuracy of the data returned, but lltop-serv does use
   some simple heuristics to mitigate their effects.  However, it
   should be remembered that lltop is not an exact tool and should be
   used with judgement.

5. As an optimization, if a client fails to generate any load during
   the interval, then lltop-serv omits that client from its output.

6. Lltop-serv does not count pings because doing so tends to distort
   the statistics for large jobs.

7. Lltop keeps a cache of address-to-jobid mappings so that the
   hostname and jobid lookups are done at most once per client.

8. If your site runs multiple concurrent jobs on single hosts then it
   may be hard to adapt lltop.  I welcome suggestions on how to
   handle this case.