lab30
Most of the labs in the DTrace course have a description in lab#.txt and an executable. Read the txt file and run the executable. At the end of each lab txt file, there is a place for "Your response". I'll find a way to get to any/all responses.
These labs should work in both global and nonglobal zones, but not on a VM that does not support DTrace, (e.g., windows, linux bhyve
instances, and other VMs without DTrace support will not work for non-global zone DTrace use).
The labs should work on LX zones, but they'll probably have to be recompiled from source.
Remember that LX system calls are in the lx-syscall
provider, not the syscall
provider.
As part of the training, the instructor will go through some examples that require global zone root access.
The following is a description of lab30 from the DTrace course labs. It is meant as an example of analysis needed to solve the labs. If you don't have the lab files, let me know and I'll send the descriptions and binaries. Most of these labs are simple C programs that are broken in some way. I'll send the source when you've completed the labs.
# cat lab30.txt
Min Prereq: Module 3
Customer description:
Application is taking forever to process work. Why might that be?
Your response:
First, we'll run ./lab30:
# ./lab30
Can't open infile: No such file or directory
So the program doesn't take forever. It fails to work. Let's first see which system calls return errors when lab30 runs. Note that I would normally use truss
for this,
but this is about DTrace...
# dtrace -n 'syscall:::return/pid == $target && errno != 0/{printf("%s failed, errno = %d\n", probefunc, errno);}' -c ./lab30
dtrace: description 'syscall:::return' matched 235 probes
Can't open infile: No such file or directory <-- The error message printed out when running lab30
dtrace: pid 104702 has exited
CPU ID FUNCTION:NAME
0 646 open:return open failed, errno = 2 <-- The open(2) system call failed.
Let's use DTrace to find out what file is trying to be opened.
# cat lab30_open.d
#!/usr/sbin/dtrace -s
syscall::open:entry
/execname == "lab30"/
{
printf("opening %s\n", copyinstr(arg0));
self->inopen = 1;
}
syscall::open:return
/self->inopen/
{
printf("open returns %d, errno = %d\n", arg0, errno);
self->inopen = 0;
}
# chmod +x lab30_open.d
On entry to the system calls, the script prints arg0 (the name of the file being opened). On return, it prints the return value and the errno.
# ./lab30_open.d -c "./lab30"
dtrace: script './lab30_open.d' matched 2 probes
Can't open infile: No such file or directory <-- the error message from the program
dtrace: pid 131712 has exited
CPU ID FUNCTION:NAME
0 645 open:entry opening data100m <-- lab30 tries to open "data100m"
0 646 open:return open returns -1, errno = 2 <-- and fails (grep ENOENT /usr/include/sys/errno.h, or man -s 2 open)
So the program expects a file to exist whose name is data100m. Many of the labs use this file. We'll create it to be 100MB large.
# mkfile 100m data100m
# ls -l data100m
-rw------T 1 root root 104857600 Jun 3 16:12 data100m
#
And run lab30 again. We'll run it in the background (alternatively, we can run it with -c ./lab30
).
# ./lab30 &
[1] 132867
#
First, let's use prstat(1)
to see what is doing.
# prstat -c
Please wait...
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
132867 root 2052K 804K cpu48 1 0 0:00:21 1.1% lab30/1
88 root 0K 0K sleep 98 -20 5:03:21 0.1% zpool-zones/268
126332 root 15M 13M sleep 1 0 0:07:28 0.0% sshd/1
126329 root 13M 11M sleep 1 0 0:03:18 0.0% sshd/1
70 root 2744K 1532K sleep 29 0 0:00:01 0.0% pfexecd/4
7895 noaccess 2872K 1788K sleep 29 0 0:00:06 0.0% mdnsd/1
209 root 28M 24M sleep 29 0 0:00:13 0.0% syseventd/23
21 root 2928K 1628K sleep 29 0 0:00:00 0.0% dlmgmtd/7
7574 root 4012K 716K sleep 1 0 0:00:17 0.0% ipmon/1
18 netadm 4244K 2712K sleep 29 0 0:00:00 0.0% ipmgmtd/3
8018 root 2028K 1360K sleep 59 0 0:00:00 0.0% ttymon/1
9286 root 63M 52M sleep 1 0 0:01:30 0.0% vmadmd/6
7063 root 2548K 1004K sleep 22 0 0:02:32 0.0% lldpd/1
10 root 10M 9112K sleep 29 0 0:02:12 0.0% svc.configd/16
8 root 8528K 5960K sleep 29 0 0:00:07 0.0% svc.startd/15
Total: 74 processes, 868 lwps, load averages: 0.30, 0.11, 0.07
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
132867 root 2052K 804K cpu48 1 0 0:00:26 1.3% lab30/1
88 root 0K 0K sleep 98 -20 5:03:21 0.1% zpool-zones/268
126332 root 15M 13M sleep 1 0 0:07:28 0.0% sshd/1
126329 root 13M 11M sleep 1 0 0:03:18 0.0% sshd/1
70 root 2744K 1532K sleep 29 0 0:00:01 0.0% pfexecd/4
7895 noaccess 2872K 1788K sleep 29 0 0:00:06 0.0% mdnsd/1
209 root 28M 24M sleep 29 0 0:00:13 0.0% syseventd/23
21 root 2928K 1628K sleep 29 0 0:00:00 0.0% dlmgmtd/7
7574 root 4012K 716K sleep 1 0 0:00:17 0.0% ipmon/1
18 netadm 4244K 2712K sleep 29 0 0:00:00 0.0% ipmgmtd/3
8018 root 2028K 1360K sleep 59 0 0:00:00 0.0% ttymon/1
9286 root 63M 52M sleep 1 0 0:01:30 0.0% vmadmd/6
7063 root 2548K 1004K sleep 22 0 0:02:32 0.0% lldpd/1
10 root 10M 9112K sleep 29 0 0:02:12 0.0% svc.configd/16
8 root 8528K 5960K sleep 29 0 0:00:07 0.0% svc.startd/15
Total: 74 processes, 868 lwps, load averages: 0.37, 0.13, 0.08
So lab30 is using ~1.3% of the available cpus. This machine has 56 cores (psrinfo
output will tell you for your machine), so that is roughly 70% of one core.
Using prstat
to take a closer look.
# prstat -mcLvp 132867
Please wait...
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP
132867 root 73 27 0.0 0.0 0.0 0.0 0.0 0.0 0 455 .70 0 lab30/1
Total: 1 processes, 1 lwps, load averages: 1.07, 0.95, 0.56
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWP
132867 root 73 27 0.0 0.0 0.0 0.0 0.0 0.0 0 454 .70 0 lab30/1
Total: 1 processes, 1 lwps, load averages: 1.07, 0.95, 0.57
...
So lab30 is spending 73% of its time running user level code (code that makes up lab30), and 27% of the time running in system mode (most likely, handling system calls). Let's see what system calls lab30 is making. We'll look at the user side in a later session of the course.
# dtrace -n 'syscall:::entry/pid == $target/{@[probefunc] = count();} tick-10s{exit(0);}' -p 132867
dtrace: description 'syscall:::entry' matched 236 probes
CPU ID FUNCTION:NAME
1 86843 :tick-10s
read 9564857
In 10 seconds, lab30 did 9564857 read system calls. Let's see what file(s) are being read.
# dtrace -n 'syscall::read:entry/execname == "lab30"/{@[fds[arg0].fi_pathname] = count();} tick-10s{exit(0);}'
dtrace: description 'syscall::read:entry' matched 2 probes
CPU ID FUNCTION:NAME
32 86843 :tick-10s
/var/tmp/max/data100m 5961350
#
Always the data100m
file. Let's check return values.
# dtrace -n 'syscall::read:return/pid == $target/{@[arg1, errno] = count();} tick-10s{exit(0);}' -p 132867
dtrace: description 'syscall::read:return' matched 2 probes
CPU ID FUNCTION:NAME
30 86843 :tick-10s
0 0 9685394
#
So read(2)
always returns 0, with errno equal to 0 (i.e., no errors). Let's check the number of bytes requested on the read call. This is the 3rd argument to read(2)
.
]# dtrace -n 'syscall::read:entry/pid == 132867/{@["read size: ", arg2] = count();} tick-10s{exit(0);}'
dtrace: description 'syscall::read:entry' matched 2 probes
CPU ID FUNCTION:NAME
14 86843 :tick-10s
read size: 0 8535882
#
So read
is being called with 0 bytes requested.
This is most likely due to a bug in the code. Possibly a variable is not being initialized correctly. You should want to look at the code for read(2)
system calls.
That's an example. Please do not use truss
to look at these problems, as truss will often point you in the right direction (i.e., you can probably solve most of these using truss
, and not learn any DTrace. Good luck!