Tuesday, August 7, 2007

An impromptu GUI for strace when monitoring a forking server

I used to have a coworker named Tom, and the running joke among us was that he had only two possible answers to any technical hurdle posed to him - strace or tcpdump. As it turns out, more often than not, he was right.

strace gives visibility into OS system calls made by a running process. tcpdump allows sniffing of network traffic and is often used to ferret out networking issues. For monitoring an application as it interfaces with a host OS and/or its interactions over a TCP/IP network, these two tools are indispensable.

I once deployed a web application that failed during startup on account of a missing file. As best I could tell, the file in question was present. Count on strace to get to the bottom of things. It turned out to be a simple misconfiguration - the file sat in a parent directory of the code instead of the local directory. This was easy to see once I ran strace and could watch every file open operation during the startup phase. What I could infer from its output was much more meaningful to me than Java code complaining that it could not open a file and then immediately terminating (without divulging the path of the file it was trying to open).
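
For illustration, that kind of check can be sketched as a small filter over strace output. This is a minimal sketch assuming a Linux host with strace installed; the helper name failed_opens, the trace file path and the startup command placeholder are hypothetical and do not come from the original session.

```shell
# Hypothetical helper: read strace output on stdin and keep only the
# open()/openat() calls that failed because the path does not exist.
# The optional prefix covers both "[pid 1234] open(" (strace -f on the
# terminal) and "1234 open(" (strace -f -o file).
failed_opens() {
    grep -E '^(\[pid +[0-9]+\] +|[0-9]+ +)?(open|openat)\(' | grep 'ENOENT'
}

# Usage (not run here; the startup command is a placeholder):
#   strace -f -e trace=file -o /tmp/app.trace <startup command>
#   failed_opens < /tmp/app.trace
```

Each surviving line carries the full path that was attempted, which is exactly the detail the application's own error message left out.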

strace is powerful, but its output can be quite noisy if the application being monitored has a high level of activity. If the target is a forking server such as Apache or Postfix, strace gets really noisy. Making sense of the "spaghetti" of interleaved system calls coming from a parent/child process tree is not fun. While it is possible to have strace latch onto a specific child process, the errant behavior may not be happening in the process being monitored but in a non-monitored sibling process instead. Getting complete visibility into a forking server with strace, in a form that is readily digestible, is therefore problematic.
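
On Linux, one non-graphical way to tame the interleaving is strace's own -ff option, which together with -o writes a separate trace file per process. The wrapper below is only a sketch under that assumption; the name trace_tree, the default prefix /tmp/trace and the STRACE override are hypothetical and exist purely for illustration.

```shell
# Hypothetical wrapper: attach to a parent PID, follow its forks, and write
# one trace file per process (PREFIX.<pid>) instead of one interleaved stream.
# STRACE can be overridden so the command line can be inspected without
# actually tracing anything.
trace_tree() {
    parent_pid=$1
    prefix=${2:-/tmp/trace}
    "${STRACE:-strace}" -ff -o "$prefix" -p "$parent_pid"
}

# Usage (not run here): trace_tree <parent pid> /tmp/postfix
```

As far as I know, attaching this way follows only children forked after the attach, so workers that are already running would still need to be attached to individually.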

I found myself in precisely this boat when I wanted to monitor the file activity of a Postfix instance running under Solaris. In Solaris, strace's equivalent is truss. Before long, I devised this solution:

lsof |
grep TCP |
grep smtpd |
awk '{print $2}' |
sort |
uniq |
perl -lane ' system("/usr/X/bin/xterm \"-sb\" \"-sl\" \"1000\" \"-e\" \"/usr/bin/truss\" \"-p\" \"$_\" &") '

lsof is a tool that lists all open file handles on the operating system, including network file handles. I was specifically looking for processes that had TCP/IP network connections (first filter). Then I looked for smtpd, which is the name by which Postfix is listed in the process table (second filter). Next I used awk to pluck out the Postfix process ids, which are in the second column of lsof's output (third filter). I sorted these (fourth filter) and removed any duplicates (fifth filter). Finally, this list of process ids was fed into a Perl one-liner (sixth filter) that spawns xterms with the -e option. This option launches an application inside of an xterm, in this case truss.
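
The same six steps can also be sketched in plain shell, which sidesteps the Perl quoting. This is an illustrative variant rather than the command that was actually run: the function names smtpd_pids and watch_pids and the TRACER variable are hypothetical, and the awk test is slightly stricter than the two greps because it requires the command column itself to be smtpd.

```shell
# Hypothetical helper: read lsof output on stdin and print the unique PIDs
# (column 2) of smtpd processes whose lines mention TCP.
smtpd_pids() {
    awk '$1 == "smtpd" && /TCP/ { print $2 }' | sort -un
}

# Hypothetical helper: open one xterm per PID read from stdin, each running
# the tracer against that PID. TRACER is an assumption for illustration:
# truss on Solaris, strace on Linux.
watch_pids() {
    tracer=${TRACER:-truss}
    while read -r pid; do
        xterm -sb -sl 1000 -e "$tracer" -p "$pid" </dev/null &
    done
}

# Usage (not run here): lsof | smtpd_pids | watch_pids
```

Here sort -un does the work of the separate sort and uniq stages, and quoting "$pid" in the loop replaces the escaped double quotes.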

With Cygwin's port of the X server running on my Windows XP desktop, an xterm appeared for each Postfix process running under Solaris (see image above), with truss showing me all the system calls for each of them in real time. Besides yielding information that is readily consumable for spur-of-the-moment diagnostics, this approach has another advantage: when a process terminates, the xterm housing the truss that monitors it disappears from the desktop, which is good visual feedback for taking action depending on the context of the situation.

For Perl to correctly interpret my intentions, I had to escape the double quotes. Furthermore, when xterm is launched in this fashion, double quotes are necessary to delimit the command line arguments. So the sample code is harder to read than if one were to fire off an xterm interactively against each and every process like so:

xterm -sb -sl 1000 -e truss -p process_id_in_question

This statement fires up an xterm with a scroll bar (-sb) and a 1000-line scrollback buffer (-sl 1000), and runs truss inside it against the process id given after the -p argument.

Canonically, child processes of forking servers are ephemeral. That is, a child process on a forking server will handle a given number of requests and then terminate. This design philosophy ensures that no one child can live long enough to consume all available system resources, whether by accident or by malice (for example, as the vehicle for a Denial of Service attack).

Therefore, coming up with a script that hard-codes process ids is pointless, as their number and ids will always vary over time. This technique instead fetches process ids on the fly and couples that with the ability of an xterm to house an application (truss in this case), the end result being an ad hoc GUI for stracing/trussing a forking server.
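
Because the children come and go, a one-shot scan only covers the processes alive at that moment. One possible extension, sketched here under the assumption of a POSIX grep, is to remember which process ids already have a window and report only the new ones on each rescan. The helper name unseen_pids and the state file path are hypothetical, and the sketch ignores the possibility of a process id being reused later.

```shell
# Hypothetical helper: print the PIDs from stdin that are not yet listed in
# the file "$1", and append them to that file so each is reported only once.
unseen_pids() {
    seen=$1
    touch "$seen"
    while read -r pid; do
        if ! grep -qxF "$pid" "$seen"; then
            echo "$pid" >> "$seen"
            echo "$pid"
        fi
    done
}

# Usage (not run here), rescanning every 5 seconds:
#   while sleep 5; do
#       lsof | grep TCP | grep smtpd | awk '{print $2}' | sort -u |
#           unseen_pids /tmp/seen.pids |
#           while read -r pid; do xterm -sb -sl 1000 -e truss -p "$pid" & done
#   done
```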

1 comment:

Coffee Addict said...

The image is unreadable, which is too bad. Could it be larger?