Thursday, August 9, 2007

As easy as -p -i -e

Before the advent of the World Wide Web and the decoding of the human genome, Perl had already ensconced itself in the IT world among *NIX system administrators. For good reason: Perl provides a more complete feature set than the Bourne shell, sed, or awk. Then and now, all of these tools are commonly used by system administrators to automate repetitive tasks.

One of my favorite invocations of Perl uses the command-line options -p -i -e. This particular combination allows for the in-place editing of files, namely searching for one string and replacing it with another. Like so:

perl -p -i -e 's/original_text/replacement_text/' configuration_file

You can even qualify the -i parameter to back up the file being modified:

perl -p -i.bak -e 's/original_text/replacement_text/' configuration_file

The original file will continue to live on as configuration_file.bak.

Recently, a coworker was faced with changing a configuration file to point from one database to another on six production instances of Oracle's Application Server. Each instance in turn housed ten applications. This meant he would need to edit sixty configuration files by hand, or drill into Oracle's web-based administration tool sixty times. Either way, the process is tedious and error prone.

Armed with such knowledge, cobbling together a solution for my coworker's plight did not take long. Logging into any number of machines from a script is one reason the Secure Shell (ssh) exists. Since I administer the boxes the changes needed to be made on, I had the requisite private cryptographic key to log into all of them without being prompted for a password.

This problem beckoned for automation, since I could readily traverse all of the machines from a central point and invoke commands. In summary, I wanted to log into each machine, find the configuration files of interest, and do an in-place substitution to point to a different database.

Here is the solution that saved my coworker from firing up vi sixty times across six machines:

for i in mach1 mach2 mach3 mach4 mach5 mach6; do
    ssh $i 'cd /oracle/10gR3; find . -name "data-sources.xml" | xargs perl -p -i.bak -e "s/oracle/microsoft/"'
done

The for loop is a shell construct executed on the system that serves as my jumping-off point to reach each target machine, where I employed ssh-agent to avoid password prompting. Ultimately, the heart of what I am doing on each machine is this:

cd /oracle/10gR3
find . -name "data-sources.xml" | xargs perl -p -i.bak -e "s/oracle/microsoft/"

I situate myself in the base directory of the Oracle application server, find all the copies of the configuration file in question, data-sources.xml, and then use xargs to invoke Perl to do an in-place substitution on each and every instance of said configuration file. Using find with the -exec option, which removes the intermediate xargs invocation, works equally well.

So now, instead of wasting an hour (or more) editing sixty files, we both can hit the golf course an hour early.

This experience underscores the potency of using several seemingly disparate tools to solve a problem - very much the *NIX way of doing things. Now if only I could do these kinds of things "out of the box" under Windows...

Tuesday, August 7, 2007

An impromptu GUI for strace when monitoring a forking server

I used to have a coworker named Tom, and it was joked among us that he had only two possible answers to any technical hurdle posed to him: strace or tcpdump. As it turns out, more often than not, he was right.

strace gives visibility into OS system calls made by a running process. tcpdump allows sniffing of network traffic and is often used to ferret out networking issues. For monitoring an application as it interfaces with a host OS and/or its interactions over a TCP/IP network, these two tools are indispensable.

I once deployed a web application that failed during startup on account of a missing file. As best I could tell, the file in question was present. Count on strace to get to the bottom of things. It turned out to be a simple misconfiguration: the file sat in a parent directory of the code instead of the local directory. Running strace made this plain, since I could readily see all the file open operations during the startup phase. What I could infer from its output was much more meaningful than Java code complaining that it could not open a file and then immediately terminating, without ever divulging the path of the file it was trying to open.

strace is powerful, but its output can be quite noisy if the application being monitored has a high level of activity. Monitor a forking server such as Apache or Postfix and strace gets really noisy: making sense of the "spaghetti" of system calls returned from a parent/child process tree is not fun. While it is possible to have strace latch onto a specific child process, the errant behavior may not be occurring in the monitored process but rather in a non-monitored sibling. Using strace to gain complete visibility into a forking server, with output that is readily digestible, is therefore problematic.

I found myself in precisely this boat when I wanted to monitor the file activity of a Postfix instance running under Solaris, where strace's equivalent is truss. Before too long, here is the solution I devised:

lsof |
grep TCP |
grep smtpd |
awk '{print $2}' |
sort |
uniq |
perl -lane ' system("/usr/X/bin/xterm \"-sb\" \"-sl\" \"1000\" \"-e\" \"/usr/bin/truss\" \"-p\" \"$_\" &") '

lsof is a tool that lists all open file handles on the operating system, including network file handles. I was specifically looking for processes with TCP/IP network connections (first filter). Then I looked for smtpd, the name under which Postfix is listed in the process table (second filter). After that, I used awk to pluck out the Postfix process ids, which appear in the second column of lsof's output (third filter). I sorted these (fourth filter) and removed any duplicates (fifth filter). Finally, this list of process ids was fed into a Perl one-liner (sixth filter) that spawns xterms with the -e option. This option launches an application inside of an xterm, in this case truss.
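To illustrate just the PID-extraction portion of the pipeline without a live Postfix instance, here it is run against a few fabricated lines of lsof-style output (the columns are a simplified stand-in for real lsof output; the PIDs are invented):

```shell
# Three fake lsof lines: PID 101 holds two TCP handles, PID 102 holds one
printf '%s\n' \
  'smtpd 101 postfix 12u IPv4 0x1 0t0 TCP *:smtp (LISTEN)' \
  'smtpd 101 postfix 13u IPv4 0x2 0t0 TCP *:smtp (ESTABLISHED)' \
  'smtpd 102 postfix 12u IPv4 0x3 0t0 TCP *:smtp (LISTEN)' |
grep TCP | grep smtpd | awk '{print $2}' | sort | uniq
# prints 101 and 102, one per line, with the duplicate 101 collapsed
```

Each unique PID that falls out the bottom is what the Perl one-liner then wraps in its own xterm.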

With Cygwin's X server port running on my Windows XP desktop, an xterm appeared for each Postfix process running under Solaris (see image above), with truss showing me all the system calls for each of them in real time. Beyond yielding information that is readily consumable for spur-of-the-moment diagnostics, this approach has another advantage: when a process terminates, the xterm housing the truss that monitored it disappears from the desktop. That is good visual feedback for taking action, depending on the context of the situation.

For Perl to correctly interpret my intentions, I had to escape the double quotes. Furthermore, when xterm is launched in this fashion, double quotes are necessary to delimit the command-line arguments. So the sample code is harder to read than if one were to interactively fire off an xterm against each and every process, like so:

xterm -sb -sl 1000 -e truss -p process_id_in_question

This statement says to fire up truss inside of an xterm against the process id specified after the -p argument, with a scroll bar (-sb) and a 1000-line scrollback buffer (-sl 1000).

Canonically, child processes of forking servers are ephemeral. That is, a child process on a forking server will handle a given number of requests and then terminate. This design philosophy ensures that no one child can live long enough to consume all available system resources, whether by accident or malice (such as serving as a proxy for denial-of-service attacks).

Therefore, a script that hard-codes process ids is pointless, as their number and values will always vary over time. This technique instead fetches process ids on the fly and couples that with the ability of an xterm to house an application (truss in this case), the end result being an ad hoc GUI for stracing/trussing a forking server.