Monday, March 24, 2008

Just A Bad Memory

Java application servers have become pervasive in the last decade as web-based applications have replaced old-school Visual Basic (VB) applications. Nowadays many corporate web applications are written in Java, with the code ultimately running on a J2EE application server. Java's pedigree is C++, and much to Joel Spolsky's chagrin, Java makes things much easier on developers in one area: memory allocation. In C/C++ a programmer is completely responsible for allocating and deallocating memory. Because of this, memory bugs at every level, from drivers to operating systems to applications, have found life in C and C++ codebases since the inception of those languages in the 70's and 80's, respectively.

Java does away with this onus by completely managing memory. Developers are free to use memory liberally with nary a care. Memory is reclaimed by the Java runtime through garbage collection in the background.

But garbage collection has issues of its own. The biggest is that applications can appear to hang as the Java runtime spends more time trying to reclaim memory than executing code. In a time-sensitive situation, this can result in outright application failure.

Such was the case when some applications at my employer running under Oracle's application server would die after running for an extended period of time. Application failure was signaled in the logs by several full garbage collection entries appearing within a short period of time, e.g.:


2624761.994: [Full GC ... 13.5346127 secs]
2624787.633: [Full GC ... 13.4446663 secs]
2624824.282: [Full GC ... 13.4713927 secs]


The indicator here is Full GC. The leading number on each line is a timestamp (seconds since the JVM started) and the trailing figure is how long the collection took. This example shows full garbage collection taking place three times in less than a minute. I was asked whether successive garbage collections could be monitored, with an alert sent when they were detected.
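
As an aside, log lines in this format come from running the JVM with garbage collection logging switched on; with Sun's HotSpot JVM that is something along the lines of the following (MyApp is a stand-in for the real application, and the exact flags in use at a given installation will vary):

java -verbose:gc -XX:+PrintGCTimeStamps MyApp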

Monitoring a single file is easy; a quick PERL hack would be:
open FH, "tail -f log_file |" or die "cannot tail log_file: $!";
while ( <FH> ) {
    if ( m/some_string/ ) {
        # Do something when "some_string" is seen,
        # like send an email.
    }
}

You could nohup such a script and leave it running indefinitely in the background.

The problem with such an approach is that an application server tends to house lots of J2EE applications. You would need a script for each application.

This simple solution also does not get around the fact that log files are typically rotated away in the middle of the night. In other words, after a day the log file being monitored no longer reflects transactions, on account of having been renamed with a date extension and moved elsewhere. A new log file is created and the monitor is not attached to this new file.
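
As an aside, GNU tail has a -F flag that re-opens a file by name after it has been rotated, so a variation of the earlier hack can ride out the nightly rotation (assuming the tail on hand is GNU's):

open FH, "tail -F log_file |" or die "cannot tail log_file: $!";
while ( <FH> ) {
    # Same matching logic as before; tail -F keeps following the
    # name log_file even after the file is rotated and recreated.
}

That still does nothing for the one-script-per-application problem, though.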

Such a solution would also be maintenance-prone since a monitoring script would be needed for each application. With each new application a script would have to follow in tow, and over time the monitoring system may no longer reflect what is actually running inside a J2EE application server. This shouldn't be surprising, since it's quite common for the people who write monitors to be separate from the application developers.

Furthermore, different servers host different applications, so now you have a set of monitoring scripts that varies depending on which server you're sitting at.

Taking all this into consideration I devised a forking PERL script that gets around all these problems. Forking is not something that surfaces often when writing PERL scripts, but its use can greatly simplify some problems. The solution I devised has a parent process spawn one child for every log file (each associated with an application). The log files are enumerated through a regular expression and the resulting file list is fed into the PERL script, so I do not need to keep a list of what applications are running where. In the case of Oracle's application server, application logs are prefixed with OC4J~OC4J_.

ls | egrep "OC4J~OC4J_[A-Z]+" | egrep -v ":[0-9]+$" |
nohup perl monitorGC.pl &

And finally the PERL script itself:


#!/usr/bin/perl

# Reap children automatically so they don't turn into zombies.
$SIG{CHLD} = 'IGNORE';

# Variable is a bit of a misnomer at this point: the length of time
# for the parent to sleep.
$logRotationInterval = 3600; # seconds

# If we see full garbage collection happening more than once in this
# interval, send an alert.
$interval = 60; # seconds

# The list of log files to monitor is read from standard input.
@files = <>;
chomp(@files);

# Process IDs of the children.
my @pids;

while ( 1 ) {
    # Loop forever. After one day spawned children will quit since log
    # files are rotated once a day. This main loop will then spawn new
    # children (see comments below).

    for ( $index = 0; $index <= $#files; ++$index ) {
        my $pid = fork();
        $pids[$index] = $pid;
        if ( $pid == 0 ) {
            # Child: follow one log file.
            $last = 0;

            open FH, "tail -f $files[$index]|";
            while ( <FH> ) {
                if ( m/\[Full GC/ ) {
                    if ( m/([0-9]+\.[0-9]+) secs]/ ) {
                        if ( $1 > 2 ) { # If the Full GC took over 2 seconds
                            # Pull the timestamp off the front of the line.
                            m/^([0-9]+\.[0-9]+):/;
                            open LOGENTRY, ">/tmp/monitorGCevent.txt";
                            print LOGENTRY $files[$index]."\n\n";
                            print LOGENTRY $_;
                            close LOGENTRY;
                            if ( $1 - $last < $interval ) {
                                `./sendGCMail.sh $files[$index]`;
                            }
                            $last = $1;
                        }
                    }
                }
            }
        }
    }

    # Sleep while the children do their work.
    sleep $logRotationInterval;

    # Kill the children; there's no reliable way to have them exit.
    for ( $index = 0; $index <= $#files; ++$index ) {
        `kill -9 $pids[$index]`;
    }

    # Kill the tail processes that were spawned.
    `killTails.sh`;
}


So there you have it: an application-agnostic PERL script that monitors for successive garbage collection events and sends out an email when such an event is detected.

Thursday, March 13, 2008

GotoMySshPC


One question that surfaces sporadically when managing Internet-facing IT systems is: what does the rest of the world see? Occasionally, after a systems deployment, Internet traffic cannot reach some or all of the systems deployed.

This past week for the Nth time in my career this question surfaced. Everyone uses email but few understand the mechanism by which email delivery actually happens. MX records are the key; more on the topic in a moment.

Domain Name Servers (DNS) translate easily read and memorized host names such as www.google.com into TCP/IP addresses, which are not so easy to remember. For example, my Windows XP desktop would currently try to hit one of the following TCP/IP addresses if I were to browse www.google.com:

64.233.167.99
72.14.207.99
64.233.187.99


Quick! Look away! Can you recall any of those IP addresses? If you're like most people, probably not. This particular information was retrieved from my ISP's DNS server. A DNS server, therefore, is not much different from a phone book.

DNS servers also play the crucial role of facilitating mail delivery. They store different types of information, and the type used for delivering mail is the MX record. MX is short for Mail eXchanger. When you use an email client such as Outlook to send an email, your ISP's mail server, which Outlook happens to be chatting with, ultimately has to talk to another computer to deliver the email you just wrote. Which computer your ISP's mail server talks to in order to deliver that email is answered by MX records.
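
For the curious, in a BIND-style zone file MX records look something like the following (hypothetical hosts; the number is a preference value, and lower values are tried first):

mydomain.com.    IN  MX  10 mail1.mydomain.com.
mydomain.com.    IN  MX  20 mail2.mydomain.com.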

One morning mail delivery for a particular domain of my employer was failing and I was called in to investigate. On a hunch I did an MX record query for the domain in question and noticed that its MX records had disappeared, which meant mail delivery to that domain would fail. After further investigation it turned out a typo in a DNS configuration file, associated with a publish the day prior, had caused the problem. The resolution was to correct the typo, republish our DNS information and let it propagate across the Internet.
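
An MX query like the one I did can be performed with nslookup (more on that tool shortly) or programmatically. A minimal sketch in PERL, assuming the Net::DNS module is installed and using a stand-in domain:

#!/usr/bin/perl
# Print the preference and host name of each MX record for a domain.
use Net::DNS;

my $res   = Net::DNS::Resolver->new;
my $query = $res->query("mydomain.com", "MX");
if ( $query ) {
    foreach my $rr ( grep { $_->type eq 'MX' } $query->answer ) {
        print $rr->preference, " ", $rr->exchange, "\n";
    }
} else {
    print "No MX records found\n";
}

An empty result for a domain that should be receiving mail is exactly the failure I was staring at.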

DNS is a hierarchical system. If your local DNS server does not have the information requested, e.g., what does www.crazydeals.com resolve to?, the request propagates upward, ultimately reaching the root DNS servers at the top of the hierarchy. The top of the hierarchy is in turn responsible for disseminating changes it has become privy to, changes that have often occurred at the bottom of the hierarchy. As news of a local change reaches the top, queries from DNS servers at the bottom of the hierarchy percolate up to the authoritative servers, which then pass information about the change back down to other parts of the tree. This information exchange among DNS servers is constantly happening. It usually takes several hours for DNS changes one makes locally to spread across the Internet, sometimes longer.

As the day wore on, I was in the office and wanted to check what my home system saw, i.e., was the DNS change made to correct the typo working its way across the Internet?

Thus the age-old problem of wanting to conveniently reach my home computer from the office, to check what it was seeing on the public Internet, had reared its head again. Once, while doing contract work, I overheard an individual in the cubicle next to me call his wife to ask her to browse pages to see if they were publicly available. Suffice it to say, this problem surfaces on a semi-regular basis over the course of one's career.

Several years ago I had set up a LINUX box that I would reach via SSH. SSH (Secure Shell) is a cryptographic protocol that allows issuing commands to a remote computer system. While most people are used to GUI desktops, most back-end computer systems are controlled through command line interfaces. If you think the monolithic bank that houses your money, along with that of hundreds of thousands of other people, manages it all by clicking on icons with a mouse, think again. SSH is commonly the substrate by which remote control and management of many large IT systems happens.

One particular command line program that can be used to query DNS servers is nslookup, and it comes with any contemporary operating system that connects to the Internet. Every single eye-candy-laden Windows and Macintosh system ships with this command line tool. Thus, if I could reach my Windows XP desktop, I could use this tool to gauge how far along our change had propagated, at least from one vantage point: that of my ISP's DNS servers. Not all DNS servers are public facing, which is why I wanted to see what my home system saw. I wanted to see if the change had trickled down to the private portions of the DNS hierarchy.

There are commercial services that allow one to readily reach one's home computer. One of the better known is GotoMyPC. My primary issue with GotoMyPC is cost: it runs about $180 for one PC, and it is a service, which means count on paying $180 every year.

Another option would be to have the desktop you find yourself at become part of your home network. This is a more general scenario than what GotoMyPC provides and is the reverse of what most people do when working remotely. Instead of connecting to your office via a VPN client, whereby, through the powers of indirection, you suddenly join your office's network and can work remotely, you would do the same in reverse: your office desktop becomes an extension of your home network. To this end there is an excellent piece of open source software called OpenVPN.

However, a VPN solution to the home has two problems. One is that almost all home TCP/IP addresses are part of a DHCP pool, so your TCP/IP address changes whenever the DSL/cable modem sitting in your home has to renegotiate its connection, however brief the interruption in connectivity may be. So if you were to take note of your TCP/IP address, you might find that the next time you were inclined to connect to your home PC, your home network's TCP/IP address is no longer the same. You would be completely out of luck.

A solution to this particular problem exists in the form of an organization known as DynDNS. They allow you to associate your home TCP/IP address with a worldwide hostname/domain. This means that after registering a domain and using DynDNS to provide DNS services for said domain, you could use something like mypc.mydomain.com when telling OpenVPN to connect to your home network while you are sitting at the office.

Since DynDNS and OpenVPN are free, they are a much more attractive option than the pricey yearly cost of GotoMyPC. However, you still need to register a domain with a registrar before mypc.mydomain.com (whatever it really happens to be) takes on life. The cost of registering a domain is nominal, about an order of magnitude less per year than GotoMyPC's yearly cost if you shop around.

There is still one issue that GotoMyPC readily gets around that the OpenVPN/DynDNS solution does not. If your home connectivity is lost and a new IP address is issued, it can take several hours for that information (the new IP address) to propagate across the Internet once it reaches DynDNS' servers. And Murphy's Law may be in full force on the very day you need to reach your home network: if your DSL/cable modem did receive a new TCP/IP address, DNS propagation means it will be several hours before you can connect.

GotoMyPC gets around this by having agent software that chats with their own centralized servers. The agent communicates your home network's current TCP/IP address to centralized servers so DNS is not even involved when trying to reach your home desktop through their service.

If you do go down the DynDNS/OpenVPN route then, like any real business, you had better remember to renew the domain; otherwise you will find that your convenient hostname mypc.mydomain.com one day no longer works. This is actually a common problem, and various companies, including the likes of Microsoft, have forgotten to do this.

Stepping back for a moment, this all boils down to having information readily in hand, i.e. what's the current TCP/IP address affiliated with my DSL/cable modem's connection to the Internet.

After this past week I decided to solve this problem once and for all, so I could always readily reach my home network, without cost and without hassles. No GotoMyPC, no OpenVPN/DynDNS, no registrars, no accounts with anyone.

Earlier I mentioned that I used to have a LINUX box with SSH running. The primary deterrent to its continued use was the fact that the TCP/IP address into my home network would change, and eventually I simply abandoned it. Neither OpenVPN nor DynDNS existed at the time.

Once again I would leverage SSH, specifically OpenSSH. While most IT administrators are used to using SSH with *NIX systems or network devices, it turns out OpenSSH can readily be configured to run as a service on a Windows XP/Vista/200x system, thereby affording remote control and issuance of commands. OpenSSH can be downloaded as part of Cygwin:

www.cygwin.com

Cygwin is a layer of software that makes a Windows system appear *NIX-like. This means, among other things, that various open source applications such as Apache's web server (1.3.x) can be compiled and run on a Cygwin-equipped system with no changes to the *NIX source code whatsoever. Beyond programmatic similarities, it also provides ports of back end services such as SSH (OpenSSH). Rather than dive into the gory details of setting up SSH on Windows, go here for an excellent how-to.

Cygwin provides all the command line utilities commonly used in *NIX. People often collectively call a software system LINUX, UNIX, etc., but many of the tools employed in those environments have no intrinsic functionality tying them to any single operating system. For example, many of the GNU command line utilities have been ported to Windows; the executables are stand-alone and run on Windows without any dependencies. When installing Cygwin, all of these tools are available, including command interpreters such as the bash shell.

All this means I should be able to write a script that retrieves the world-facing TCP/IP address associated with my home network. This information, however, is not stored on any single machine inside my home network. Like most people I have a home router fronting my network connections. Using my home router's web interface I can readily see what my outside TCP/IP address happens to be, so the task is getting at this information through a script.

Most people use their web browsers to fetch web pages, but the HTTP protocol is very simple and command line utilities exist to do the same. They often form the basis of "heartbeating" web applications, i.e., if I can fetch a web page, the web application is still up; if not, send out an email alert. But they have other uses, such as this one. I simply want to log into the router, fetch the web page that contains my outside TCP/IP address, and then do something with that information. Easy enough:


#!/usr/bin/bash

while true; do
    sleep 600
    # Fetch the router's status page; wget saves it locally as Status.htm.
    wget --http-user=user --http-password=passwd \
        http://192.168.1.1/Status.htm
    # Pluck the outside TCP/IP address out of the page.
    cat Status.htm | cut -c3967-3981 |
        perl -lane 'm/([0-9.]+)/; print $1' > currentIP
    diff currentIP lastIP
    if [ $? -ne "0" ]; then
        # Send mail about the IP change
        cscript sendIPAddressViaMail.vbs
        cp currentIP lastIP
    fi
    rm Status.htm
done



My home router performs basic access authentication and I can pass the requisite username/password through command line arguments to the wget utility program when I retrieve the web page that contains the TCP/IP address my DSL modem negotiated. In the case of my LinkSys router that happens to be:

http://192.168.1.1/Status.htm

Initially I used wget in an ad hoc manner, and by default it stores the fetched page in a file. I noticed the page returned from the router had no line breaks, so I fired up EMACS and navigated to the column where the TCP/IP address affiliated with my outside connection starts. Noting the column offset, I scripted cut to pluck 15 characters (the maximum possible length of a TCP/IP address string). Finally, I filter the string through PERL so that only characters making up a TCP/IP address end up in the final string output to the console.
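
The fixed column offsets are admittedly brittle; a firmware update that shuffles the page layout would silently break the extraction. A more forgiving variation, under the assumption that the address of interest is the first dotted quad on the page that isn't a private 192.168.x.x address, might look like:

#!/usr/bin/perl
# Hypothetical layout-tolerant extraction: scan the whole page for
# dotted quads and print the first one outside 192.168.x.x.
open FH, "Status.htm" or die "cannot open Status.htm: $!";
while ( <FH> ) {
    while ( m/(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})/g ) {
        next if $1 =~ m/^192\.168\./;
        print "$1\n";
        exit;
    }
}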

I store the extracted TCP/IP address in a file named currentIP and compare it, using the diff command, with the last known address (lastIP is initially set up by hand). If there are no differences between the files (determined by checking diff's exit code), my TCP/IP address has not changed and the script goes back to sleep for another ten minutes (600 seconds).

If the files are different, then things get a bit more interesting: how do I communicate this information? Most *NIX administrators use the mailx utility to send email from their scripts. The problem with mailx is that it assumes there is a local mail server running, and running your own mail server is a can of worms unto itself. Honestly, I'm not inclined to run a mail server on my Windows desktop. Rather than dive into such issues, I leverage Microsoft Outlook. Knowing that the Microsoft Office applications can be automated with VBScript, I concocted the following script, which is executed through the Windows command line tool cscript:


ESubject = "IP Address change"
SendTo = "mymail@someWebMailAccount.com"
Ebody = "IP Address change"
NewFileName = "D:\cygwin\home\mariop\currentIP"

' Automate Outlook: item type 0 is a mail item (olMailItem).
Set App = CreateObject("Outlook.Application")
Set Itm = App.CreateItem(0)
With Itm
    .Subject = ESubject
    .To = SendTo
    .Body = Ebody
    .Attachments.Add (NewFileName)
    .Send
End With
Set App = Nothing


The script sends a text file containing the TCP/IP address my DSL modem last negotiated as an attachment to my web-based email account. This way I can always just browse my web mail to find out my home network's current TCP/IP address. If my TCP/IP address changes, an email will be sent out within ten minutes.

There was one last hurdle to this solution. In 2002, given the prevalence of VBScript-based worms in the years prior, Microsoft changed Outlook so that the sending of email could not be automated without confirmation: Outlook now issues a pop-up asking whether or not to allow email to be sent through a script. This is a major fly in the ointment, since I need this to run unattended; after all, I'm not at my home PC. After some googling I came up with some freeware:

http://www.contextmagic.com/express-clickyes/

That page also shows the pop-up that surfaces when email is sent via VBScript.

There you have it: with all this in place I'm assured of always being able to reach my home desktop. No GotoMyPC, no OpenVPN/DynDNS, no registrars, no accounts with anyone.

It turns out the solution I employ can be used in conjunction with OpenVPN, eliminating the need for DynDNS and for registering your own domain. However, I prefer SSHing, since it allows the computer I'm working on to maintain its local context. I can readily switch between what I'm doing at work and what I might want to do at home; the machine I'm working at doesn't become part of my home network, so local/work resources remain available to me.

If you're at all familiar with SSH's abilities, you know you can tunnel various application protocols through it, such as Microsoft's Remote Desktop. This means I can reach my graphical Windows XP desktop as the need arises, which is exactly what GotoMyPC provides, except in my case without the $180 yearly cost.

When I establish an SSH connection to my home system, I do something like this using the SSH binary that is part of Cygwin on my work machine:

ssh -L 3390:localhost:3389 my_home_ip_address
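
The -L flag follows the general form:

ssh -L local_port:destination_host:destination_port remote_host

Here destination_host (localhost above) is resolved on the home machine's side of the tunnel, so connections made to port 3390 at work are carried over the encrypted channel and handed to port 3389, Remote Desktop's well-known port, on the home machine itself. Port 3390 is just an arbitrary local port chosen so it won't collide with anything already listening at work.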

After I've logged onto my home system, instead of giving a machine name to Microsoft's Remote Desktop client while I'm sitting at the office, I simply specify:

localhost:3390

And before too long I get a login to my home Windows XP desktop while sitting at the office. The entire conversation between work and home computers is encrypted.

GotoMySshPC.