Monday, March 24, 2008

Just A Bad Memory

Java application servers have become pervasive in the last decade as web-based applications have replaced old-school Visual Basic (VB) applications. Nowadays many corporate web applications are written in Java, with the code ultimately running on a J2EE application server. Java's pedigree is C++, and much to Joel Spolsky's chagrin, Java makes things much easier on developers in one area: memory allocation. In C/C++ a programmer is completely responsible for allocating and deallocating memory. Because of this, bugs at every level, from drivers to operating systems to applications, have found life in C and C++ codebases since the inception of those languages in the '70s and '80s, respectively.

Java does away with this onus by completely managing memory. Developers are free to use memory liberally with nary a care. Memory is reclaimed by the Java runtime through garbage collection in the background.

But garbage collection has issues of its own. The biggest is that an application can appear to hang as the Java runtime spends more time trying to reclaim memory than executing code. In a time-sensitive situation, that can mean outright application failure.

Such was the case when some applications at my employer, running under Oracle's application server, would die after running for an extended period of time. The failure showed up in the logs as successive full garbage collection entries within a short span of time, e.g.:


2624761.994: [Full GC ... 13.5346127 secs]
2624787.633: [Full GC ... 13.4446663 secs]
2624824.282: [Full GC ... 13.4713927 secs]


The indicator here is Full GC: this example shows full garbage collection took place three times in less than a minute. I was asked whether successive garbage collections could be monitored, with an alert sent when they occurred.
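Lines like these can be picked apart with a couple of regular expressions: the leading number is seconds since the JVM started, the trailing number is how long the collection took. A minimal Perl sketch (the field layout is assumed from the example above; real verbose-GC output varies with JVM version and flags):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A sample entry in the format shown above (layout assumed from the example).
my $line = "2624761.994: [Full GC ... 13.5346127 secs]";

if ( $line =~ m/^([0-9]+\.[0-9]+): \[Full GC .* ([0-9]+\.[0-9]+) secs\]/ ) {
    my ( $timestamp, $duration ) = ( $1, $2 );
    print "Full GC at $timestamp took $duration seconds\n";
}
```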

Monitoring a single file is easy; a quick Perl hack would be:

open FH, "tail -f log_file |";
while ( <FH> ) {
    if ( m/some_string/ ) {
        # Do something when I've seen "some_string",
        # like send an email.
    }
}

You could nohup such a script and leave it running indefinitely in the background.

The problem with such an approach is that an application server tends to house lots of J2EE applications, so you would need a script for each one.

This simple solution also does not get around the fact that log files are typically rotated away in the middle of the night. In other words, after a day the log file being monitored no longer reflects activity: it has been renamed with a date extension and moved elsewhere, a new log file has been created, and the monitor is not attached to the new file.
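A child monitor could in principle cope with rotation itself by comparing the inode of its open handle against the inode the file name currently points at, and reopening when they differ. A sketch (rotated() is a hypothetical helper, and this assumes the new log reappears under the same name):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when $path now names a different file
# than the one $fh has open, i.e. the log has been rotated away.
sub rotated {
    my ( $path, $fh ) = @_;
    my $disk_inode = ( stat($path) )[1]; # inode of the current file name
    my $open_inode = ( stat($fh) )[1];   # inode of what we have open
    return !defined($disk_inode) || $disk_inode != $open_inode;
}
```

GNU tail's -F option (follow the name rather than the file descriptor) accomplishes the same thing where it's available.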

Such a solution would also be maintenance-prone, since a monitoring script would be needed for each application. Every new application would need a script to follow in tow, and over time the monitoring system may no longer reflect what is actually running inside the J2EE application server. This shouldn't be surprising, since the people who write monitors are often not the people who write the applications.

Furthermore, different servers host different applications, so you end up with a set of monitoring scripts that varies from server to server.

Taking all this into consideration, I devised a forking Perl script that gets around all these problems. Forking is not something that surfaces often when writing Perl scripts, but its use can greatly simplify some problems. The solution I devised was to have a parent process spawn one child for every log file (one per application). The log files are enumerated through a regular expression and the resulting file list is fed into the Perl script, so I do not need to keep a list of what applications are running where. Oracle's application server prefixes application logs with OC4J~OC4J_:

ls | egrep "OC4J~OC4J_[A-Z]+" | egrep -v ":[0-9]+$" |
nohup perl monitorGC.pl &

And finally the Perl script itself:


#!/usr/bin/perl

# Reap children automatically so they never linger as zombies.
$SIG{CHLD} = 'IGNORE';

# Variable a bit of a misnomer at this point: length of time
# for the parent to sleep between respawning children.
$logRotationInterval = 3600; # seconds

# If we see full garbage collection happening more than once in this
# interval, send an alert.
$interval = 60; # seconds

# Log file names, one per line, read from standard input.
@files = <>;
chomp(@files);

# Process IDs of children.
my @pids;

while ( 1 ) {
    # Loop forever. Once a day the parent kills the children it
    # spawned, since log files are rotated once a day, then spawns
    # new ones against the fresh logs (see below).

    for ( $index = 0; $index <= $#files; ++$index ) {
        my $pid = fork();
        $pids[$index] = $pid;
        if ( $pid == 0 ) {
            # Child: follow one log file until the parent kills us.
            my $last = 0;

            open FH, "tail -f $files[$index] |";
            while ( <FH> ) {
                if ( m/\[Full GC/ ) {
                    if ( m/([0-9]+\.[0-9]+) secs\]/ ) {
                        if ( $1 > 2 ) { # Only care about Full GCs over 2 seconds
                            # Grab the leading JVM-relative timestamp into $1.
                            m/^([0-9]+\.[0-9]+):/;
                            open LOGENTRY, ">/tmp/monitorGCevent.txt";
                            print LOGENTRY $files[$index] . "\n\n";
                            print LOGENTRY $_;
                            close LOGENTRY;
                            if ( $1 - $last < $interval ) {
                                `./sendGCMail.sh $files[$index]`;
                            }
                            $last = $1;
                        }
                    }
                }
            }
            # tail should never exit; bail out rather than fall
            # through into the parent's loop if it somehow does.
            exit 0;
        }
    }

    # Sleep while children do their work.
    sleep $logRotationInterval;

    # Kill the children; there's no reliable way to have them exit
    # on their own.
    kill 'KILL', @pids;

    # Kill the tail processes that were spawned.
    `killTails.sh`;
}


So there you have it: an application-agnostic Perl script that monitors for successive garbage collection events and sends an email when one is detected.
