Monday, May 12, 2008

Practical *Extraction* and Report Language

Since its debut in the late 1980s, PERL has entrenched itself in many computer systems as its power found many uses. The advent of web applications in the 1990s saw PERL spread beyond its roots as the nascent web development community used it to write CGI scripts.

Nevertheless, PERL's forte is clearly communicated in its acronym. Perhaps you're after the extraction of information from a large data set based on some arbitrary criteria. Perhaps the only thing you need is to generate a CSV file that will be fed to Microsoft Excel to produce a chart, which is a form of report.

The former was the case when a coworker brought me a log file with thousands of lines. He wanted to filter the lines by email address, keeping only the first line in which each address appeared. Here's a sample:


./app.comp.log:2008-05-01 08:35:53,288 [WorkExecutorWorkerThread-3] ERROR com.mycompany.events.bo.ejb.NewsSubMDBean - onMessage: Failed: <root><com.ftd.events.core.NewsEventVO Status="UNSUBSCRIBE" OriginAppName="BOUNCE" EmailAddress="anemail@nowhere.com" CompanyId="123" OperatorId="HARD BOUNCE" OriginTimestamp="1209600000000"/></root>

./app.comp.log:2008-05-01 08:35:53,221 [WorkExecutorWorkerThread-3] ERROR com.mycompany.events.bo.ejb.NewsSubMDBean - onMessage: Failed: <root><com.ftd.events.core.NewsEventVO Status="UNSUBSCRIBE" OriginAppName="BOUNCE" EmailAddress="anemail@nowhere.com" CompanyId="XYZ" OperatorId="HARD BOUNCE" OriginTimestamp="1209600000000"/></root>


In these two lines the email address is the same, so the second line, and any later line carrying that address, would be discarded.

If the goal were simply to extract unique email addresses, the ubiquitous cut, sort and uniq text utilities on *NIX platforms could trivially solve the problem. But the goal was not just to extract email addresses; the information around each email address needed to be preserved. This meant I couldn't apply a cut/sort/uniq solution.
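To make the trade-off concrete, here is a minimal sketch of the cut/sort/uniq approach. It assumes the address is always the sixth double-quote-delimited field, which holds for the sample lines above but may not hold for every line in a real log. The sample lines are abbreviated, and the unique_emails name is made up for illustration.

```shell
# Abbreviated copies of the two sample lines ("..." marks trimmed text).
sample='... Failed: <root><com.ftd.events.core.NewsEventVO Status="UNSUBSCRIBE" OriginAppName="BOUNCE" EmailAddress="anemail@nowhere.com" CompanyId="123" .../></root>
... Failed: <root><com.ftd.events.core.NewsEventVO Status="UNSUBSCRIBE" OriginAppName="BOUNCE" EmailAddress="anemail@nowhere.com" CompanyId="XYZ" .../></root>'

# Hypothetical helper: read log lines on stdin, print each address once.
unique_emails() {
  # Split on double quotes; field 6 is the EmailAddress value in these lines.
  cut -d'"' -f6 | sort | uniq
}

printf '%s\n' "$sample" | unique_emails
```

The pipeline prints the address once, but the CompanyId and everything else on the line is gone, which is exactly the problem.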

The solution was to employ PERL to save state. I was able to do this in a one-liner:


cat unfiltered.txt |
perl -lane ' if ( m/EmailAddress="([^"]*)"\sCompanyId/ ) { if ( $have{$1} ) { ++$noise; } else { $have{$1} = 1; print $_ } } ' > filtered.txt


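-lane is one of my favorite command line switch combinations in PERL. n wraps your code in an implicit while loop that traverses input coming from stdin (or from files named on the command line) and puts each line into the default variable, $_. a, used together with n, additionally splits each line on whitespace into the array @F; the one-liner above never touches @F, so a is only along for the ride. l strips the trailing newline from each input line and appends one to everything you print, which helps shorten your one-liner when sending information back out. e simply tells perl to execute the expression (code) that follows.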

Using a regular expression and a hash allows me to keep track of which email addresses I've already seen. By not printing a line when its email address is already in the hash, I wind up with one line per email address, surrounding context intact.

Easy.