Andy is an assistant professor at the School of Information Technology and Engineering at the University of Ottawa in Canada. He can be reached at [email protected]
I use Perl to manage files, and Octave (a high-level, numerical-computation language) to crunch numbers. Recently, I worked on a project that generated enormous data files, which needed to be processed and then analyzeda perfect task for my two favorite languages. Since I'd just heard a mighty cool talk on the Inline module, it seemed clear to me that I needed to write Inline::Octave. Unlike some other Inline languages such as C or Java, Octave runs as an interpreted environment and does not natively support sockets or other interprocess communication facilities. The choices were, therefore:
- Modify Octave.
- Control it from Perl by typing into the interpreter and reading the output.
I chose the latter approach to allow the technique to work with unmodified Octave.
Perl allows controlling the STDIN, STDOUT, and STDERR of a process using the IPC::Open3 module. However, it contains a number of warnings; for example, "This is very dangerous, as you may block forever." Suitably forewarned, I set out on this hazardous enterprise, learning a number of tricks, which I will describe in this article.
Example
Is the global temperature rising? Lets suppose we've decided that we don't trust those pundits, and we'd like to calculate for ourselves whether the earth is getting warmer. However, for some crazy reason, we do trust random stuff published on the Internet, so we type "Daily Temperature Data" into Google. The third link, "Temperature Data Archive," looks promising and takes us to http://www.engr.udayton.edu/weather/. Looking further, we find a link with the title, "All sites in single file (about 4.5 Megabytes)," and download it.
$ wget ftp://ftp.engr.udayton.edu/jkissock/gsod/allsites.zip
The zip file has a simple structure; data from each temperature measurement site is in a separate archive file, and each file is structured into lines with whitespace-separated numbers for the day, month, year, and temperature. Clearly, this is a job for Perl. The script I'll describe here, temp-analyse.pl, takes the name of the file as a parameter and prints the calculated data.
Perl Code
Here's the beginning of temp-analyse.pl:
01 use Time::Local; 02 my ($city, @time, @temp, %rates); 03 04 open F, "unzip -c $ARGV[0] |" or die "Can't open $ARGV[0]: $!"; 05 while (<F>) { 06 if (/^\s+ (\d+) \s+ (\d+) \s+ (\d+) \s+ ([\-\d\.]+)/x) { 07 next if $4 == -99; # code for no data available 08 push @time, timegm( 0, 0, 12, # sec, min, hour (assume noon) 09 $2, $1-1, $3-1900); # mday, month (0-11), year 10 push @temp, ($4-32)/1.8; # convert to degree C 11 } 12 if (/^\s+ inflating: \s+ (\w+) \.txt/x) { 13 process(\@time, \@temp) if $city; 14 $city= $1; @time= (); @temp= (); 15 } 16 } 17 process(\@time, \@temp); #last one 18 close F; 19 20 printf "Average change is %1.4f %1.4f (C/year) for %d cities\n", 21 calc_stats()->as_list;
We need to convert from time in days, months, and years to a linear unit of time, such as seconds since the epoch. Perl has lots of time modules for advanced processing, but Time::Local is included with Perl, does the job, and is easy to use. We include this module in line 1 and use it in line 8. Perl also provides a great file-open function to allow reading from a pipe (line 4), and we use the command-line unzip -c filename to feed each file in turn to Perl.
The command unzip -c will loop through the contents of the archive, and for each file, output "inflating: filename" followed by the file contents. We parse the file contents (whitespace-separated numbers) on line 6 using a regular expression. The last number is the temperature and can include a minus sign and a decimal point, so we use the pattern ([\-\d\.]+) to match any number of these characters. The web site indicates that a temperature of -99 indicates missing data. We first test for that condition (line 7) and skip the processing if it is found. Otherwise, if the data are good, they are stored into the lists @time and @temp.
We detect that a new filename in the archive is being decompressed by matching the text "inflating:" (line 12). This means that the data for the previous file is complete, and can be processed by the subroutine process (line 13). The lists are then reset (line 14), and we return to processing the contents of the new archive file. When all data are received, we process the data in the last file (line 17) and close the file (not strictly necessary). Then, in lines 20-21, we call a subroutine to calculate the statistics and print them for the user.
Up to this point, Perl is unquestionably the right choice of language. Its ability to integrate with other tools via a pipe and to parse text with a minimum of painful syntax has allowed us to accomplish this part very quickly. Unfortunately, the mathematical analysis of this data is less easy with Perl. Of course, a suite of mathematical modules does exist for Perl (PDL), but the most common language for this sort of analysis is Matlab. Luckily, the open source GNU Octave software (http://www.octave.org/) is mostly compatible with Matlab and can be readily integrated with Perl.
Octave Code
Since this is an article about Perl, the calculations will be described briefly. Figure 1 illustrates some of the difficulties in estimating the trend line from the data. The graph shows temperature (degrees C) versus time for the years 1995-2003 for my hometown (Ottawa, which happens to have one of the largest annual temperature ranges). The raw data (red), shows a very large annual temperature cycle, much larger than the effect we're trying to measure (blue). In order to estimate the trend, we need to remove the components of the data that are in phase with the year (green). The best fit line is then calculated for the data (blue), and the slope of the line in degrees/year is the temperature trend.
The code shown below implements the functions process and calc_stats. time and temp are vectors corresponding to the Perl lists @time and @temp. Lines 25-31 calculate the component of the signal in phase with the first three harmonics of the year frequency. The value 365.2422 is the number of days in a year (see http://www.straightdope.com/mailbag/mleapyr.html). In lines 32-34, the year harmonics are removed from the signal (to calculate temp_clean), the best fit line is calculated, and the subroutine return value TperYr is calculated. Finally, in lines 36-37, the slope is stored in the vector sites that, because of the keyword global, may be accessed by the subroutine calc_stats. This last subroutine calculates the data mean and standard error, and the number of sites:
24 function TperYr= process( time, temp ); 25 time_step= [0;diff(time)/2] + [diff(time)/2;0]; 26 year_freq= 1/( 365.2422*24*60*60 ); 27 harmonics= 2*pi*(1:3)*year_freq; 28 year_osc= [ sin(time * harmonics), cos(time * harmonics) ] ... 29 .* (time_step * ones(1,2*length(harmonics))); 30 component= (temp' * year_osc) ./ sumsq( year_osc ); 31 32 temp_clean= temp - year_osc * component'; # remove year harmonics 33 fit = polyfit( time, temp_clean, 1); # fit to line 34 TperYr = fit(1) / year_freq; # convert to deg/year 35 36 global sites=[]; static site_no=1; 37 sites( site_no++ ) = TperYr; # store city data 38 endfunction 39 40 function stat_data= calc_stats() # mean, std err, number 41 global sites; # load city data 42 n_sites= length(sites); 43 stat_data= [mean(sites), std(sites)/sqrt(n_sites), n_sites]; 44 endfunction
Inline
In Perl, we are able to easily glue code in various languages together thanks to the infrastructure provided by Brian Ingerson's Inline module. The Inline::Octave module, which I wrote, allows us to glue these functions together with two lines of Perl:
23 use Inline Octave => q{ 2443 # Octave code (above) 45 } ;
Finally, with the code assembled together, we get:
$ perl temp-analyse.pl allsites.zip Average change is 0.0585 0.0096 (C/year) for 324 cities
Our results agree with the pundits; there is a warming trend. Be careful, however; this analysis looks at a small slice of time, is not mathematically rigorous, and, most importantly, says nothing about the underlying causes.
A Closer View of the Glue
The Inline infrastructure allows the details of the glue to be hidden. An Inline language may be compiled and linked to Perl, such as C, or may be interpreted, such as Python, Java, or Octave. One wrinkle in this classification is that Java, being a hybrid interpreted/compiled language, can also be linked to Perl. Both Inline::Python and Inline::Java set up a socket connection for interprocess communication. Unfortunately, stock Octave does not have sockets, so the best way to control it seemed to be to type commands into it from Perl, and read back the output. (After Inline::Octave was developed, Paul Kienzle developed a similar Tcl/Tk glue to Octave, and decided to approach it by building a socket library for Octave. For details, see http://www.arxiv .org/pdf/physics/0211037)
In order to do this, Perl provides the IPC::Open3 module to control the STDIN, STDOUT, and STDERR of a process. However, the documentation says: "If you try to read from the child's STDOUT writer and their STDERR writer, you'll have problems with blocking." Well, I did try, and I did have problems; however, in solving these problems, I learned a number of tricks that I describe here. These techniques are appropriate for any interpreter and should work for shells and for command-line utilities. The core requirement is that the interpreter tool is understood well enough that it can't give unexpected output.
In order to illustrate the techniques used to do this kind of interprocess communication, I've built a small module, Example.pm, which illustrates the IPC functionality:
01 package Example; 02 use strict; 03 use Carp; 04 05 use IO::Handle; 06 my $Oerr= new IO::Handle; 07 08 use IPC::Open3; 09 open3( my $Oin, my $Oout, $Oerr, "octave -qH"); 10 eval{ setpriority 0,0, (getpriority 0,0)+4; } #lower priority slightly 11 12 use IO::Select; 13 my $select = IO::Select->new($Oerr, $Oout);
The Carp module (line 3) is used in order to allow errors and warnings to appear from the point of view of the calling subroutine. Prior to calling open3, it is important to ensure that we have a valid STDERR ($Oerr); otherwise, error output will be sent to STDOUT. This is accomplished in lines 4-5, using the IO::Handle module. The octave process (line 9) is started with appropriate command-line flags, and handles to STDIN, STDOUT, and STDERR are created. open3 will die if the process can't be started, which is just fine for this example. I've found that decreasing the priority of the parent process (Perl) slightly (line 10) can make this kind of communication more efficient. The idea is that the interpreter has more of a chance to fill its output buffer, and Perl makes fewer (but larger) input reads. The priority control functions are not available on some systems (e.g., Windows) and will cause a fatal error. The solution is to wrap line 10 in "eval {}".
In lines 12-13, an IO::Select object is created with the STDOUT and STDERR filehandles. IO::Select is a (very convenient) wrapper around select that allows multiple filehandles to be waited on simultaneously. Without this function, one could make a blocking read on one handle, but if the writing process wrote to the other, it could result in deadlock. I've quite often seen programmers who try to solve this type of problem with polled I/O; nonblocking reads are alternately made on each filehandle. This is an inefficient, unreliable, and generally bad solution. The select approach allows the OS kernel to do the job (perhaps using hardware interrupts or other facilities unavailable to user-space software). Surprisingly, IO::Select works fine on windows (tested on 2000+), even while the underlying C implementation only works for sockets and not filehandles in win32.
Once the process has been started with open3, we want to be able to call a subroutine interpret, which will call the interpreter with the supplied string and return its output:
14 sub interpret { 15 my $cmd= shift; 16 my $marker= "-9Ahv87uhBa8l_8Onq,zU9-"; # random string 17 my $marker_len= length($marker)+1; 18 19 print $Oin "\n\n$cmd\ndisp('$marker');fflush(stdout);\n"; 20 21 my $input; 22 while ( 1 ) { 23 for my $fh ( $select->can_read() ) { 24 if ($fh eq $Oerr) { 25 process_errors(); 26 } else { 27 sysread $fh, (my $line), 16384; 28 $input.= $line; 29 } 30 } 31 last if substr( $input, -$marker_len, -1) eq $marker; 32 } 33 34 # leave process blocked doing something, or it can't handle CTRL-C 35 print $Oin "\n\nfread(stdin,1);\n"; 36 return substr($input , 0 , -$marker_len ); 37 }
In order to avoid deadlock, we need to clearly separate the phases when we write to or read from the interpreter. To be able to detect the end of the output from the interpreter, we define a marker ($marker, lines 16-17) with a random string that is, with any luck, very unlikely to appear in the output by itself. Then (line 19), the command $cmd is sent to the interpreter STDIN ($Oin). To the command, we prepend newlines (explained later), and append a command (disp) to print the marker after the completion of the instruction.
In lines 21-32, we enter an infinite loop, waiting for output from the interpreter and looking for the marker string. On line 23, we use the select function to wait on input from the interpreter on either STDERR and STDIN. If STDERR input is received (lines 24-25), we do special processing in the subroutine process_errors; otherwise, STDIO is read and appended to $input (lines 27-28). The function sysread does nonbuffered I/O on a filehandle. It is required in this instanceusing $line= <$fh> results in a deadlock.
The input is tested for the marker string (line 31) and, if found, the loop is terminated and the interpreter output (without the marker) is returned at line 36. One final trick is to leave the interpreter blocked reading on STDIN. This is the reason for the prepended newlines in line 19; the fread instruction needs some "throw away" input to consume. This blocking read is required to allow a clean shutdown of Perl and Octave when the user enters Ctrl-C. There seems to be an issue with GNU readline (which is used by Octave and many other interpreters, including bash). If the child interpreter receives Ctrl-C and the interpreter is not blocked, readline seems to panic and prevent a clean shutdown. This is a hack, but it may be useful in other scenarios as well. Another possible approach would be to trap Ctrl-C (using a custom $SIG{INT} subroutine), then send appropriate commands to the interpreter.
The interpreter sends errors and warnings to its STDERR. If we detect this, the subroutine process_errors is called:
39 sub process_errors 40 { 41 my $select= IO::Select->new( $Oerr ); 42 43 my $input; 44 while ( $select->can_read(0.1) ) { 45 sysread $Oerr, (my $line), 1024; 46 last unless $line; 47 $input.= $line; 48 } 49 50 croak "$input (in octave code)" if $input =~ /error:/; 51 carp "$input (in octave code)" if $input; 52 } 53 54 1;
The concept behind process_errors is that warnings and errors will be short bursts of text written to STDERR. We are thus able to temporarily stop reading from STDIN and focus exclusively on STDERR until the error text is complete. The end of the stream of error text is defined to be 0.1s with nothing read on the filehandle. An IO::Select object is created with only the STDERR filehandle (line 41). This allows us to test for activity with a timeout with the can_read method (line 44). After the error text is read, we need to decide whether it constitutes a warning or an error; for this simple example, we simply test for the string "error:" (line 50). By using the carp and croak functions, warnings and errors are generated, which will appear as a "warn or die" within the code that calls the interpret subroutine. Normally, Perl will print warn text to STDERR and continue, but will stop execution for die. Of course (being Perl), this behavior can be altered. die can be intercepted by wrapping the commands in eval:
eval { # commands }; if ( $@ ) { # error handler }
warn can be intercepted by defining a handler for the __WARN__ pseudosignal. Within a block, we use local to override $SIG{__WARN__} to a handler subroutine of our choice. See perlvar(1) for the gory details:
{ local $SIG {__WARN__} = sub { my $warning = shift ; # warning handler }; # commands }
In order to illustrate the functioning of Example.pm, the following examples call Example::interpret with a string that is evaluated with:
1) no errors ("1/2"); 2) a warning ("1/0"), and 3) a syntax error ("1/*") 01 $ perl -MExample -e'print Example::interpret("1/2")' ans = 0.50000 02 $ perl -MExample -e'print Example::interpret("1/0")' warning: division by zero (in octave code) at -e line 1 ans = Inf 03 $ perl -MExample -e'print Example::interpret("1/*")' parse error: >>> 1/* ^ (in octave code) at -e line 1
In terms of performance, this technique seems to be mostly limited by the speeds of the interpreters, and the loops and memory allocation for the I/O. Measurements on a PII 266 MHz (running Linux Mandrake 9.1) show that data can be transferred at about 0.8 MB/sec, while Perl can write to /dev/null at about 50 MB/sec.
Thus, we've seen that Perl makes it quite easy to integrate an interpreter into a Perl script by controlling its input and output filehandles. It is, of course, quite true that there are several traps to avoid, and mistakes typically result in process deadlock. On the other hand, when done carefully, this capability can be very useful. We see that, once again, Perl merits the distinction as "the duct-tape of the Internet."
TPJ