Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at [email protected].
In my last article, I wrote about how important it was to keep on learning new technologies and broadening your horizons; this time, we're going to look at some things you might have missed inside Perl itself. I've just finished writing the new edition of Advanced Perl
Programming, which covers a lot of the more practical things about writing Perl applications - useful CPAN modules and techniques. But what about the more hidden and less obviously practical things? Here is a collection of ten facts to intrigue, delight and inspire...
You can write scripts in UTF-8
Someone mentioned to me the other day that Perl 6 was the only language they knew of which allowed Unicode identifiers. Funny, I thought - I'm sure you can do that in Perl 5 as well. Well, you can, so long as you remember to turn on the "utf8" pragma:
use utf8;
my $犬 = Dog->new();   # an identifier in your native script
Similarly, subroutines and comments can be in your native language and script. If you want to go further and start translating the actual language constructs ("if", "for", etc.) into your native language, then look at things like the "JCode" module on CPAN, and be aware that you're creating a maintenance nightmare for yourself.
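For instance, here's a tiny sketch of a subroutine with a non-ASCII name (the Japanese word, meaning "greeting", and the example itself are just my illustration):

use utf8;

sub 挨拶 { print "Hello!\n" }
挨拶();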
There's a new way of trapping signals
I did not even know about this one until I was wandering around the Perl source trying to find interesting tidbits for this article (which may reflect more on my ability to read the manual than anything else...), but it looks like a very useful tip. Normally, I'd trap signals in Perl with something like this:
$SIG{ALRM} = \&handler;
However, there's a pragma in the Perl core called "sigtrap" which makes it easier to handle signals. With "sigtrap" you'd write the above statement as:
use sigtrap handler => \&handler, "ALRM";
Why is this any better? Well, with "sigtrap," you can set special built-in handlers, to either die or to give a stack trace, and apply them to built-in sets of signals. For instance, to give a stack trace for any of the more "serious" error signals (ABRT, BUS, EMT, FPE, ILL, QUIT, SEGV, SYS and TRAP) that you haven't specified another handler for, you would say
use sigtrap 'stack-trace', 'untrapped', 'error-signals';
This makes it much easier to work out what your signal handling code is actually trying to do.
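Since the built-in handlers and your own can be mixed freely, a small sketch (the cleanup handler here is made up) might look like this:

use sigtrap qw(die normal-signals);          # die on HUP, INT, PIPE or TERM
use sigtrap 'handler' => \&cleanup, 'ALRM';  # but handle ALRM ourselves

sub cleanup { warn "alarm went off, cleaning up\n" }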
Appending from a file to a scalar is special-cased
Try these two little programs:
perl -we '$_ = <A>'
perl -we '$_ .= <A>'
Both will complain about "A" being used only once, and both will complain about reading an unopened filehandle, but look at how they do so:
readline() on unopened filehandle A at -e line 1.
append I/O operator() on unopened filehandle A at -e line 1.
"Append I/O operator()"? What's happening is that appending from a
filehandle to the end of a scalar $x .= <..>
is optimized by Perl
into a special operation in its own right.
Actually, it turns out that readline
, or the "<FH>
" operator is
actually three different operators, as you can see from running
"perl -MO=Terse,-exec
" on the following three lines:
print <A>;
$_ = <A>;
$_ .= <A>;
The first one creates two operations, as you would expect: one reads the line, and the other prints it. The second line does something I didn't expect until I tried it - I would have expected one op to fetch $_, one to fetch "A", one "readline" and one to assign them, but there's no "sassign" op at the end. This is because "= <..>" is a special case of the readline op which knows to store its return value not onto the stack but into the scalar variable referenced by the next value on the stack, eliminating an assign op. And, as we've already seen, ".= <..>" is a special "rcatline" operation that reads and concatenates at the same time. Useful for micro-optimizers and internals hackers!
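If you want to check this for yourself, the op dump is only a one-liner away - I've elided the output here, but look for the "readline" and "rcatline" op names:

perl -MO=Terse,-exec -e '$_ = <A>'
perl -MO=Terse,-exec -e '$_ .= <A>'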
Subclassing Exporter
The usual way to write modules which provide subroutines is to optionally export them into the caller's package. So, for instance:
package My::Application;
use URI::Escape qw(uri_escape);
print uri_escape("Hello there");
Here URI::Escape has provided the uri_escape subroutine, which our package has imported and used. It does this by subclassing the standard Exporter module, like so:
package URI::Escape;
use vars qw(@ISA @EXPORT @EXPORT_OK $VERSION);
use vars qw(%escapes);

require Exporter;
@ISA       = qw(Exporter);
@EXPORT    = qw(uri_escape uri_unescape);
@EXPORT_OK = qw(%escapes uri_escape_utf8);
The heavy work of moving subroutines from the URI::Escape package into the My::Application package is done by Exporter's import method, and use Foo actually calls require Foo; Foo->import, so the uri_escape subroutine comes our way rather behind the scenes.
Now, suppose you're writing your own module which exports subroutines this way, but you also want to run some additional set-up code when the module is loaded. The import routine is a sensible place to do this set-up, so you say something like this:
sub import {
    my ($self, @stuff) = @_;
    $self->setup();
    $self->SUPER::import(@stuff);
}
Unfortunately, this doesn't work. Exporter's import works by looking at caller to determine who's calling it, and therefore ends up trying to move symbols from Exporter to My::Application, instead of from My::Application to the client code. The way around this is to use the export_to_level routine instead, which gives you a parameter to control how far back up the stack to perform the exporting:
sub import {
    my ($self, @stuff) = @_;
    $self->setup();
    $self->export_to_level(1, $self, @stuff);
}
The first parameter, 1, says that we're going back one level from the point of view of Exporter; the next is the package name, and this is followed by the import tags.
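Putting it all together, a minimal sketch of such a module might look like the following - My::Helpers, setup() and shout() are names I've made up for illustration:

package My::Helpers;

require Exporter;
our @ISA       = qw(Exporter);
our @EXPORT_OK = qw(shout);

# The extra set-up we want to run whenever the module is used
sub setup { print STDERR "My::Helpers loaded\n" }

sub shout { return uc shift }

sub import {
    my ($self, @stuff) = @_;
    $self->setup();
    $self->export_to_level(1, $self, @stuff);
}

1;

# Meanwhile, in the client code:
#     use My::Helpers qw(shout);   # runs setup(), then imports shout()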
You can lock up hashes
Anyone who's done any development of "perl" itself over the past five years or so will cringe when I mention the word "pseudohashes". They were a great idea, to be honest, but the implementation was a little unfortunate.
The idea of a pseudohash goes like this. When you're using a hash as an object, you'll generally only have a fixed set of keys, known in
advance, that you're interested in using:
package Person;

sub new {
    my $class = shift;
    bless {
        name          => ...,
        address       => ...,
        job           => ...,
        date_of_birth => ...,
    }, $class;
}
Now if at some point I say $person->{dateofbirth}, then this is going to cause a bug. Perl should "know" that I meant to say "date_of_birth" and complain. Additionally, if we have a fixed set of keys, we can essentially turn these keys into constant indexes, and use an array instead of a hash, making access faster. So a pseudohash was something that behaved like an array and looked like a hash, and was implemented like the evil hybrid monster that this implies.
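To make that concrete, here's a sketch of the layout a pseudohash used - an array whose first element maps key names to indexes. The magical $ph->{name} syntax only worked on Perls which still supported pseudohashes, but the explicit lookup below shows what was going on underneath:

# First element: the key-to-index map; the rest: the values.
my $ph = [ { name => 1, job => 2 }, "Larry", "hacker" ];
print $ph->[ $ph->[0]{name} ], "\n";   # prints "Larry"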
As a compromise, the functions in Hash::Util were implemented. These take an ordinary hash and lock it down in various ways - preventing you from adding new keys, or from changing certain values, and so on.
use Hash::Util qw(lock_keys);

sub new {
    my $class = shift;
    my $self = bless {
        name          => ...,
        address       => ...,
        job           => ...,
        date_of_birth => ...,
    }, $class;
    lock_keys(%$self);
    return $self;
}
Now if at some point I say $self->{dateofbirth} = "1978-05-29", Perl will die because I tried to add a new key to a locked hash.
It's important to note that it doesn't prevent you from accessing non-existent keys, so $person->{dateofbirth} will still slip by unnoticed. This means Hash::Util isn't a complete replacement for pseudohashes; try looking at Dave Cross's Tie::Hash::FixedKeys for a more robust but slower implementation of this idea.
DBM Filters
DBM files are essentially a hash on disk - they're random-access files which allow you to associate a key with a value. My irrational favourite is the Berkeley DB, DB_File:
tie %hash, "DB_File", "test.db"; $hash{"Larry Wall"} = "555-112-3581";
The only slight problem with DBMs is that they're generally implemented by an external C library which knows nothing about Perl, so you can't store complex Perl data structures as DBM values. The usual way around this is the "MLDBM" module, which sits in front of the DBM, marshalling the data that gets stored and retrieved. If it comes across an attempt to store a reference, it will use either "Storable" or "Data::Dumper" to serialize that reference into a string; similarly, if you're retrieving a string like that, it'll perform the appropriate inverse process ("Storable" again, or "eval") to turn the string back into a reference.
use MLDBM qw(DB_File);
tie %hash, "DB_File", "test.db";
$hash{"Larry Wall"} = Person->new(...);
"MLDBM
" works the slow, stupid way; it implements the "tie
" interface
itself, does the serializing and deserializing, and then passes on the
request to another, underlying tied hash:
sub FETCH {
    my ($s, $k) = @_;
    my $ret = $s->{DB}->FETCH($k);
    $s->{SR}->deserialize($ret);
}

sub STORE {
    my ($s, $k, $v) = @_;
    $v = $s->{SR}->serialize($v);
    $s->{DB}->STORE($k, $v);
}

sub DELETE   { my $s = shift; $s->{DB}->DELETE(@_);   }
sub FIRSTKEY { my $s = shift; $s->{DB}->FIRSTKEY(@_); }
sub NEXTKEY  { my $s = shift; $s->{DB}->NEXTKEY(@_);  }
...
There's actually a better way to do things nowadays, which hopefully "MLDBM" will move to behind the scenes. You can now add filters onto a DBM, so that when something is stored or retrieved, a subroutine of your choice gets called. So we can implement the same functionality as "MLDBM" with just a few lines of code:
use Storable qw(freeze thaw);

my $db = tie %hash, "DB_File", "test.db";
$db->filter_store_value(sub { $_ = freeze($_) });
$db->filter_fetch_value(sub { $_ = thaw($_) });

$hash{"Larry Wall"} = Person->new(...);
When we store the "Person" object, it goes through the filter we
registered with filter_store_value
, and the value is transformed via
the freeze
subroutine we got from Storable - this turns it into a
scalar value suitable for storing in the DBM. Similarly, when we
retrieve it again, it goes through Storable::thaw
which turns it back
into a reference.
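The same trick works on keys as well as values; for instance, here's a sketch (the file name is made up) which adds a trailing NUL to every key so the file can be shared with C programs, and strips it off again on the way back:

my $db = tie %hash, "DB_File", "cdata.db";
$db->filter_store_key(sub { $_ .= "\0" });   # add the NUL on the way in
$db->filter_fetch_key(sub { s/\0$// });      # strip it on the way out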
For more about what you can do with DBM filters, see "perldoc perldbmfilter".
File handles with a _< in them?
Here's something that I was asked about the other day: why does my program contain globs which start "_<" followed by a filename? To see this for yourself, run this code:
perl -le 'print for keys %main::'
You'll see, amongst the rest of the keys in the symbol table, "_<universal.c". To make things more interesting, run the same code in the debugger:
perl -d -le 'print for keys %main::'
This time you'll see a few more: I got
_</usr/share/perl/5.8/Term/ReadLine.pm
_</usr/share/perl/5.8/Carp/Heavy.pm
_</usr/share/perl/5.8/strict.pm
_</usr/share/perl/5.8/AutoLoader.pm
amongst others. Where do these come from? Well, there are actually two kinds of globs named like this. The first, usually non-Perl files like the "universal.c" that we saw earlier, are used as the globs attached to the XS subroutines that they contain. The second kind are provided by the debugger whenever a program file is loaded. From the "perldebguts" documentation:
- Each array @{"_<$filename"} holds the lines of $filename for a file compiled by Perl... Values in this array are magical in numeric context: they compare equal to zero only if the line is not breakable.
- Each hash %{"_<$filename"} contains breakpoints and actions keyed by line number.
- Each scalar ${"_<$filename"} contains "_<$filename".
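You can poke at these arrays yourself; here's a small sketch which prints a few of the source lines Perl has recorded for the running program (run it under -d so the array actually gets filled in):

no strict 'refs';
my $file  = __FILE__;
my @lines = @{"::_<$file"};   # the magical per-file array
print @lines[1 .. 3];         # line numbers start at 1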
You can find the name of a subroutine with Devel::Peek
This is another one which came up on IRC the other day. You have a subroutine reference, and you want to know what it's called, either so that you can report it in your debugging output, or so that you can do some dirty tricks with it. How do you know where the subroutine reference came from?
Let's say we want to make some method "private", in the sense that it can only be called by the class which created it. Here's as much of the "make_private" routine as we can do:
package UNIVERSAL;

sub make_private {
    my ($class, $method) = @_;

    # Find out where the method was actually defined
    my $orig = $class->can($method) || return;
    my $subname = some_magic($orig);

    *{$subname} = sub {
        my $class = shift;
        die "$method is private" unless $class eq caller;
        $class->$orig(@_);
    };
}
In this routine, $orig is a code reference. Normally, the fully-qualified subroutine name for this method would be $class."::".$method, but since inheritance is in play, the method might actually come from somewhere else; that's why we use can() to find out where it came from.
Of course now we need to go back from the code reference to the subroutine name, so we can write our "guard" subroutine into the appropriate glob. The guard subroutine, the inner one in our code sample, checks that we're calling this from inside the appropriate class, and then dispatches to the original subroutine.
The key thing we're missing is getting the subroutine name from the code reference, and the answer to this is Devel::Peek::CvGV. Devel::Peek is better known for dumping out the internal details of Perl variables, but in this case it comes to the rescue by looking at the code reference's glob pointer, which tells us where in the symbol table it lives.
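So the some_magic() placeholder above can be filled in with a short sketch like this - CvGV hands back the glob the code reference hangs off, and the *{}{PACKAGE} and *{}{NAME} slots of that glob give us the fully-qualified name:

use Devel::Peek ();

sub some_magic {
    my $code = shift;
    my $glob = Devel::Peek::CvGV($code);   # the glob this code ref hangs off
    return *{$glob}{PACKAGE} . '::' . *{$glob}{NAME};   # e.g. "Person::date_of_birth"
}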
A "debugger" doesn't have to debug
You may well be familiar with the Perl debugger, invoked via "perl -d". As it happens, that's not "the" Perl debugger; it's just "a" Perl debugger, albeit the standard one which comes with Perl. A debugger is just something that sits in the "DB" package and implements a few subroutines in there. The DB::DB subroutine, for instance, is called by Perl for every statement in your program. DB::sub gets called for every subroutine run. "perldebguts" describes the variables available in these and other DB:: subroutines. Modules like Devel::Trace demonstrate how to write your own debugger:
# This is the important part. The rest is just fluff.
sub DB::DB {
    return unless $TRACE;
    my ($p, $f, $l) = caller;
    my $code = \@{"::_<$f"};
    print STDERR ">> $f:$l: $code->[$l]";
}
Notice that this uses the special glob for a Perl code file we discovered earlier, in order to extract the line of code currently being run. For a debugger to run neatly, it should be named "Devel::...". This is because Perl turns "perl -d:Foo" into the equivalent of "use Devel::Foo".

Devel::DProf, Devel::Coverage, and other modules in the Devel:: namespace show what can be done with customized debuggers.
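To show how little is needed, the snippet above can be packaged up as a (made-up) Devel::MyTrace and run with "perl -d:MyTrace yourscript.pl":

package Devel::MyTrace;

# Perl calls this before every statement when running under -d:MyTrace
sub DB::DB {
    my ($pkg, $file, $line) = caller;
    my $code = \@{"::_<$file"};
    print STDERR ">> $file:$line: $code->[$line]";
}

1;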
The Internals package
Our final tip is the somewhat underdocumented "Internals" package. Like UNIVERSAL, this is a built-in package provided by the Perl core, which contains a few handy functions. The first is SvREADONLY. This is what is actually used to lock hashes, as mentioned above. It takes any Perl SV container - a scalar, an array or hash, or an array or hash reference - and gets and sets the read-only flag on that container:
% perl -le '@a=(0..10); Internals::SvREADONLY($a[5], 1); $a[5]++'
Modification of a read-only value attempted at -e line 1.
Another function gets and sets the reference count of an SV:
my $count = Internals::SvREFCNT($obj);
Internals::SvREFCNT($obj, $count+1); # Make immortal
Still other functions can get and set the internal seed of the hashing algorithm for a hash, or fiddle with the placeholders set up when a locked hash is used.
Conclusion
There are still things I'm discovering even about the internals of Perl itself, and many techniques which are only now being exploited to give interesting results. Source filters sat in the Perl core for a few years before people realised what sort of things could be done with them. We've taken a look at ten of the lesser known corners of Perl... it's up to you to do interesting things with them!
TPJ