Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Web Development

A Better Data::Dumper


March, 2004: A Better Data::Dumper

Randal is a coauthor of Programming Perl, Learning Perl, Learning Perl for Win32 Systems, and Effective Perl Programming, as well as a founding board member of the Perl Mongers (perl.org). Randal can be reached at [email protected].


Afew years ago, I stared quite heavily at the source to the core module Data::Dumper—enough to make my eyes hurt. I was trying to "reverse engineer" the output, so that I could write an un-dumper that would evaluate the resulting string of Perl code and get the original values back, without unleashing the full Perl expression evaluator. I succeeded in that, although the results were far too slow to be useful in a practical sense.

However, while I was staring at Data::Dumper's guts, I noticed that there seemed to be no provision for noticing that a scalar reference was a reference to a scalar that existed as the value of another array or hash element and, thus, it dumped those values incorrectly. For example:

use Data::Dumper;
$Data::Dumper::Purity = 1; # try your hardest
my @values = qw(zero one two three);
my $ref_to_element = \$values[1];
my $all = [$ref_to_element, \@values];
print Dumper($all);


which results in:

$VAR1 = [
           \'one',
           [
              'zero',
              ${$VAR1->[0]},
              'two',
              'three'
            ]
        ];

The problem is that $VAR1->[0] is a reference to one copy of "one," while $VAR->[1]->[1] is a different copy of "one," so changing one won't change the other. The link between the two elements has been severed.

I immediately reported the bug to the Perl developers, but this three-year-old bug has not yet been fixed. Rather than simply raise the issue again, I decided it was time to whip out the coding palette and provide some reference code that can do references correctly, especially since the problem also seems to exist in the YAML library and the Perl debugger's x function as well. (Only Storable seemed to do the right thing—good for them.)

And I'll have to say it was a fun exercise, which I bring to you as Listing 1. Because the listing is rather long, I'll focus on some of the key points rather than my usual rambling style.

The goal is simple: Write an uneval routine, such that the sequence of:

use Data::Stringer qw(uneval);
my $string = uneval(@some_list);
my @new_list = eval $string;

results in @new_list being a deep copy of @some_list, even if the list contains scalars, references to arrays, references to hashes, and blessed references of those. And, of course, references to the thing must not result in the thing being copied, but being referenced instead. For example, the above data gets dumped as:

use Data::Stringer;
my @values = qw(zero one two three);
my $ref_to_element = \$values[1];
my $all = [$ref_to_element, \@values];
print uneval $all;

which results in the string of:

my (@X806f84, @X810114, @X8133a4);
@X806f84 = ('zero', 'one', 'two', 'three');
@X810114 = (00, \@X806f84);
@X8133a4 = (\@X810114);
$X810114[0] = \$X806f84[1];
@X8133a4;

Although this string isn't quite as pretty as the Data::Dumper version, it's more accurate. Notice the next-to-last line, which forces the first element of the result array to be a reference to the second element of the nested array. That's the crucial piece missing in the Data::Dumper version.

The dumping strategy is rather simple minded and broken into two main passes. On the first pass, we walk the supplied list of values, recursively, creating a symbol table %stab, declared in line 17. This is accomplished with a queue of values to be processed in line 31. The %stab hash will end up being populated with three kinds of entries. Scalars have a key of $X followed by a hex address of the actual symbol table address (as returned by stringifying a reference to that item). Similarly, arrays and hashes have @X and %X followed by the hex address, respectively.

Both arrays and hashes hold the reference to the value as the value in the %stab hash. The scalars are a bit different: Their value is a one- or three-element array ref. The first element is a reference to the original scalar value. The second and third elements are populated when we find a scalar with that address as a value within an array or hash that we're scanning. The second element is a name like the keys of %stab (and should map to an entry when pass 1 is complete), and the third element is the array index or hash key. This is the missing piece in Data::Dumper and friends: the record of where a scalar might live if not as a separate symbol table location.

The recursion comes about from the core of pass_1_item, defined in lines 35-68. Each item to be dumped is a reference to a scalar, array, or hash. Line 38 constructs the appropriate %stab key using the ref_to_label routine. This routine is defined down in lines 202-218 and uses overload::StrVal to ensure that we can extract an unoverloaded string value for the reference even if the class has a stringification overload method. $id is the hex address, usually beginning in 0x. Line 208 converts this string into a suitable identifier component. Lines 209-217 sort out the core type (not considering whether or not the reference is blessed), and return back a variable name of the appropriate type to hold the value.

Back up in pass_1_item, we check this string again (line 41) to see its native type. If it's a scalar, line 42 stores the value (possibly autovivifying an array ref: Thank you Perl!), and moves on. If the scalar value is also a reference, then we need to dump the referenced scalar, so the reference is pushed onto the working @queue (line 43).

For an array or hash, things get a bit more complicated because we must keep track of any elements in case they are referenced from somewhere else. The code is similar. First, store the reference into %stab (lines 45 or 54), then walk through the values (beginning in lines 46 and 55). For each element, we take its address and create a %stab entry, noting the containing data structure and key or index used to access the value (lines 49 and 58). And, if the element is a reference to somewhere, we also add it to the work queue (lines 50 and 59).

Speaking of the work queue, we have to allow for the possibility of mutually recursive and self-recursive data structures:

my @one = qw(won one);
my @two = qw(two too to);
push @one, \@two;
push @two, \@one;
my $string = uneval(@one);

As we're scanning @one, we'll need to follow the reference to @two at the end. But when we get to the end of @two, we don't want to scan over @one again. Line 39 handles the duplicate scanning rejection by simply refusing to scan any particular scalar, array, or hash more than once.

The first item dumped is the input parameters. Because the input parameters need to be dumped as the output, we retain the %stab key in line 30 being returned from the first invocation of pass_1_item at line 66. This particular array name will be the designated output array as well.

Once pass 1 is complete, every scalar, array, and hash that belongs to the dumped set has been identified and copied to the virtual symbol table. To dump the data, we merely need to walk this virtual symbol table. The pass_2 routine (lines 74-85) manages the process. The steps can be seen as: declaring the variables, initializing those variables (except for deferred entries), handling the deferred items, blessing any blessed references, and then evaluating the designated top-level array as a result.

First, the declarations are dumped, using pass_2_declarations defined in lines 118-126. A single my construct encloses all scalar, array, and hash names, except for those scalars that exist as elements of another array or hash.

Then the bulk of the work comes out of the initialization phase, starting with pass_2_initializations defined in lines 128-133. Key-value pairs from %stab are passed in to pass_2_initialization, which is defined starting in line 142. If it's a scalar (line 146), it's a simple assignment, unless the value was an element of a larger data structure, in which case, it simply disappears.

The value for any scalar (variable or element) might be a reference to an element of an array or hash, however, and this is where pass_2_value comes in to help. Looking back to the definition (starting in line 87), we see that references to scalars are handled specially. If the reference to a scalar is an element of an array or hash already seen, then line 95 will have a three-element list, setting $place and $index to the actual scalar's location. In that case, we can't provide a scalar value for this initialization. Instead, we add a @deferred element, which does the initialization after all other initializations are complete, and return a 00 value instead. This double-0 value is just a 0, but gives an indication to me staring at the output that this value will be replaced during the deferred stage, just as we saw in the example earlier.

Array and hash initialization works similarly in lines 152-165, except that we have to keep track of which element we are looking at in case the deferred initialization needs to reference an element of a larger structure (as in the previous example).

Once the core initializations are complete, we go back to dump out the deferred initializations (if any). This patches up all the values that were dumped as "00" during the initialization subpass, to point at the elements of arrays and elements of hashes as needed. Then, it's time for a blessing or two, perhaps. Lines 135-140 call pass_2_blessing for each %stab entry, defined in lines 173-188. If it's a scalar, we need to get the actual element out of the array ref, noting its location for the proper blessing if it's also an array or hash element.

Lines 182-187 determine the proper blessed class, getting around any issues with an overloaded stringification once again. And if the value is blessed, the proper blessing is generated in line 184.

The only thing left to describe is quote_scalar, which generates a nice printable representation of a scalar. The undef value is a simple undef return. Otherwise, if the value is safe as a number, the number form is preferred. Otherwise, a single-quoted string is conjured up. I seem to recall that there are numbers that do not stringify well, but I couldn't figure out how to construct one in time for this article deadline. But the die check at the end provides protection in that case, anyway.

So, there it is. A better Data::Dumper that handles references to arrays and references to hashes. Of course, the real Data::Dumper has a lot more bells and whistles, so I hope that the authors of data-dumping routines will use this code as a model, rather than hoping that I will eventually replace their code. Until next time, enjoy!

TPJ



Listing 1

=1=     package Data::Stringer;
=2=     
=3=     use 5.006;
=4=     use strict;
=5=     use warnings;
=6=     
=7=     require Exporter;
=8=     
=9=     our @ISA = qw(Exporter);
=10=    
=11=    our @EXPORT = qw(uneval);
=12=    
=13=    our $VERSION = '0.01';
=14=    
=15=    require overload;
=16=    
=17=    my %stab;
=18=    
=19=    ## $stab{'@x0x123456'} = \@thevalue
=20=    ## $stab{'%x0x123456'} = \%thevalue
=21=    ## $stab{'$x0x123456'} = [\$thevalue]
=22=    ## $stab{'$x0x123456'} = [\$thevalue, $aggregate, $index] # for                              # elements
=23=    
=24=    BEGIN {
=25=    
=26=      my @queue;
=27=    
=28=      sub uneval {
=29=        %stab = @queue = ();
=30=        my $label = pass_1_item(\@_);       # prime the pump
=31=        pass_1_item(shift @queue) while @queue; # drain the pump
=32=        return pass_2($label);                # dump the result
=33=      }
=34=    
=35=      sub pass_1_item {
=36=        my $ref = shift;
=37=    
=38=        my $label = ref_to_label($ref);
=39=        return $label if $stab{$label}; # already seen
=40=    
=41=        if ($label =~ /^\$/) {      # scalar
=42=          $stab{$label}[0] = $ref;
=43=          push @queue, $$ref if ref $$ref;
=44=        } elsif ($label =~ /^\@/) { # array
=45=          $stab{$label} = $ref;
=46=          for my $index (0..$#$ref) {
=47=            for ($ref->[$index]) {     # carefully creating alias, not     # copy
=48=              my $thislabel = ref_to_label(\$_);
=49=              $stab{$thislabel} = [\$_, $label, $index];
=50=              push @queue, $_ if ref $_;
=51=            }
=52=          }
=53=        } elsif ($label =~ /^%/) {  # hash
=54=          $stab{$label} = $ref;
=55=          for my $key (keys %$ref) {
=56=            for ($ref->{$key}) {       # carefully creating alias, not     # copy
=57=              my $thislabel = ref_to_label(\$_);
=58=              $stab{$thislabel} = [\$_, $label, $key];
=59=              push @queue, $_ if ref $_;
=60=            }
=61=          }
=62=        } else {
=63=          die "Cannot process $label yet";
=64=        }
=65=    
=66=        return $label;
=67=      }
=68=    }
=69=    
=70=    BEGIN {
=71=    
=72=      my @deferred;
=73=    
=74=      sub pass_2 {
=75=        my $result_label = shift;
=76=    
=77=        @deferred = ();
=78=        return join("",
=79=                    pass_2_declarations(),
=80=                    pass_2_initializations(),
=81=                    map("$_\n", @deferred),
=82=                    pass_2_blessings(),
=83=                    "$result_label;\n",
=84=                   );
=85=      }
=86=    
=87=      sub pass_2_value {
=88=        my $value = shift;
=89=        my $set_place = shift;
=90=        my $set_index = shift;
=91=    
=92=        if (ref $value) {
=93=          my $label = ref_to_label($value);
=94=          if ($label =~ /^\$/) {    # it is a scalar, so it might be an     # element
=95=            (my ($value, $place, $index) = @{$stab{$label}}) >= 1 or die;
=96=            if ($place) {
=97=              if ($place =~ /^[@%]/) {
=98=                push(@deferred,
=99=                     element_of($set_place, $set_index) . " = \\" .
=100=                    element_of($place, $index) . ";");
=101=               return "00";    # placeholder for a deferred     # action
=102=             } else {
=103=               die "dunno place $place";
=104=             }
=105=           } else {
=106=             return "\\$label";    # no place in particular
=107=           }
=108=         } else {
=109=           return "\\$label";
=110=         }
=111=       } else {
=112=         return quote_scalar($value);
=113=       }
=114=     }
=115=   
=116=   }
=117=   
=118=   sub pass_2_declarations {
=119=     return join("",
=120=                 "my (",
=121=                 join(", ",
=122=                      grep {
=123=                        /^[\@%]/ or /^\$/ and not $stab{$_}[1]
=124=                      } keys %stab),
=125=                 ");\n");
=126=   }
=127=   
=128=   sub pass_2_initializations {
=129=     return join("",
=130=                 map(pass_2_initialization($_, $stab{$_}),
=131=                     sort keys %stab),
=132=                );
=133=   }
=134=   
=135=   sub pass_2_blessings {
=136=     return join("",
=137=                 map(pass_2_blessing($_, $stab{$_}),
=138=                     sort keys %stab),
=139=                );
=140=   }
=141=   
=142=   sub pass_2_initialization {
=143=     my $label = shift;
=144=     my $value = shift;
=145=   
=146=     if ($label =~ /^\$/) {        # scalar
=147=       if (@$value > 1) {          # it's an element:
=148=         return "";
=149=       } else {
=150=         return "$label = ".pass_2_value(${$value->[0]}).";\n";
=151=       }
=152=     } elsif ($label =~ /^\@/) {   # array
=153=       return "$label = (".join(", ",
=154=                                map {
=155=                                  pass_2_value($value->[$_], $label,                           $_);
=156=                                } 0..$#$value,
=157=                               ).");\n";
=158=     } elsif ($label =~ /^%/) {    # hash
=159=       return "$label = (".join(", ",
=160=                                map {
=161=                                  pass_2_value($_) .
=162=                                    " => " .
=163=                                      pass_2_value($value->{$_},                       $label, $_);
=164=                                } keys %$value,
=165=                               ).");\n";
=166=     } else {
=167=       die "Cannot process $label yet";
=168=     }
=169=   
=170=   }
=171=   
=172=   
=173=   sub pass_2_blessing {
=174=     my $label = shift;
=175=     my $value = shift;
=176=   
=177=     ## get to the proper location of an element for scalars
=178=     if ($label =~ /^\$/) {
=179=       $label = element_of($value->[1], $value->[2]) if @$value > 1;
=180=       $value = $value->[0];
=181=     }
=182=     my ($package) = overload::StrVal($value) =~ /^(.*)=/;
=183=     if (defined $package) {       # it's blessed
=184=       return "bless \\$label, ".quote_scalar($package), ";\n";
=185=     } else {
=186=       return "";
=187=     }
=188=   }
=189=   
=190=   sub element_of {
=191=     my $label = shift;
=192=     my $index = shift;
=193=     if ($label =~ s/^\@/\$/) {
=194=       return "$label\[".quote_scalar($index)."\]";
=195=     } elsif ($label =~ s/^%/\$/) {
=196=       return "$label\{".quote_scalar($index)."\}";
=197=     } else {
=198=       die "Cannot take element_of($label, $index)";
=199=     }
=200=   }
=201=   
=202=   sub ref_to_label {
=203=     my $ref = shift;
=204=   
=205=     ## eventually do something with $realpack
=206=     my ($realpack, $realtype, $id) =
=207=       (overload::StrVal($ref) =~ /^(?:(.*)\=)?([^=]*)\(([^\(]*)\)$/)                              or die;
=208=     s/^0x/X/ or s/^/X/ for $id;
=209=     if ($realtype eq "SCALAR" or $realtype eq "REF") {
=210=       return "\$$id";
=211=     } elsif ($realtype eq "ARRAY") {
=212=       return "\@$id";
=213=     } elsif ($realtype eq "HASH") {
=214=       return "%$id";
=215=     } else {
=216=       die "dunno $ref => $realpack $realtype $id";
=217=     }
=218=   }
=219=   
=220=   sub quote_scalar {
=221=     local $_ = shift;
=222=     if (!defined($_)) {
=223=       return "undef";
=224=     }
=225=     {
=226=       no warnings;
=227=       if ($_ + 0 eq $_) {         # safe as a number...
=228=         return $_;
=229=       }
=230=       if ("$_" == $_) {           # safe as a string...
=231=         s/([\\\'])/\\$1/g;
=232=         return '\'' . $_ .  '\'';
=233=       }
=234=     }
=235=     die "$_ is not safe as either a number or a string";
=236=   }
=237=   
=238=   1;
Back to article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.