brian has been a Perl user since 1994. He is founder of the first Perl Users Group, NY.pm, and Perl Mongers, the Perl advocacy organization. He has been teaching Perl through Stonehenge Consulting for the past five years, and has been a featured speaker at The Perl Conference, Perl University, YAPC, COMDEX, and Builder.com. Contact brian at [email protected].
Template::Extract looks very cool. It has three (out of three) 5-star ratings on CPAN Ratings, several effusive blog entries, and a hack in Spidering Hacks. However, it really isn't as cool as all of the hype suggests. At least not yet. I ran into some problems with its parsing, but I fixed it up so I can specify my own parser instead of its default one. That doesn't solve all of the module's problems, but it got me through today's project.
I started with something simple. I have colon-separated fields and values, with one pair per line. I just want to get those values back. The actual template I wanted to use was a bit fancier, but the other parts were working fine.
#!/usr/bin/perl
use Data::Dumper;
use Template::Extract;

my $tte = Template::Extract->new;

my $template = <<'TEMPLATE';
foo: [% foo %]
bar: [% bar %]
TEMPLATE

my $document = <<"MAIL";
foo: fred
bar: barney
MAIL

my $data = $tte->extract( $template, $document );

print Dumper( $data );
I think this is pretty simple, but it comes out wrong. The values have newlines after them even though the newlines come after the template directives. Somehow Template::Extract thinks they belong to the values rather than the template.
$VAR1 = {
          'bar' => 'barney
',
          'foo' => 'fred
'
        };
If I change the template and document to have two template directives next to each other, the output is even further from what I want. Both values show up in one directive, leaving the other one empty.
#!/usr/bin/perl
use Data::Dumper;
use Template::Extract;

my $tte = Template::Extract->new;

my $template = <<'TEMPLATE';
foo: [% foo %]
bar: [% bar %] [% baz %]
TEMPLATE

my $document = <<"MAIL";
foo: fred
bar: barney rubble
MAIL

my $data = $tte->extract( $template, $document );

print Dumper( $data );
I expected that in the "bar" line, Template::Extract would come up with some regex that said something like "b a r : space something space something newline". Instead, it gave me "b a r : space something".
$VAR1 = {
          'bar' => '',
          'baz' => 'barney rubble
',
          'foo' => 'fred
'
        };
This might not seem strange to you at first, and I was a bit forgiving at first too, but why should the process ignore literal characters in the template? The newline and now the space from the template show up in the values from the template directives instead of their literal, hard-coded values in the template. Notice, however, it correctly leaves out the whitespace that comes after the colon.
People like this module because they're using it with templates that have a lot of other text and only some template directives. That is, there are a lot of non-whitespace hints that Template::Extract can use in those cases. Not all templates are so simple though, or have extra hints to help the parser.
If I change my template to make it look like a bit of HTML (invalid as it may be), Template::Extract does a fine job:
#!/usr/bin/perl
use Data::Dumper;
use Template::Extract;

my $tte = Template::Extract->new;

my $template = <<'TEMPLATE';
foo: [% foo %]<br/>
bar: [% bar %] [% baz %]<br/>
TEMPLATE

my $document = <<"MAIL";
foo: fred<br/>
bar: barney rubble<br/>
MAIL

my $data = $tte->extract( $template, $document );

print Dumper( $data );
It correctly picks out all the values because it doesn't have a chance to react to whitespace. It uses other hints for where things end or begin.
$VAR1 = {
          'bar' => 'barney',
          'baz' => 'rubble',
          'foo' => 'fred'
        };
I investigated this a bit more because I really want to use this module. I need to look at the template and use more hints than what it is using. First, I just want to solve the newline problem. I want to see the regexen that the module uses. If I can figure out how to change the regex, maybe I can work backwards to a code fix. I get the regexen as the return value of compile(), and then use Data::Dumper to inspect them.
#!/usr/bin/perl
use Data::Dumper;
use Template::Extract;

my $tte = Template::Extract->new;

my $template = <<'TEMPLATE';
foo: [% foo %]
bar: [% bar %] [% baz %]
TEMPLATE

# compile() returns the regex that extract() would use
my $regexen = $tte->compile( $template );

print Dumper( $regexen );
I can now see the regular expression that Template::Extract will use. The (?{}) part of the regular expression allows perl to run a bit of code in that spot. That code is the _ext() function from Template::Extract::Run. Other than that, it's not that scary. Notice that it looks for "f o o : space stuff" followed immediately by "b a r : space stuff" and so on. I don't see my literal newline in there.
$VAR1 = 'foo\\:\\ (.*?)(?{ _ext(([\'foo\'], $1, 1)) })bar\\:\\ (.*?)(?{ _ext(([\'bar\'], $2, 2)) })(.*)(?{ _ext(([\'baz\'], $3, 3)) })';
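As a rough check of why this pattern gives the lopsided result shown earlier, here is a minimal sketch that reuses the captures from the dump above as a plain Perl regex. The (?{ _ext(...) }) code blocks are left out, and the /s modifier is an assumption (inferred from the fact that the extracted values contain newlines), so this only approximates what the module does.

```perl
use strict;
use warnings;

my $document = "foo: fred\nbar: barney rubble\n";

# Same shape as the dumped regex, minus the (?{ ... }) code blocks.
# Assumption: /s is in effect so that . can also match a newline.
my $regex = qr/foo\:\ (.*?)bar\:\ (.*?)(.*)/s;

my ( $foo, $bar, $baz ) = $document =~ $regex;

# Make newlines visible when printing a captured value.
sub show {
    my ($value) = @_;
    ( my $visible = $value ) =~ s/\n/\\n/g;
    return "'$visible'";
}

printf "foo => %s\n", show($foo);    # foo => 'fred\n'
printf "bar => %s\n", show($bar);    # bar => ''
printf "baz => %s\n", show($baz);    # baz => 'barney rubble\n'
```

The non-greedy (.*?) for "bar" is satisfied by matching nothing, because the greedy (.*) right after it is happy to take everything that remains, trailing newline included.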
That's about all I can do at the user level, so I need to look under the hood in Template::Extract::Compile. (Note, an easy way to look at the source of a module is to find the file with `perldoc -l`, then load that path into your favorite editor.) It's pretty easy to find the culprit. The module tells Template::Parser to pre- and post-chomp the directives so the whitespace around the directives disappears. It's not Template::Extract that's doing it, at least not directly.
my $parser = Template::Parser->new( {
    PRE_CHOMP  => 1,
    POST_CHOMP => 1,
} );
What happens if I modify that so I have my own Template::Parser setup? Just to try it, I change the values so it doesn't chomp anything. I simply change the 1s to 0s.
my $parser = Template::Parser->new( {
    PRE_CHOMP  => 0,
    POST_CHOMP => 0,
} );
Now the regexen from my last script show the whitespace and thus can use the literal whitespace as hints to decode the cooked template. On the line above "bar", there is really a newline at the end, and that's what the \\ escapes (the backslash shows up twice because it is in a string, and one needs to survive and stay in the regex).
$VAR1 = 'foo\\:\\ (.*?)(?{ _ext(([\'foo\'], $1, 1)) })\\
bar\\:\\ (.*?)(?{ _ext(([\'bar\'], $2, 2)) })\\ (.*)(?{ _ext(([\'baz\'], $3, 3)) })';
That still doesn't solve my newline problem. With this change, I still get a newline after 'rubble'. Although there is a literal newline after the 'foo' line in the regex, there isn't one at the end of the 'bar' line.
$VAR1 = {
          'bar' => 'barney',
          'baz' => 'rubble
',
          'foo' => 'fred'
        };
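The effect of keeping the whitespace can be reproduced with a plain regex, again as a sketch rather than the module's real code path. The first pattern mirrors the unchomped dump above (code blocks omitted, /s assumed). The second pattern is hypothetical: it adds the literal newline that the "bar" line is missing, just to show what that one character would change.

```perl
use strict;
use warnings;

my $document = "foo: fred\nbar: barney rubble\n";

# Mirrors the unchomped dump: a literal newline after the foo capture and
# a literal space between the bar and baz captures, but nothing at the end.
my $unchomped = qr/foo\:\ (.*?)\nbar\:\ (.*?)\ (.*)/s;
my @without = $document =~ $unchomped;

# Hypothetical variant: the same pattern with a trailing literal newline.
my $with_newline = qr/foo\:\ (.*?)\nbar\:\ (.*?)\ (.*)\n/s;
my @with = $document =~ $with_newline;

# Print each set of captures with newlines made visible.
for my $result ( \@without, \@with ) {
    my @visible = map { ( my $v = $_ ) =~ s/\n/\\n/g; "'$v'" } @$result;
    print join( ', ', @visible ), "\n";
}
# 'fred', 'barney', 'rubble\n'
# 'fred', 'barney', 'rubble'
```

With the trailing newline in the pattern, the greedy (.*) has to give that last character back, so the final value comes out clean.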
That little bit isn't from Template::Parser though. After Template::Extract::Compile::compile() sets up the parser, it modifies the template a bit before it passes it on to the parser. It takes off all of the newlines at the end of the string.
$template =~ s/\n+$//;
As a side note, I thought this was the culprit of the first problem until I realized it doesn't use the /m modifier at the end (meaning it doesn't remove newlines from the end of 'lines', just the end of the string). I comment out that line and everything works how I want it. The values don't have trailing newlines.
$VAR1 = {
          'bar' => 'barney',
          'baz' => 'rubble',
          'foo' => 'fred'
        };
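To see that substitution in isolation, here is a small sketch that applies it to a template string with two trailing newlines. It also shows, for comparison, one possible alternative that collapses the trailing run to a single newline instead of removing it. The variable names are only for illustration.

```perl
use strict;
use warnings;

my $template = "foo: [% foo %]\nbar: [% bar %] [% baz %]\n\n";

# The substitution from compile(). Without /m, $ matches only at the end
# of the string (or just before a final newline), so the newline after
# the foo line is left alone.
( my $stripped = $template ) =~ s/\n+$//;

# Alternative: collapse the trailing newlines to exactly one.
( my $kept_one = $template ) =~ s/\n+$/\n/;

# Count the newlines that survive in each version.
printf "original: %d newlines\n", scalar( () = $template =~ /\n/g );    # 3
printf "stripped: %d newlines\n", scalar( () = $stripped =~ /\n/g );    # 1
printf "kept one: %d newlines\n", scalar( () = $kept_one =~ /\n/g );    # 2
```

In the stripped version the only newline left is the interior one after the "foo" line, which is why the last directive has no literal newline to stop at.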
Now that I know how I want it to work, like a good open source programmer I need to translate that into a patch that makes my knowledge useful to the rest of the world. Just like I don't want to use Template::Extract's idea of what the parser should do, other people probably don't want my idea. I can change the compile() function to take a parser as an optional second argument. If I don't supply that second argument, or the second argument isn't a Template::Parser (or one of its subclasses), it makes its parser like it did before. And, instead of trimming all the newlines at the end of the file, I keep one.
sub compile {
    my ( $self, $template, $parser ) = @_;

    $parser = undef unless UNIVERSAL::isa( $parser, 'Template::Parser' );

    $self->_init();

    if ( defined $template ) {
        $parser ||= Template::Parser->new( {
            PRE_CHOMP  => 1,
            POST_CHOMP => 1,
        } );

        $parser->{FACTORY} = ref($self);

        $template = $$template if UNIVERSAL::isa( $template, 'SCALAR' );
        $template =~ s/\n+$/\n/;
        $template =~ s/\[%\s*(?:\.\.\.|_|__)\s*%\]/[% \/.*?\/ %]/g;
        $template =~ s/\[%\s*(\/.*?\/)\s*%\]/'[% "' . quotemeta($1) . '" %]'/eg;

        return $parser->parse($template)->{BLOCK};
    }

    return undef;
}
This isn't the only subroutine I have to change, since most people won't call compile() directly. Most people will use Template::Extract::extract(), which then calls compile() for them. That function needs to know about the extra argument too. The extract() method takes its arguments and rearranges them for its call to run().
sub extract {
    my( $self, $template, $document, $values, $parser ) = @_;

    $self->run( $self->compile($template, $parser), $document, $values );
}
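The parser selection in the patched compile() boils down to "use the caller's parser if it really is one, otherwise build the old default". Here is a self-contained sketch of just that logic. Sketch::Parser, Sketch::Compiler, and choose_parser() are hypothetical stand-ins invented for this illustration so that it runs without Template Toolkit installed; they are not part of Template::Extract.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Template::Parser; it only records its options.
package Sketch::Parser;

sub new {
    my ( $class, $options ) = @_;
    return bless { %{ $options || {} } }, $class;
}

# Hypothetical stand-in for the compiler, reduced to the parser selection.
package Sketch::Compiler;

sub new { return bless {}, shift }

sub choose_parser {
    my ( $self, $parser ) = @_;

    # Ignore anything that is not a Sketch::Parser (or a subclass of it).
    $parser = undef unless UNIVERSAL::isa( $parser, 'Sketch::Parser' );

    # Fall back to the old chomping defaults when no usable parser is given.
    $parser ||= Sketch::Parser->new( { PRE_CHOMP => 1, POST_CHOMP => 1 } );

    return $parser;
}

package main;

my $compiler = Sketch::Compiler->new;
my $mine     = Sketch::Parser->new( { PRE_CHOMP => 0, POST_CHOMP => 0 } );

my $default = $compiler->choose_parser();                 # nothing supplied
my $custom  = $compiler->choose_parser($mine);            # a real parser
my $bogus   = $compiler->choose_parser('not a parser');   # wrong type

printf "default PRE_CHOMP: %d\n", $default->{PRE_CHOMP};  # 1
printf "custom  PRE_CHOMP: %d\n", $custom->{PRE_CHOMP};   # 0
printf "bogus   PRE_CHOMP: %d\n", $bogus->{PRE_CHOMP};    # 1
```

Quietly falling back on a wrong-typed argument keeps existing callers working, at the cost of hiding a mistake from anyone who passes the wrong thing.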
Now everything works for me. All I have to do is send in the patch.
TPJ