Simon is a freelance programmer and author, whose titles include Beginning Perl (Wrox Press, 2000) and Extending and Embedding Perl (Manning Publications, 2002). He's the creator of over 30 CPAN modules and a former Parrot pumpking. Simon can be reached at simon@simon-cozens.org.
In last month's column, we saw how to put together a Curses-based RSS feed reader. We began by looking around on CPAN and assembling a collection of modules that did most of the job for us, and spent the rest of the article merrily plugging them together. When it comes to the second part of the problem, making the RSS reader talk to existing databases of feed information, things are a little different. We're on our own.
Where We Are Now
At the moment, our copy of press has a hard-coded list of RSS feeds that we want it to read. Each time we restart press, it forgets what we've already read; if you remember the original point of writing press, it was to enable me to take home the data from my normal RSS reader, NetNewsWire, to use its caching and record of which articles had been read, and then return to my usual network source and have NetNewsWire carry on as if I'd been reading all the time.
To do this, we'll need to both read and write the format used by NetNewsWire, and then alter our press program to use the NetNewsWire files as a data source instead of our hard-coded list. To make this a generic change, we'll split off the data-store parsing code into a module, Press::NNW1; this means that if I later move to a new version of NNW, or indeed any other news reader, I can simply create a module to read and write its data stores, and slot that in with no changes to press.
NNW uses two sources of information about what we're reading: The first is an XML "plist" file of the feeds we're subscribed to, and the second is a SQLite database of which articles we have seen.
A plist is an XML data serialization format invented by Apple, and is capable of storing more complex kinds of data than our simple list of URLs and titles. In fact, NNW stores a hierarchy of feeds, allowing us to group feeds into categories and subcategories. So part of my preferences file looks like this:
    <dict>
      <key>isContainer</key>
      <string>1</string>
      <key>name</key>
      <string>Tech News</string>
      <key>childrenArray</key>
      <array>
        <dict>
          <key>home</key>
          <string>http://www.google.com/googleblog/</string>
          <key>name</key>
          <string>Google Blog</string>
          <key>rss</key>
          <string>http://www.google.com/googleblog/atom.xml</string>
          ...
        </dict>
        <dict>
          <key>home</key>
          <string>http://www.groklaw.net</string>
          ...
        </dict>
      </array>
    </dict>
Plist dict elements, or dictionaries, are a bit like hashes (and array elements are, well, arrays) so here we have a hash that is a container, called "Tech News," which contains some feeds, themselves expressed as hashes. We can use the Mac::PropertyList module to turn this into a Perl data structure:
    use Mac::PropertyList;

    use constant PREFS => "Library/Preferences/com.ranchero.NetNewsWire.plist";

    # Field 7 of the passwd entry is the user's home directory
    my $file = (getpwuid $<)[7]."/".PREFS;
    my $stuff = Mac::PropertyList::parse_plist_file($file);
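The (getpwuid $<)[7] idiom is terse, so here is a small standalone sketch of the same path-building step. The prefs_path helper name and the fallback to $ENV{HOME} are assumptions made for illustration; they are not part of press.

```perl
use strict;
use warnings;
use File::Spec;

use constant PREFS => "Library/Preferences/com.ranchero.NetNewsWire.plist";

# Hypothetical helper: return the caller's path if one was given,
# otherwise the default preferences file under the home directory.
sub prefs_path {
    my ($file) = @_;
    return $file if defined $file && length $file;

    # $< is the real user ID; field 7 of its passwd entry is the home dir.
    my ($home) = (getpwuid $<)[7];
    $home = $ENV{HOME} unless defined $home;   # assumption: fall back to $HOME
    $home = '' unless defined $home;

    return File::Spec->catfile($home, PREFS);
}

print prefs_path(), "\n";
```

Passing an explicit path, as in prefs_path("/tmp/test.plist"), skips the lookup, which is handy when testing against a copy of the preferences file.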
To get the effect of supporting groups of feeds, we'll actually turn this into a flat list of titles, some of which will only be used for display purposes. That is, something like:
    @titles = (
        "Torgo-X zhoornal",
        "[ Tech News ]",
        " Slashdot",
        " Planet Perl",
        " Groklaw",
        "[ Friends ]",
        " No-sword",
        ...
    );
Notice that those entries in brackets are not real RSS feeds; they're just titles for the relevant group. Also note that we've indented the feeds inside each group by a space.
The second piece of information we have from NetNewsWire is a file in Library/Application Support/NetNewsWire called "History.db." The first line of this file tells us something useful:
    % head -1 ~/Library/Application\ Support/NetNewsWire/History.db
    ** This file contains an SQLite 2.1 database **
Looking at it with the ever-so-handy DBI::Shell, we find that it is very simply constructed: one history table, which contains the columns link, flRead and flFollowed (flags to signify if the link was read or followed), and dateMarked. The link is, obviously, the link of each item in the feed, although it is munged a little from being merely a URL: It has a number in front of the URL and one at the end. We can guess that these are related to the ability to detect when the content of an article has changed, and indeed, with a little playing around, we find that the first number is the length of the headline and the second number is the length of the content. We're going to have to both read and write these rows in the history table to ensure that the history file is kept updated when we're using press. Now that we've seen what we want, how do we get there?
Easy is in the Eye of the Beholder
I used to have long talks with people about the usefulness (or lack thereof) of "design patterns" and what patterns there are for Perl programming. It took me a while to realize that I used many design patterns in my Perl programming, but I did so unconsciously and instinctively. So for me, a list of Perl design patterns wouldn't make sense because I automatically recognize the pattern I need when the situation calls for it.
The situation we have with turning the nested XML structure into a flat array is one such case. When I was confronted with this, I was already halfway through writing a recursive closure, and suddenly the problem was solved. Here's how I did it. We'll start with the outline of the code:
    use constant PREFS  => "Library/Preferences/com.ranchero.NetNewsWire.plist";
    use constant DBPATH => "Library/Application Support/NetNewsWire/";

    sub get_feedlist {
        my ($self, $file, $db) = @_;
        # Fall back to the default locations under the user's home directory
        $file ||= (getpwuid $<)[7]."/".PREFS;
        $db   ||= (getpwuid $<)[7]."/".DBPATH."History.db";
        my (%labels, @values);
        my $stuff = Mac::PropertyList::parse_plist_file($file);
        # Do some magic with $stuff here
        return (\@values, \%labels);
    }
As with last time, @values will be the list of blog names and group names, and %labels is the mapping between names and feed objects. For the moment, we'll only think about filling @values by extracting the names from the plist. Since the data structure is intrinsically recursive, we should already be thinking of a recursive walk over the plist tree. Something like this:
    sub walk_plist {
        my ($data, $depth) = @_;
        my @rv;
        for my $feed (@$data) {
            if ($feed->{isContainer}) {
                # It's a group, not a feed,
                # so note it and recurse.
                push @rv, (" " x $depth)."[".$feed->{name}->value."]",
                          walk_plist($feed->{childrenArray}, $depth + 1);
            } else {
                # It's an ordinary feed
                push @rv, (" " x $depth).$feed->{name}->value;
            }
        }
        return @rv;
    }
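To try the recursive walk without a real plist, here's a minimal sketch that runs the same logic over a hand-built structure. The feed names are taken from the list above, but the data itself is made up, and plain strings stand in for the objects Mac::PropertyList returns (so there are no ->value calls). The walk_names name is likewise only for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the parsed plist: plain hashes and strings.
my $feeds = [
    { name => 'Torgo-X zhoornal' },
    {
        isContainer   => 1,
        name          => 'Tech News',
        childrenArray => [
            { name => 'Slashdot' },
            { name => 'Groklaw' },
        ],
    },
];

sub walk_names {
    my ($data, $depth) = @_;
    my @rv;
    for my $feed (@$data) {
        my $indent = ' ' x $depth;
        if ($feed->{isContainer}) {
            # A group: emit a bracketed title, then its children one level deeper.
            push @rv, $indent . "[$feed->{name}]",
                      walk_names($feed->{childrenArray}, $depth + 1);
        }
        else {
            # An ordinary feed: just the indented name.
            push @rv, $indent . $feed->{name};
        }
    }
    return @rv;
}

my @titles = walk_names($feeds, 0);
print "$_\n" for @titles;
```

This prints the top-level feed, then "[Tech News]", then the two grouped feeds, each indented by one space.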
This is great, except for two things: First, we have to deal with two variables (@values and %labels); and second, we're passing around a lot of values as we build up the various lists. Isn't there an easier way to do it? Well, "easier" is rather subjective, but by turning this subroutine into a closure, we can avoid passing around the variables, and keep everything tidily in one place.
    sub get_feedlist {
        my ($self, $file, $db) = @_;
        $file ||= (getpwuid $<)[7]."/".PREFS;
        $db   ||= (getpwuid $<)[7]."/".DBPATH."History.db";
        my (%labels, @values, $walker);
        my $stuff = Mac::PropertyList::parse_plist_file($file);
        $walker = sub {
            my ($stuff, $depth) = @_;
            for my $feed (@$stuff) {
                if ($feed->{isContainer}) {
                    push @values, (" " x $depth)."[".$feed->{name}->value."]";
                    $walker->($feed->{childrenArray}, $depth + 1);
                } else {
                    my $name = $feed->{name}->value;
                    # Make an appropriate feed object
                    my $rss = ...;
                    push @values, $rss;
                    $labels{$rss} = (" " x $depth).$name;
                }
            }
        };
        $walker->($stuff, 0);
        return (\@values, \%labels);
    }
Since we're using a closure, we've got access to the @values array and the %labels hash in the enclosing scope; we don't need to pass anything around because it's all there. And as it's a closure, the recursion works just fine, too. Of course, we skipped over a little bit: making the feed objects. This is where the history database comes in.
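One aside on the recursive closure: because the anonymous sub refers to the very variable that holds it, the two form a reference cycle that Perl's reference counting won't free by itself. For a short-lived program like press that hardly matters, but here is a sketch of the same pattern with the cycle broken afterwards. The data is made up, and URL strings stand in for the feed objects.

```perl
use strict;
use warnings;

# Hypothetical stand-in data; URL strings play the role of feed objects.
my $tree = [
    { name => 'A', rss => 'http://example.org/a.rss' },
    {
        isContainer   => 1,
        name          => 'Tech',
        childrenArray => [
            { name => 'B', rss => 'http://example.org/b.rss' },
        ],
    },
];

my (@values, %labels);
my $walker;
$walker = sub {
    my ($nodes, $depth) = @_;
    for my $node (@$nodes) {
        if ($node->{isContainer}) {
            # Group title: display only, then descend into the children.
            push @values, (' ' x $depth) . "[$node->{name}]";
            $walker->($node->{childrenArray}, $depth + 1);
        }
        else {
            # Real feed: record it and remember its indented label.
            push @values, $node->{rss};
            $labels{ $node->{rss} } = (' ' x $depth) . $node->{name};
        }
    }
};

$walker->($tree, 0);
undef $walker;    # drop the self-reference so the closure can be freed

print "$_\n" for @values;
```

Both @values and %labels are filled in as a side effect of the walk, which is exactly what lets the closure version avoid returning and merging lists at every level.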
History Repeating
Just as we needed a special subclass of XML::RSS::Feed to handle feeds while controlling when articles get marked as read, we'll need another subclass of that to control reading and writing the marked-read status to the NetNewsWire database.
    package XML::RSS::Feed::NNW1;
    use base 'XML::RSS::Feed::Manual';
First, we'll create an accessor for the history database file, so that we can tell each feed where to look to find its database:
    sub history {
        my $self = shift;
        $self->{history} = shift if @_;
        $self->{history};
    }
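An accessor like this earns its keep when a constructor hands each of its options to a method of the same name. The sketch below shows that general pattern in a made-up Demo::Feed class; it illustrates the idea only and is not the feed library's actual constructor.

```perl
use strict;
use warnings;

package Demo::Feed;    # hypothetical class, for illustration only

sub new {
    my ($class, %args) = @_;
    my $self = bless {}, $class;
    # Hand each option to the accessor of the same name, if there is one.
    for my $option (sort keys %args) {
        next unless $self->can($option);
        $self->$option($args{$option});
    }
    return $self;
}

# Combined getter/setter accessors, in the same style as history() above.
sub url     { my $self = shift; $self->{url}     = shift if @_; $self->{url} }
sub history { my $self = shift; $self->{history} = shift if @_; $self->{history} }

package main;

my $feed = Demo::Feed->new(
    url     => 'http://example.org/feed.rss',
    history => '/tmp/History.db',
    bogus   => 1,    # no accessor of this name, so it is skipped
);
print $feed->history, "\n";
```

With this pattern, adding a new constructor option to a subclass is just a matter of defining one more accessor.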
We need this because the constructor calls an accessor for each of the options passed to it. Next, we need to hook into the headline-loading process, to mark the items that the database says are read. _load_cached_headlines is a sensible place to do this:
    use DBI;

    sub _load_cached_headlines {
        my $self = shift;
        $self->SUPER::_load_cached_headlines;
        # Now load the history
        my $dbh = DBI->connect("dbi:SQLite2:$self->{history}") or return;
At this stage, we have a list of headlines. We check the database for each one to see whether it has been seen before:
        my $sth = $dbh->prepare("SELECT * from history where link = ?");
        for my $head ($self->headlines) {
            $sth->execute($self->encode_headline($head));
            # Any matching row means NetNewsWire has already seen this item
            $self->SUPER::mark_seen($head) if @{$sth->fetchall_arrayref};
        }
    }
The encode_headline method will be used to transform the article's URL into the encoded form used by NetNewsWire in its storage. We call the superclass' mark_seen, because our own mark_seen is going to look like this:
    use Time::Piece;

    sub mark_seen {
        my ($self, $head) = @_;
        $self->SUPER::mark_seen($head);
        my $dbh = DBI->connect("dbi:SQLite2:$self->{history}") or return;
        $dbh->do(
            "insert into history (link, flRead, flFollowed, dateMarked)
             values (?, 1, 0, ?)",
            {},
            $self->encode_headline($head),
            localtime->strftime("%Y%m%d"),
        );
    }
Time::Piece is a handy little module that allows us to parse and format the current time in various ways, and we can use that to spit out the date in the format used by the history database. Again, we need to encode the headline. Once we've done that, we put a row into the history file so that it's marked when we go back to using NetNewsWire.
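For readers who haven't met Time::Piece, here's a short sketch of the formatting call used above, plus the reverse operation. It uses a fixed epoch value and a made-up date string so that the result doesn't depend on when it's run.

```perl
use strict;
use warnings;
use Time::Piece;

# Importing Time::Piece makes localtime and gmtime return objects.
# Format a fixed moment (the Unix epoch, in UTC) as YYYYMMDD.
my $stamp = gmtime(0)->strftime("%Y%m%d");
print "$stamp\n";

# Going the other way: parse a YYYYMMDD string back into an object.
my $parsed = Time::Piece->strptime("20050314", "%Y%m%d");
print $parsed->ymd, "\n";
```

The first line printed is 19700101 and the second is 2005-03-14; in mark_seen, localtime takes the place of gmtime(0) so that the row is stamped with today's date.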
Finally, we need to do two things: The first is to implement encode_headline. We need to prepare the headline and description by trimming leading and trailing space from them:
    sub encode_headline {
        my ($self, $head) = @_;
        my $d = $head->description;
        my $h = $head->headline;
        chomp ($d, $h);
        $d =~ s/^[\n\s]+//; $d =~ s/\s+$//;
        $h =~ s/^\s+//;     $h =~ s/\s+$//;
Then we return the URL bracketed with the length of the description and then the length of the headline:
        return length($d).$head->url.length($h);
    }
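To make the encoding concrete, here is a standalone sketch that works on plain strings rather than headline objects. It follows the order given in the prose just above (description length first, headline length last); the trim and encode_link names and the sample strings are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading and trailing whitespace.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+//;
    $s =~ s/\s+$//;
    return $s;
}

# Hypothetical helper: bracket the URL with the two trimmed lengths.
sub encode_link {
    my ($headline, $url, $description) = @_;
    return length(trim($description)) . $url . length(trim($headline));
}

my $key = encode_link(
    "  Hello, world \n",           # trims to 12 characters
    "http://example.com/post",
    "\nSome text.  ",              # trims to 10 characters
);
print "$key\n";
```

This prints 10http://example.com/post12. Because the lengths are taken after trimming, stray whitespace in a feed doesn't change the key, but any real edit to the headline or description does.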
The second is to create these XML::RSS::Feed::NNW1 objects in our recursive closure:
    my $url = $feed->{rss}->value;
    my $name = $feed->{name}->value;
    my $rss = XML::RSS::Feed::NNW1->new(
        url     => $url,
        history => $db,
    );
    $labels{$rss} = (" " x $depth).$name;
And we have to tell press not to use our static list of feeds, but instead to use the Press::NNW1 module to get its list:
    use Press::NNW1;

    my ($values, $labels) = Press::NNW1->get_feedlist();
    $feedbox->values($values);
    $feedbox->labels($labels);
    bolden_news();
And now press can read and write NetNewsWire 1 histories!
Moving On
Let's take a step back and look at what we have: As we mentioned earlier, we now have a main program (press) and a separate library that handles everything about the subscription and history information. When NetNewsWire 2 hits the big time, all we need to do is create a Press::NNW2 library, which reads the plist files and formats for that application. Someone is working on having a Press::Bloglines library to interface to that as well; by making press modular, anyone can create back ends to whatever RSS syndication service they like.
Using the same modular system, it would be easy to create a program that read in syndication data from any source and reproduced it in another format, to aid migration between RSS applications. But that's not really what I had in mind for press; I just wanted something to keep me up-to-date with the world.
TPJ