Debbie teaches Perl and assembly at Monash University in Australia. She can be reached at [email protected].
Ifind it quaint, bordering on humorous, hearing North Americans complain about television. If you want to experience truly paleolithic TV, come to Australia, where series are picked up years late, episodes are seldom rerun, and shows disappear for nine or more months at a time, only to return unexpectedly in a different time slot. And that's for the popular shows. The series I like to watch (i.e., geek shows) I'm lucky to find at all. I still can't hear the final credits for Stargate SG-1 without dreading the inevitable voiceover announcing that That Was The Final Episode For Now, and that Stargate Will Return Later This Century.
So I did the natural geek thing and found episode listings at http://www.epguides.com/ for the series I cared about, printed them out, and kept them in a binder under the coffee table. It was a great reference because not only did it give cast lists and episode titles, but it also listed air dates, so in my more morbid moments I could, for instance, look up the fact that Deep Space Nine finished airing on June 2, 1999, yet it's still going here, assuming it ever reappears.
Meanwhile, about a year ago in a desperate bid to get organized, I got a Palm PDA. One of the first things I did was buy a database program for itthinkDB (since bought out by DataViz and rebranded as SmartList To Go)to catalog my collections.
It's probably taken you seconds to make the connection, but I didn't figure it out until recently: Why not use the Palm to keep track of the TV shows?
The Plan
Manual entry was out of the question: Between my partner and me, there are about 20 shows we care about, and most of them are of the Gunsmoke kind and hundreds of episodes long. Besides, I'd still have to visit epguides.com every so often to update the lists. What I really wanted was a program that would write the database for me, and update it whenever I felt the need.
Needless to say, this program had to be written in Perl. Not because Perl is inherently a better language, but because all the necessary modules already existed in Perl. This was more important than usual, because Palm databases are messy to work with.
Palm databases deserve a bit of explanation because they are quite foreign compared with normal files. First, all files on a Palm (even programs) are databases. Second, Palms are historically stingy when it comes to memory capacity. Third, although databases are composed of records, database records don't have any predefined internal structure, such as fields; that's up to the application. Together, these facts mean that every Palm application has its own unique, compact, and incomprehensible record format. This goes double for thinkDB, which tries to implement its own unique, compact, and incomprehensible database format on top of Palm's built-in one.
Luckily, there is a Perl module for manipulating these databases: Palm::PDB. This module reads a PDB file (a flat-file image of a Palm database) and breaks it up into more obvious pieces (doing things like putting its records into an array). If you have a helper class, which is a subclass of Palm::PDB with some required methods, then you might even be able to get hashes containing some internal record structures like fields. Helper classes exist for the Palm PIM databases such as Palm::Address, Palm::Memo, Palm::ToDo, and, fortunately for me, Palm::ThinkDB.
With Palm::ThinkDB, I can open up a database file (obtained from a prior HotSync or using the handy program pilot-xfer), rummage around in its records, add data to fields or even add entire records, and save the resulting database back to a new file (which I would subsequently have to install back onto the Palm using pilot-xfer or another HotSync). Everything in between is merely screen scraping.
Implementation
The first incarnation of my program, available at http://www .csse.monash.edu.au/~debbiep/palm/update-episodes/, did exactly that. The aforementioned screen-scraping consists of downloading web pages from epguides.com and parsing them to obtain series and episode names, cast lists, airdates, numbers, and (I was feeling ambitious) episode synopses. It's quite a straightforward use of the LWP::UserAgent/HTML::Parser combination. Ultimately, I was left with a database file that I could then send back to my Palm.
Actually, not one database file, but two. Keeping track of TV series and episodes is a two-level task because each series can have many episodes. In relational-database talk, there's a one-to-many relationship between the Series table and the Episode table. A thinkDB database can link to another using a foreign key, which is how an episode knows which series it belongs to. An example is presented in Figure 1. Each series has a scheme, which says how to go about obtaining episode details, and a URL, which is used by the scheme to locate series information. They also have some long text fields, one for a list of the show's regular cast (if it can be found) and one to satisfy any copyright or terms of use that the web site might have (Epguides, for instance, says that you have to retain the page's copyright notice). The series' IDs are assigned automatically by thinkDB. Episodes have titles, airdates, production numbers used by the studio, sequence numbers (to ensure that the episodes can be ordered properly), the season they belong to, and episode numbers within that season. They also have synopses, if the series is marked as wanting them, and checksums, which are MD5 digests of all of the other episode fields concatenated together; it's a quick way of determining when an episode's details have changed and need updating. If either the series or episode databases have extra fields, the program leaves them untouched. In this way, I can mark whether I've seen an episode, or give it a numeric rating or a note.
The handler for the scheme I factored out into its own module to facilitate development of new ones if, say, Epguides shuts down or blocks robots or doesn't list a series I care about. The code for the Epguides handler is kept in Scheme/Epguides.pm. Most of it is pretty mundane. Perhaps the most interesting part is guessSeriesURL (line 80 onward), which tries to find the start URL for a series given its name. This is tricky because you want The Practice to match just "Practice," and "Practice, The". The approach I take is to strip the words "and," "an," "a," and "the," all punctuation and spaces and capitalization before looking for one being a substring in the other. It gets a few false positives (several hundred for ER!) but usually it's spot on.
The HTML parsing code in Scheme/Epguides.pm is a fairly typical use of HTML::Parser. It is just robust enough to handle the brownian motion in the HTML format used by Epguides. The parseEpisodes function (line 552 onward) doesn't use HTML::Parser because episode titles are encased in a <pre> section and a regular expression offers finer control.
All of the grunt work of manipulating the database is in the update-episodes program, which is available in full at http:// www.tpj.com/source/. The real business starts at line 143, where the series database is loaded. (See Listing 1; assume $conduit == 0 for the moment, so we skip right to the else statement.) Each series is sent through the processSeriesDBRecord mill, and the resulting database is written out to a new file. The corresponding code for the episode database starts from line 196 (see Listing 2). First, existing episodes are processed, and then any new episodes that might have been found are appended to the database (line 215). Finally, the episode database, too, is written back to a new file.
The processSeriesDBInfo function mostly performs some sanity checks to ensure that all the fields the program needs are present in the database. It also sets some mappings so that the rest of the program knows that, for instance, the "Cast" field is column number 6 in the thinkDB database. This function is also where the handlers for schemes are loaded. processEpisodeDBInfo is analogous to the episode database.
processSeriesDBRecord's job is to update the fields for one series. Mostly, it's a matter of inserting whatever the scheme handler methods returned into the record's fields, and remembering the episode data for when it's time to process the episode database. Most of the function is devoted to dealing with the situation where the user has added a new series to the database on the Palm, and not specified a scheme or URL. In this case, update-episodes prompts for appropriate values, asking the scheme handler module to guess the URL, and stores the user's responses in the record for future reference.
processEpisodeDBRecord updates information about an episode. By now, all of the web accesses have been done, so this function merely has to compare the checksum stored in the episode's record with the checksum computed earlier (line 463), and update the record's fields if any episode data has changed.
The Result
The screenshots in Figures 2, 3, 4 and 5 show the series listing, one series in detail, an episode listing, and one episode in detail. The exact appearance of these screens is up to the user, since update-episodes doesn't meddle with the list and form layouts.
Automating
If you're a Palm user, you're probably aghast at the thought of manually having to fetch two files from the Palm, run update-episodes, and install the new files back to the Palm. Palm users have always been able to put their PDA in the cradle and simply press the HotSync button, and have everything on both the palmtop and desktop computers automatically update through the magic of conduits.
A conduit is a program that runs on the desktop. It knows how to communicate with the Palm during a HotSync and it knows how to update a Palm database (and the corresponding desktop database) so that they are synchronized. Because each application has its own database record format, each application needs its own conduit, too.
While conduits run on the desktop, they are able to instruct the Palm to perform some simple database operations, such as sending a database record to the desktop for analysis, or deleting a record. It's an idea similar in spirit to Remote Procedure Calls (RPC), though the protocol (in two layers, called DLP and SPC) differs in its details. PalmSource (the company) provides DLP APIs for conduits in C, C++, Java, and COM, but any language that can talk to a serial port could do the same. Naturally, someone has already written a Perl API and released it as part of a package called, amusingly, ColdSync.
ColdSync offers several ways of developing conduits, depending on how complex your needs are. At the simplest level, you don't even need to know any of the DLP API, and you can just work on PRC files with Palm::PDB and a helper class, as I've already done. This won't work for update-episodes for two reasons: First, two databases are involved, and the ColdSync generic conduit can work with only one database at a time; second, this is an asymmetric sync because most data needs to be overwritten by the desktop (with its Internet connection) and some data needs to be retained as the Palm currently has (such as the "Seen" field). For a conduit this complex, the only solution is to use the DLP API, the most important functions of which are found in the ColdSync::SPC module.
Fortunately, the current CVS builds of ColdSync come with a new module, ColdSync::PDB, which wraps up DLP calls so that I can (mostly) use the Palm::PDB and Palm::ThinkDB methods that I'm already familiar with. This is what I use in update-episodes. (A note of warning: I had to apply a small patch to ColdSync::PDB to make this work. It may be that the final released ColdSync::PDB API differs, so check http://www.csse.monash.edu.au/~debbiep/ palm/update-episodes/ for news.)
When ColdSync invokes a conduit, it sets $ARGV[0] to "conduit"update-episodes uses this to determine whether to run as a conduit or in standalone mode. The conduit-aware code differs only in small ways from the standalone version, as you can see by comparing the if/else blocks in update-episodes' main code. Mostly, it comes down to a slightly different API, but the other difference is that a conduit is running while connected to the Palm, so that some things that the standalone code must do in a big chunk (such as writing all episode records to the output file) can be done piecemeal by the conduit (which is why update-episodes needs to know if an episode has changed since the last sync; unchanged records don't need to be written again).
So now, once a week or so, I can fire up ColdSync on my desktop, press the HotSync button on my Palm's cradle, and sit back and watch all of my TV series update. For the 38 series and 3700-odd episodes currently in my database, this takes a couple of minutes. With the time I saved, I ought to be able to catch up on more TV shows, assuming, that is, that I can find any of them on Australian TV stations.
TPJ
107 # Work on the series database. 108 { 109 if ($conduit) 110 { 111 my $seriesDBH = ColdSync::PDB->Load($seriesDBName, "rwx"); 112 if (!defined $seriesDBH) 113 { 114 die "500 Can't load series DB"; 115 } 116 117 # Probably shouldn't do this, but how else? 118 my $seriesDB = $seriesDBH->{helper}; 119 120 # Load all of the database's records into helper class. 121 # Normally this'd be pretty inefficient, but we have to 122 # load them all anyway, so let's get it over with. 123 my @records = $seriesDBH->records(); 124 125 # Go through the series database field names and determine which is 126 # which. 127 processSeriesDBInfo($seriesDB); 128 129 # Process each series record and write it back. 130 foreach my $series (@{$seriesDB->{db_records}}) 131 { 132 processSeriesDBRecord($series); 133 # Always write the new record. 134 $seriesDBH->writeRecord($series); 135 } 136 137 # Undefine the database handle to close it. 138 undef $seriesDBH; 139 undef $seriesDB; 140 } 141 else 142 { 143 my $seriesDB = Palm::PDB->new(); 144 $seriesDB->Load($seriesDBName.".pdb"); 145 146 # Go through the series database field names and determine which is 147 # which. 148 processSeriesDBInfo($seriesDB); 149 150 foreach my $series (@{$seriesDB->{db_records}}) 151 { 152 processSeriesDBRecord($series); 153 } 154 155 # Write the episode database, now finished with it. 156 $seriesDB->Write("new_" . $seriesDBName . ".pdb"); 157 undef $seriesDB; 158 } 159 }Back to article
Listing 2
162 # Now load the episode database. 163 { 164 my $episodeDBH; 165 my $episodeDB; 166 167 if ($conduit) 168 { 169 $episodeDBH = ColdSync::PDB->Load($episodeDBName, "rwx"); 170 if (!defined $episodeDBH) 171 { 172 die "500 Can't load episode DB"; 173 } 174 # Probably shouldn't do this, but how else? 175 $episodeDB = $episodeDBH->{helper}; 176 177 # Load all of the database's records into helper class. 178 my @records = $episodeDBH->records(); 179 180 # Go through the series database field names and determine which is 181 # which. 182 processEpisodeDBInfo($episodeDB); 183 184 # Process each episode record and write it back. 185 foreach my $episode (@{$episodeDB->{db_records}}) 186 { 187 if (processEpisodeDBRecord($episode)) 188 { 189 # episode has changed, so write the record. 190 $episodeDBH->writeRecord($episode); 191 } 192 } 193 } 194 else 195 { 196 $episodeDB = Palm::PDB->new(); 197 $episodeDB->Load($episodeDBName.".pdb"); 198 199 # Go through the episode database field names and determine which is 200 # which. 201 processEpisodeDBInfo($episodeDB); 202 203 # Process known episodes first. 204 foreach my $episode (@{$episodeDB->{db_records}}) 205 { 206 # Don't care about whether record has changed, so ignore return 207 # value. 208 processEpisodeDBRecord($episode); 209 } 210 211 }Back to article