Eric has worked as a Perl programmer for more than five years and is currently working toward graduate school. He can be reached at [email protected]
Mastering Perl for Bioinformatics
James D. Tisdall
O'Reilly & Associates, 2003
396 pp., $39.95
ISBN 0-596-00307-2
Mastering Perl for Bioinformatics, James Tisdall's sequel to his earlier Beginning Perl for Bioinformatics (2001), continues and extends the approach of that book. Both books are aimed more at working biologists seeking to learn Perl than at working programmers seeking to get into bioinformatics. The new book is divided into two parts. Part I introduces data structures and object-oriented programming in Perl. Since the previous book restricted itself to procedural-style programming, this book is mostly concerned with developing the OO style used in most standard bioinformatics libraries. Part II gives one-chapter introductions to more specialized areas such as relational databases and graphics.
Part I, "Object-Oriented Programming in Perl," is written at a more elementary level than the manpages (e.g., perlbot, perltoot, perltooc) that accompany the Perl distribution, and is tailored for people who want more handholding, elaboration, and repetition of the concepts than the Perl documentation itself provides. Also, the simple fact that all the examples are motivated by problems in bioinformatics makes this discussion much easier to follow than the generic Foo.pm sort of examples used in the Perl documentation.
Tisdall opens with a discussion of modules and CPAN, developing a small sample module, Geneticcode.pm, which translates a string of DNA into its corresponding protein. (The genetic code proper is implemented as a hash.) Then he introduces data structures by developing an approximate string-matching algorithm (suitable for finding mutated versions of the same gene) that uses two-dimensional arrays. This is a beautiful algorithm; the use of references is explained carefully, and the bioinformatic motivation is clear.
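For readers who have not met this kind of algorithm, here is a minimal sketch of approximate matching by dynamic programming over a two-dimensional array (an array of array references). It is my own illustration under the usual edit-distance assumptions, not code from the book; the subroutine name best_approx_match and the unit cost for every edit are assumptions made for the example.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Return the smallest number of edits (substitutions, insertions, deletions)
# needed to turn $pattern into some substring of $text.
sub best_approx_match {
    my ($pattern, $text) = @_;
    my @p = split //, $pattern;
    my @t = split //, $text;

    # $d[$i][$j] holds the fewest edits needed to match the first $i pattern
    # characters against some substring of the text ending at position $j.
    my @d;
    $d[0][$_] = 0  for 0 .. scalar(@t);   # a match may begin anywhere in the text
    $d[$_][0] = $_ for 0 .. scalar(@p);   # pattern characters with no text cost 1 each

    for my $i (1 .. scalar(@p)) {
        for my $j (1 .. scalar(@t)) {
            my $cost = $p[$i - 1] eq $t[$j - 1] ? 0 : 1;
            $d[$i][$j] = min(
                $d[$i - 1][$j - 1] + $cost,   # match or substitution
                $d[$i - 1][$j] + 1,           # pattern character left unmatched
                $d[$i][$j - 1] + 1,           # extra character in the text
            );
        }
    }

    # The last row is an array reference; the best match may end anywhere.
    return min(@{ $d[-1] });
}

my $edits = best_approx_match('ACGT', 'TTAGGTTT');
print "$edits\n";   # prints 1 (AGGT differs from ACGT by one substitution)
```

Each row of @d is itself an anonymous array, which is why a careful treatment of references matters before the algorithm can be written down at all.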
Chapter 3 is the heart of Part I, and we are treated to three successive versions of a Gene object, each using (and thoroughly explaining) a successively larger set of Perl's OO features. What we've learned is immediately put to work in the next chapter in the development of a class that will read in genes (and write them out again) in any of several different standard formats: a very familiar bioinformatics problem of the last decade or so. Some of the choices in class design seem strange to me. It is not clear why Tisdall makes the ">" write-mode flag (an invariant feature of Perl syntax) a class attribute called "_writemode" in some of the file access classes, but perhaps Tisdall feels this improves readability for novices. His explanation of the use of closures, on the other hand, is the clearest and most careful I have seen so far.
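To give a flavor of why closures matter in this style of Perl, here is a minimal sketch of a class that keeps its class-wide data in a closure. It is an illustration of the general technique rather than the book's own Gene class; the attribute names (_name, _dna) and method names are assumptions chosen for the example.

```perl
use strict;
use warnings;

package Gene;

# Class data held in a closure: $_count is visible only to the two
# subroutines defined inside this block, so no outside code can alter it.
{
    my $_count = 0;
    sub get_count   { return $_count }
    sub _incr_count { return ++$_count }
}

# Constructor: takes named arguments and returns a blessed hash reference.
sub new {
    my ($class, %arg) = @_;
    my $self = bless {
        _name => defined $arg{name} ? $arg{name} : 'unknown',
        _dna  => defined $arg{dna}  ? $arg{dna}  : '',
    }, $class;
    $class->_incr_count;
    return $self;
}

# Read-only accessors for the object's attributes.
sub get_name { return $_[0]{_name} }
sub get_dna  { return $_[0]{_dna} }

package main;

my $g1 = Gene->new(name => 'Aging', dna => 'AACCGGTT');
my $g2 = Gene->new(name => 'Longevity');
my $n  = Gene->get_count;
print "$n genes created\n";
```

Because the counter lives in a lexical variable rather than a package global, the only way to change it is through the class's own methods.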
Finally, a restriction enzyme database system is developed. This chapter may be very challenging for programmers who are new to biology because many of the biological terms discussed are explained only very briefly, if at all. Classes are developed both to implement the restriction enzyme database and to map the binding sites (cutting points) for various restriction enzymes on any given gene. The Rebase system extends similar software that was developed and discussed in the earlier book, but this time around, everything is done according to standard object-oriented techniques.
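For programmers unfamiliar with the biology, the core of restriction mapping can be sketched in a few lines: a recognition site, possibly written with IUPAC ambiguity codes, is turned into a regular expression and located in a DNA string. This is a simplified procedural illustration, not the book's object-oriented Rebase code; the subroutine names site_to_regex and find_sites are hypothetical, and only one strand is searched.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes mapped to regular-expression character classes.
my %iupac = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[AG]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[CG]',  W => '[AT]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Convert a recognition site such as GTYRAC into a compiled regex.
sub site_to_regex {
    my ($site) = @_;
    my $re = '';
    for my $code (split //, uc $site) {
        die "unknown nucleotide code '$code'\n" unless exists $iupac{$code};
        $re .= $iupac{$code};
    }
    return qr/$re/;
}

# Return the 1-based start positions of every occurrence of the site.
sub find_sites {
    my ($site, $dna) = @_;
    my $re = site_to_regex($site);
    $dna = uc $dna;
    my @positions;
    # The lookahead matches without consuming, so overlapping sites are found.
    while ($dna =~ /(?=$re)/g) {
        push @positions, pos($dna) + 1;
    }
    return @positions;
}

# EcoRI recognizes GAATTC.
my @hits = find_sites('GAATTC', 'AAGAATTCGGGAATTCTT');
print "@hits\n";   # prints 3 11
```

A real mapping class would also record where within the site the enzyme cuts and would search the reverse complement for sites that are not palindromic.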
Part II deals with four topics: RDBMS, web programming, graphics programming, and the installation and testing of the modules of the Bioperl project (bioperl.org). The database chapter gives a very brief introduction to RDBMS concepts, touching on the various levels of normal form and database design. The examples all use MySQL. Perl's DBI modules are explained and their use demonstrated. Finally, the Rebase system is extended by moving its originally DBM-file-based data into an RDBMS.
The web-programming chapter explains the benefits of web-based user interfaces (which are extremely widespread among bioinformatics applications) and then introduces the CGI module. Tisdall again extends the Rebase system, equipping it with a web-driven front end. The chapter on graphics programming introduces the GD.pm module and extends Rebase by developing code to draw restriction maps. It is disappointing that the book shows none of the restriction maps produced by this code, although it does reproduce some ordinary spreadsheet-type graphs produced by the GD-based code. Finally, we get a very detailed look at the process of installing and testing the Bioperl modules from CPAN.
This final chapter shows all the warts of working with new open-source code, which may often have bugs in the documentation and inadequate tutorials. Tisdall does an excellent job of detailing the process of debugging a new installation, correcting bugs in tutorial code, and identifying missing pieces that you can get by without. At the end of the chapter, Tisdall explains the use of an invaluable debugging function of the bptutorial.pl script, which gives you a list of all methods invoked by any given Bioperl method, and names and locations of the Bioperl modules in which each of those methods is defined. This chapter on Bioperl will be very helpful for anyone who has ever been stymied in attempting to use CPAN to install a large and complex set of modules, as several of the most common problems are given detailed examples, along with a thorough account of how to work through to the solution of each one.
The book concludes with two appendices: a summary of Perl and a guide to installing Perl on various platforms. The summary is excellent, using bioinformatics-motivated examples throughout. It often gives examples that work together well and concisely illustrate aspects of Perl that I hadn't noticed as clearly before. The book is marred by a few typographical errors of the sort that might be especially confusing for beginners (e.g., the mysql prompt is occasionally shown as "mysql&gt;"). I expect these will be fixed by the second printing, and despite them, the book is well worth reading.
This book is a very good way for someone who has been doing bioinformatics with traditional procedural programming to get started with object-oriented Perl, and with the development and use of OO module libraries. It also gives a good first-hack approach to web programming and RDBMS programming, although any bioinformaticist needing to do more extensive work in these areas should seek further information elsewhere. Some of the explanations (in particular, the discussion of closures and some parts of the Perl summary in Appendix A) are clearer than I have seen elsewhere. I would heartily recommend this book to any working biologist wanting to master OO-style Perl (although it would be good to start with the predecessor volume if you're completely new to Perl), and would recommend it with only a few reservations to experienced Perl programmers wanting to start working with bioinformatics problems. Most of these reservations stem from the book's detailed explanations of things that need no explanation for the latter audience, combined with its vagueness about the workings of, say, reading frames and restriction enzymes. Several of these biological concepts are explained in the predecessor volume, but that volume is going to seem even more elementary (from the programming point of view) to experienced Perl programmers.
TPJ