Archiving and Compression with CGI
By Randal L. Schwartz
Tens of thousands of compressed tar archives are probably out there on the Web for downloading. I can't imagine how much time it would take to download them all, even on a high-speed connection. A major problem with compressed archives, whether tar, zip, or otherwise, is that even if the publisher of the information has carefully bundled only the most important files, sometimes I really want only part of the data. However, I'm forced to download the entire thing (perhaps over a slow connection in a hotel room, as I often am), only to discard the parts I don't want.
The data publisher should provide a way to build a custom compressed archive, with only the files or directories that I choose. That's what we have this month: a CGI program that lets a user choose individual distribution files on an item-by-item basis. Once the user has made his or her choices, a custom tarred and gzipped file (.tar.gz) is served up for the user to download. The code for the script that does this is in Listing 1.
As a side note, this month's column idea was suggested by fellow Stonehenge Perl trainer and Usenet poster extraordinaire, the one and only Tom Phoenix ([email protected]). It's based on similar code he wrote for me to handle the downloads of exercise data and answers for our on-site training classes. I wrote the code in Listing 1 from scratch, though, so if it isn't exactly what he was suggesting, that's my fault.
Let's get started. Normally, line 1 of the listing would have had the preferred -Tw flags after the path to Perl, but I ran into some unavoidable problems with both taint mode and warnings enabled. First, the standard File::Find module is not "taint safe," so that's a loser. (I think this was corrected in Perl 5.6, but I haven't started using that version on my production site yet.) Second, I'm using two variables from File::Find, and with warnings enabled, I get the ugly "used only once" warning, which is annoying at best.
Line 2 turns on compiler restrictions, forcing me to declare my variables (no defaulting to package globals), and disabling both barewords (no "Perl poetry mode") and symbolic references (no variables that contain other variable names).
Line 3 unbuffers the STDOUT output stream, ensuring that any CGI header I've generated appears before any program I fork. Forking is necessary here to launch the UNIX tar program.
Line 5 pulls in the standard CGI::Pretty module, which has the same interface as the basic CGI.pm module, but generates nicely indented HTML. It's a little slower to run, but this program generates a fairly small amount of HTML, and I wanted to be able to read it more easily. The :standard import generates the function shortcuts, rather than requiring us to use the object-oriented interface, which seems to involve a lot more typing for not a lot of real gain.
Line 6 pulls in one of my favorite modules, CGI::Carp (also found in any recent standard Perl distribution). Here, I'm redirecting any fatal runtime errors to the browser rather than having to hunt around for them in the server's error log. Please note that this is a potential security hole, as it reveals sensitive information to any random user out there on the Net. So don't use this in production code (but you aren't supposed to be using my programs as-is for production, anyway).
Line 7 sets the PATH environment variable to something that doesn't trip up tainting or permit additional security holes. Note that the tar utility needs to be found in one of these directories.
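Pulling lines 1 through 7 together, the top of the script probably looks something like this sketch (the exact PATH value and comment text are my assumptions, not a quote from Listing 1):

    #!/usr/bin/perl
    # -Tw deliberately omitted; see the discussion above
    use strict;

    $|++;                                  # unbuffer STDOUT before forking tar

    use CGI::Pretty qw(:standard);         # CGI.pm interface, indented HTML
    use CGI::Carp qw(fatalsToBrowser);     # development only: leaks error detail
    $ENV{PATH} = "/usr/bin:/bin";          # assumed value; must include tar's directory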
You'll find the only configurable part of this code in line 11. Here, I'm specifying the directory in which I'm storing the distribution files. Subdirectories below it define particular distributions, and must not begin with a dot or a dash. (So a directory named in Morse code would definitely be forbidden.) As a security precaution, the script will not list symbolic links, whether to directories or to files, so your data must actually live below this directory rather than merely being linked in.
Line 15 sets the current working directory to the top-level directory, dying if it can't. That death merely triggers the CGI::Carp module to spit the error message out to the browser. In production code, this death should send a simple, innocuous "something broke" message to the browser, and should write a detailed explanation to a log file or send one via email to the Webmaster.
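In code, lines 11 and 15 might look like this; the directory path here is hypothetical:

    my $DIR = "/home/merlyn/Web/Distributions";   # hypothetical top-level directory
    chdir $DIR or die "Cannot chdir $DIR: $!";    # death goes to the browser via CGI::Carp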
The rest of the program is upside down, in the sense that the server eventually runs the script three times, and the lines of code for the first pass are at the end of the listing. So I'll describe it from back to front.
First Pass
The first time the script is run on the server, it simply generates a list of available archives and sends the resulting HTML back to the user's browser.
Line 60 generates the HTTP header, and the HTML header (including the <title> definition).
Lines 63 through 65 locate the distributions. We read the dot directory, meaning our current directory, looking for names that don't begin with a dot or a dash and are directories (but not symbolic links). For consistency, the resulting names are sorted, since the readdir function returns them in an unpredictable order.
Lines 66 to 69 generate an HTML form, using the radio_group function to output the radio buttons. Line 66 generates the actual form tag, with an action equal to the URL of this script. By default, the method is POST. Line 67 generates a single-column table containing a group of radio buttons. The user selects one of the buttons and clicks on the submit button created by the code in line 68.
One fun feature of the CGI.pm module is that the values used in line 67 for the @names list are automatically HTML-entity escaped, meaning that any less-than signs (<) are converted to their proper HTML entities (&lt;), and so on. Of course, the browser undoes that escaping as the data comes back the other way (from &lt; back to <), so it doesn't matter if the directory names have odd characters in them. For testing, I used a name that contained both less-than signs and spaces, and it worked just fine. Thanks, Lincoln Stein, for CGI.pm!
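Putting the whole first pass together, the code might look roughly like this sketch (the title and button label are my guesses):

    print header, start_html("Select a distribution");

    opendir DOT, "." or die "Cannot opendir .: $!";
    my @names = sort grep { !/^[-.]/ and -d and not -l } readdir DOT;
    closedir DOT;

    print start_form,        # defaults to a POST back to this same script
          radio_group(-name => 'dist', -values => \@names, -columns => 1),
          submit("Select distribution"),
          end_form, end_html;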
Second Pass
So, once the user selects a distribution and submits the form, we return to the script to process the submitted data. During this second pass, the code in lines 26 to 56 is invoked. The second pass is detected by a true value in the $dist variable created in line 28.
Line 29 examines the value of $dist to ensure that the user picked a valid distribution. Even though we give the user a choice of valid directory names, we must distrust the value returned, because it would be trivial for a malicious person to fake any return value to the script, possibly gaining access to otherwise restricted files. The first check ensures that the name doesn't start with a dash or a dot and doesn't contain any slashes. It also has to be a directory: not a symbolic link to a directory, but an actual directory.
If that's all OK, we copy the value of the $1 variable into the $dist variable, to untaint it. I did this before I had to turn off taint checking because of File::Find's bad behavior.
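Together, the check and the untainting might look something like this sketch (the exact regular expression is my reconstruction):

    my $dist = param('dist');
    if (defined $dist and $dist =~ m{^([^./-][^/]*)\z}
        and not -l $1 and -d $1) {
      $dist = $1;                  # copying the capture untaints the value
    } else {
      die "bad distribution name";
    }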
Speaking of which, we pull in the File::Find module in line 32. I do this as a require statement, instead of use, so that I don't load a lot of unused code on the first and third passes. The downside of this is that the File::Find::find function isn't imported, so I have to call it explicitly in line 34. Line 33 sets up the @names list, stuffed with appropriate names in line 37. (Sorry for the forward and backward references there, but that's how they match up.)
So, the find function is called in lines 34 through 38. The function looks at all the pathnames below $dist, which is in turn below the current directory. Line 36 forces any names that begin with a dash or a dot to be ignored, and ensures that subdirectories beginning with a dash or a dot are not examined. We achieve the latter by setting the $File::Find::prune variable to true, which tells the find function not to descend into that directory. Line 37 puts the full pathname (relative to the top-level directory) into @names if it's a file and not a symbolic link.
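Here's roughly what that find call could look like; the variable names follow the descriptions above:

    require File::Find;
    my @names;
    File::Find::find(
      sub {
        return $File::Find::prune = 1 if /^[-.]/;   # skip dot/dash names entirely
        push @names, $File::Find::name if -f and not -l;
      },
      $dist,
    );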
Once we have the names, we start creating the CGI response in line 39. Line 42 begins the form, generating a self-referencing URL with a slight twist. If the CGI script is invoked as /cgi/getdist, we set the action URL to:
/cgi/getdist/nnnnnnnnn.tar.gz
The n's represent a numeric value based on the epoch time, a number that increases by one each second. It is currently nine digits long as I write this, and will roll over to ten digits in early September 2001.
The script ignores this trailing name on input, but when the invocation generates the compressed tar archive, the browser will likely save the download under this name (or at least offer it as the default). The timestamp makes the name effectively unique, so it's unlikely to conflict with any other file in the user's download directory.
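Building that action URL is a one-liner, something like this (url() comes from CGI.pm's :standard import):

    # e.g. /cgi/getdist/987654321.tar.gz; the trailing name is ignored on input
    print start_form(-action => url() . "/" . time . ".tar.gz");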
Next, lines 44 to 52 display a table (used for layout), headed by column titles (typically rendered in bold) for the name, size, and last-modification date and time. Each name from the @names list is passed through a map to generate one row. Each row gets a checkbox with the same parameter name but a different return value. The checkboxes default to being selected, thanks to the 1 in line 49. Line 50 computes the file size using the -s file test operator, and line 51 converts the modification time to a human-readable string.
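A sketch of that table, using CGI.pm's HTML shortcuts; the column labels are my invention:

    print table(
      Tr(th(['Include?', 'Name', 'Size', 'Last modified'])),
      map {
        my $size  = -s $_;                      # file size in bytes
        my $mtime = localtime((stat _)[9]);     # reuse the stat from -s above
        Tr(td(checkbox(-name => 'names', -value => $_,
                       -checked => 1, -label => '')),
           td(CGI::escapeHTML($_)), td($size), td($mtime));
      } @names
    );
    print submit("Build my archive"), end_form, end_html;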
Final Pass
When the user clicks on the submit button generated by line 53, we come back to this same script for the final pass, which generates the tarred and gzipped archive. Lines 19 through 24 handle this pass. It's by far the easiest one: we merely extract the @names from the selected checkboxes, dump out an appropriate MIME header (line 20), verify that the names aren't trying to select the /etc/passwd file or anything else scary (line 21), and then let the tar program do all the hard work. I'm presuming you're using a GNU tar here, which can take a z flag to handle the gzip compression. This step gets slightly more complicated if you have to pipe the output of a non-GNU tar through a separate gzip, but you can always just forgo the compression.
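The final pass might look like this sketch; the MIME type, the detection test, and the exact revalidation pattern are all assumptions on my part:

    if (my @names = param('names')) {
      for (@names) {                 # revalidate every requested name
        die "bad name: $_"
          if m{^[-./]} or m{/[-.]} or -l or not -f;
      }
      print header(-type => 'application/x-gzip');  # assumed MIME type
      exec "tar", "czf", "-", @names;               # GNU tar: z means gzip
      die "Cannot exec tar: $!";
    }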
If you're feeling quite adventurous, you can use the Archive::Tar and Compress::Zlib modules (both found on CPAN) to generate the compressed archive without calling an external program. (Perhaps I'll do that in a future column.)
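For the curious, an untested sketch of that pure-Perl approach; check your installed versions' documentation, since these interfaces have shifted over time:

    require Archive::Tar;
    require Compress::Zlib;

    my $tar = Archive::Tar->new;
    $tar->add_files(@names);                      # reads each file into memory
    print Compress::Zlib::memGzip($tar->write);   # write() with no arguments
                                                  # returns the tar image as a string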
And that's all there is to it! To start making your custom-selected compressed tar archives, stick your distribution files below the configured top-level directory, and link to the CGI URL from some convenient page.
The technique of generating a compressed tar archive on the fly can also be applied to a "shopping cart" strategy. You can let users select files from different sections of your Web site, and maintain the list either as hidden fields in the forms on the client side, or via some session-ID technique on the server side. When you're ready to generate the archive, be sure to invoke the URL to this script with the appropriate extra path information so that the download name is set appropriately. Be sure to revalidate all the requested names; don't let bad guys grab arbitrary files this way.
Tom Phoenix got famous (again) by suggesting this month's column idea. If you have some snippet of an idea that can be handled by 30 to 300 lines of Perl, drop me a note. If I use your idea, maybe you'll be famous! Until next time, enjoy.
Randal has coauthored the must-have standards Programming Perl, Learning Perl, and Effective Perl Programming. You can reach him at [email protected].