Managing Your MP3 Library in Perl
The Perl Journal July 2003
By Luis E. Muñoz
Luis is an Open Source and Perl advocate at a nationwide ISP in Venezuela. He can be contacted at [email protected].
These days, the practice of backing up one's music CDs as MP3 (or OGG) files has become widespread. I own a few hundred CDs that I've purchased over the years, and a few of them have become coasters through a complex process that involves the passage of time, rough surfaces, and dirt. However, keeping a few gigabytes' worth of MP3 files on my laptop's hard drive is not the best way to back up this material. It is also not cheap. When I decided to write the tool I shall describe in this article, I had about 6 GB of MP3 files. I know people who have much more than that, though.
The solution, of course, was to burn a few CD-ROMs with the MP3 files to make room on my laptop's drive. After all, there's no point in carrying eight days of nonstop music with you unless you're a DJ.
Overview
I wanted a database to hold information about the ID3 tag of each song as well as its location in my backup library. This would allow me to quickly locate, say, all the songs from a given artist. Sometimes I find errors in the tags, so I would like my fixes to be incorporated in the library automatically.
I also wanted the process to be simple and ideally, integrated with the Mac OS X environment on my PowerBook, which is where I most wanted to use this tool. I wanted to simply insert a blank CD-ROM in my burner, fire up a command, and wait for the CD-ROM with my MP3s to be ejected. As you'll see later, I came really close.
My solution is based on the excellent File::Find module, which simplifies the task of traversing a file tree such as the one typically associated with an MP3 library. For managing operations with file names, paths, and the like, I used File::Spec and File::Path, respectively. This helps me ensure portability for my new tools. I also used MP3::Tag for reading the ID3 tags in the MP3 files. I handle the detection of changes in already archived files with Digest::MD5. The database is maintained with DB_File, Storable, and MLDBM, which allow me to conveniently store a Perl data structure in a file that can be accessed very quickly. Thanks to all these wonderful and free modules, easily found through a search in CPAN, my task became much simpler.
I decided to make multiple command-line tools for more or less specific tasks. This helped me to keep the interface simple enough to be easy to remember. At the time of this writing, there are two tools: mp3cat for copying files, adding them to the database or finding out which files are or are not archived; and mp3dump, a database reporting and backup utility that I hope will make it easier to find a particular song in the archive. I will visit each of those programs in turn and explain their inner workings. (Both utilities are available in their entirety online at http://www.tpj.com/source/.)
Access to the database is encapsulated through the use of a tied hash. Tied hashes provide a very simple model for manipulating data. Normally, all hash operations (especially keys, values, and each) work as expected with a tied hash. MLDBM allows for the storage of serialized Perl data structures, which can be easily read back when needed. To do this, MLDBM uses the help of a database module such as DB_File and a serialization module such as Storable.
mp3cat: Keeping Track of the Database
The first part of the script (Listing 1) specifies the modules to use and loads warnings and strict, which should always be used because they help trap hard-to-find errors such as misspelled variables. After this come the declaration of variables, the specification of command-line options, some error checking of those options with proper error responses courtesy of Pod::Usage, and the specification of the default database name.
Because the database will be accessed through a tied hash, at line 132, I declare the hash and make sure it is empty to begin with. To prevent abrupt interruptions from corrupting the database, lines 134 to 138 show a signal handler that kicks in when the user interrupts the script in the middle of its execution. This handler calls untie on the hash that is tied to the database at lines 140 and 141, to force it to close gracefully, preventing corruption. To be safe, I assume this might not be enough protection, and I regularly back up the database just in case. In the worst case, rebuilding the database is a matter of reinserting all the library CD-ROMs, but it is so much faster to simply copy a file to another directory.
132 my %db = ();
133
134 $SIG{INT} = sub
135 {
136     untie %db;
137     die "User requested interruption\n";
138 };
139
140 tie %db, 'MLDBM', $opt_d, O_CREAT | O_RDWR, 0666
141     or die "Failed to tie database $opt_d: $!\n";
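The interrupt-handling pattern can be tried in isolation. What follows is a minimal sketch, not the code from mp3cat: it replaces the tied database with a plain flag ($closed, a hypothetical stand-in for the untie call) and sends itself SIGINT with kill, assuming a Unix-like system such as Mac OS X. The eval is there only so that the demonstration can inspect the message instead of exiting.

```perl
use strict;
use warnings;

my $closed = 0;    # hypothetical stand-in for "the database was untied"

# Same shape as the handler in mp3cat: clean up first, then die.
local $SIG{INT} = sub {
    $closed = 1;   # mp3cat calls untie %db at this point
    die "User requested interruption\n";
};

# Simulate the user pressing Ctrl-C in the middle of a long-running task.
my $ok = eval {
    kill 'INT', $$;                                  # signal ourselves
    select(undef, undef, undef, 0.1) for 1 .. 20;    # "work" that gets interrupted
    1;
};
my $message = $ok ? '' : $@;

print "cleanup ran: $closed\n";
print "message: $message";
```

Because the handler runs the cleanup before calling die, the cleanup happens regardless of whether anything catches the exception afterwards.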
Next comes the part of the code that traverses the directories specified in the command line, at lines 145 through 156. This code is quite simple, thanks to File::Find. Basically, we're requesting a recursive traversal of each path in the command line, stored in $dir. For each filesystem object found, the subroutine analyze will be called. Lines 152 and 153 show some customization we've asked for, namely following symbolic links and not changing directories during the directory traversal.
145 for my $dir (@ARGV)
146 {
147     find(
148          {
149              wanted => \&analyze,
150              # Follow symlinks and don't chdir() into
151              # each subdir
152              follow => 1,
153              no_chdir => 1,
154          }, $dir
155     );
156 }
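To see this traversal at work without touching a real music library, here is a small self-contained sketch. It builds a throwaway directory tree with File::Temp (the file names are made up for illustration) and collects the MP3 files using the same follow and no_chdir options shown above.

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(mkpath);
use File::Spec;
use File::Temp qw(tempdir);

# Build a small throwaway tree so the example is self-contained.
my $root = tempdir(CLEANUP => 1);
my $sub  = File::Spec->catdir($root, 'Artist', 'Album');
mkpath([$sub]);
for my $name ('01 - Intro.mp3', '02 - Song.MP3', 'cover.jpg') {
    my $file = File::Spec->catfile($sub, $name);
    open my $fh, '>', $file or die "Cannot create $file: $!\n";
    close $fh;
}

my @found;
find(
    {
        # Called once per filesystem object found under $root.
        wanted => sub {
            return unless $File::Find::name =~ /\.mp3$/i;
            push @found, $File::Find::name;
        },
        follow   => 1,    # follow symbolic links
        no_chdir => 1,    # stay in the current directory
    },
    $root
);

print "$_\n" for sort @found;
```

The case-insensitive match picks up both .mp3 and .MP3, while cover.jpg is skipped.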
The subroutine analyze (lines 238 to 285) is responsible for extracting the suitable data from each MP3 file. Note how we restrict the file's extension with a simple regexp at line 240, to avoid unnecessary work. However, you could remove this and use this utility to perform incremental backups of your files. The hash reference $song will be used to store all the information we can get from the song we're processing. The first information element we have is its filename, which is passed by File::Find via the $File::Find::name scalar. I store this information at line 242.
238 sub analyze
239 {
240     return unless $File::Find::name =~ qr/\.mp3$/i;
241
242     my $song = { path => $File::Find::name };
243
244     if ($opt_s)
245     {
246         my $mp3 = MP3::Tag->new($File::Find::name);
247
248         ($song->{name},
249          $song->{track},
250          $song->{artist},
251          $song->{album}) = $mp3->autoinfo;
252
253         $mp3 = undef; # Free any resources
254
255         unless ($song->{name} or $song->{artist} or $song->{album})
256         {
257             warn "$File::Find::name contains no understandable tags\n";
258         }
259
260         $song->{$_} ||= '?' for qw(name track artist album);
261     }
262
When the -s option is specified, the if block on lines 244 through 261 decodes the ID3 tag information that may lie inside the MP3 file. Otherwise, this task is skipped to save time and resources. The information is decoded via a call to the autoinfo method that MP3::Tag provides, at line 251. This will even try to derive information from the filename if no tags are found. Since no more tag-related operations will be done, I request the destruction of the MP3::Tag object at line 253 by assigning undef to the object reference. Line 260 stores a placeholder for attributes in case the data is not available.
263     $song->{size} = -s $File::Find::name;
264
265     my $fh = new IO::File $File::Find::name, "r";
266
267     unless ($fh)
268     {
269         warn "Failed to open $File::Find::name: $!\n";
270         return;
271     }
272
273     binmode($fh);
274
275     $song->{md5} = Digest::MD5->new->addfile($fh)->hexdigest;
276
277     $fh->close;
278
279     $song->{file} = (File::Spec->splitpath($File::Find::name))[2];
In line 263, I store the file length in $song->{size} using the -s operator. With the code at lines 265 to 277, I calculate the MD5 signature of the MP3 file. An MD5 signature or "Message Digest" is actually a 128-bit number that is assigned to a sequence of bytes by a series of mathematical operations. I use this to recognize when a file has been changed because of two nice properties of message digests:
- Any change in the file produces a completely different digest.
- Finding two files with the same digest is really difficult.
Note that I read the file in binary mode, as requested at line 273, to prevent differences in the treatment of newlines among different operating systems from causing unnecessary duplication.
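The digest calculation can be exercised on its own with core modules. The sketch below is an illustration rather than code from mp3cat; file_md5 and scratch_file are hypothetical helper names, and the "songs" are just short strings written to temporary files.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Return the hex MD5 digest of a file, read in binary mode.
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<', $path or die "Failed to open $path: $!\n";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Write a throwaway file with the given bytes and return its name.
sub scratch_file {
    my ($bytes) = @_;
    my ($fh, $name) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $bytes;
    close $fh or die "Failed to write $name: $!\n";
    return $name;
}

my $original = file_md5(scratch_file("fake mp3 payload\n"));
my $copy     = file_md5(scratch_file("fake mp3 payload\n"));
my $changed  = file_md5(scratch_file("fake mp3 payload!\n"));

print "original: $original\n";
print "copy:     $copy\n";
print "changed:  $changed\n";
```

Two files with identical bytes get the same digest no matter what they are called, which is why a renamed or moved song is still recognized as already archived.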
Because of the uniqueness of the MD5 signature, this is what we'll use as the key to the database, storing the whole $song hash reference. (If someone ever reports a collision, I promise to expand the key with some other data to avoid it.)
At line 279, I use the services of File::Spec to find the filename in the path name that File::Find gave us through the $File::Find::name scalar. This helps to ensure the portability of the code to operating systems that use different separators in the path names.
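As a quick illustration of what File::Spec returns, the following sketch builds a path with the local separator and then takes it apart again. The directory and file names are invented for the example.

```perl
use strict;
use warnings;
use File::Spec;

# Build a path with the local separator, then take it apart again.
my $path = File::Spec->catfile('Music', 'Artist', 'Album', '01 - Track.mp3');

# splitpath returns the volume, the directory portion, and the file name.
my ($volume, $directories, $file) = File::Spec->splitpath($path);

# splitdir breaks the directory portion into its components; the grep
# drops the empty component produced by a trailing separator.
my @dirs = grep { length } File::Spec->splitdir($directories);

print "file: $file\n";
print "dirs: @dirs\n";
```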
284 _perform $song;
Once the data has been collected, a call is made to _perform at line 284 in order to decide what to do with this particular song before going to the next in turn.
194 $song->{vol} = $opt_V ||
195     (File::Spec->splitdir
196      ((File::Spec->splitpath
197        (File::Spec->canonpath($song->{path})))[1]))[1] || '?';
The first part of _perform, on lines 194 to 197, attempts to provide a volume name, the name chosen for each CD-ROM in the library, based on the mount point. This code selects the second component of the pathname given as the destination, which tends to work well when mounting out of /Volumes, the default place where Mac OS X mounts new volumes. Of course, the volume specified through the -V command line option takes precedence.
At lines 199 through 231 (Listing 2), an if...elsif block is used to perform slightly different actions depending on the command-line options that were specified. These actions might be attempting to copy the file to a given destination through a call to _copy, adding or replacing an entry in the database with a statement such as $db{$song->{md5}} = $song, verifying whether this song is already archived with a statement such as exists $db{$song->{md5}}, or printing all the song data, as in line 229. It is vital to use exists in these checks, to prevent autovivification from adding empty entries to the database.
Whenever a reference is assigned to the tied hash %db, MLDBM will use Storable as requested at line 102, to serialize the Perl data structure referenced. The concept of serialization is very important, because it allows for data structures to be stored for later use. This is sometimes referred to as "persistence." In essence, serializing a data structure means to translate it to a representation that lacks things like pointers and references to a process's data. This is often called a flat or serial representation, thus the name. By storing the reference to a hash with all the song data, we can later access this information for other purposes.
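Serialization is easy to observe directly with Storable, outside of any database. This is a small sketch with made-up song data: freeze produces the flat representation and thaw rebuilds an independent structure from it.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my $song = {
    artist => 'Some Artist',
    album  => 'Some Album',
    name   => 'Some Song',
    track  => 3,
};

my $flat = freeze($song);    # a plain string of bytes, with no references in it
my $copy = thaw($flat);      # a brand new hash rebuilt from those bytes

# The copy has the same contents but lives elsewhere in memory,
# so changing it leaves the original untouched.
$copy->{name} = 'Renamed';

print "original: $song->{name}\n";
print "copy:     $copy->{name}\n";
```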
Our serialized data is then stored in the database by the DB_File module. This process occurs in the opposite order whenever data is read from the tied hash. It is very important to keep in mind that there is no way to track accesses to the nested structures that might live within the reference. For instance, the following code won't usually work as expected:
# wrong
$db{$my_md5}->{title} = "The lost song";
It won't work because the tie interface will fetch the entry corresponding to $my_md5 from the underlying DB_File database and Storable will make sure that a hash reference is reconstructed from the stored data. However, that referenced data, which is being modified in your process, is never being stored back in the database. MLDBM has no way to tell that the referenced data has been altered. A correct alternative is shown below. It works because the reference is fetched from stable storage, modified and then explicitly restored.
# longer but correct
my $song = $db{$my_md5};
$song->{title} = 'The lost song';
$db{$my_md5} = $song;
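The difference between the two forms can be reproduced without installing MLDBM. The sketch below uses FlatStore, a hypothetical in-memory tied-hash class written only for this illustration; it is not MLDBM, but it works on the same principle of serializing each value on the way in and rebuilding it on the way out.

```perl
use strict;
use warnings;

# FlatStore is a hypothetical, in-memory stand-in for MLDBM: every value
# is serialized on STORE and rebuilt from scratch on FETCH.
package FlatStore;
use Storable qw(freeze thaw);

sub TIEHASH { my ($class) = @_; return bless {}, $class }
sub STORE   { my ($self, $key, $value) = @_; $self->{$key} = freeze($value) }
sub EXISTS  { my ($self, $key) = @_; return exists $self->{$key} }
sub FETCH   {
    my ($self, $key) = @_;
    return exists $self->{$key} ? thaw($self->{$key}) : undef;
}

package main;

tie my %db, 'FlatStore';
$db{abc123} = { title => 'Untitled' };

# Wrong: this changes a temporary copy that is never stored back.
$db{abc123}->{title} = 'The lost song';
my $after_wrong = $db{abc123}->{title};

# Correct: fetch, modify, then store the whole record again.
my $song = $db{abc123};
$song->{title} = 'The lost song';
$db{abc123} = $song;
my $after_right = $db{abc123}->{title};

print "after nested assignment:  $after_wrong\n";
print "after fetch/modify/store: $after_right\n";
```

The nested assignment triggers only a FETCH, so the stored bytes never change; the explicit assignment at the end is what triggers STORE.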
Note that in the calls to _copy, a die follows in case of a false return value, as shown at lines 206, 215, and 223. This is useful to stop the process when a CD-R image is full, which helps make the backup process straightforward.
161 sub _copy
162 {
163     my $song = shift;
164     return 1 unless defined $opt_c;
165
166     my $dest = File::Spec->canonpath(File::Spec->catfile($opt_c,
167                                      $song->{path}));
168     my $dp = (File::Spec->splitpath($dest))[1];
I begin _copy at line 164 by checking whether a file copy destination was specified in the command line with the -c option, returning success otherwise. In lines 166 through 168, I obtain a destination path by combining whatever the user supplied in the command line with the source file path name. This is done with File::Spec in order to achieve portability to other operating systems with different file naming conventions.
170     mkpath([$dp]);
171
172     unless (copy($song->{path}, $dest))
173     {
174         unlink $dest;
175         warn "Copy error: $!\n";
176         return;
177     }
178
179     warn "$song->{path} transferred\n" if $opt_v;
180     return 1;
181 }
Once a destination path has been established, the mkpath function from File::Path is used on line 170 to ensure the existence of all the required directory components. Next, I invoke copy from File::Copy at line 172 to attempt the copying of the MP3 file from its original location to the CD image. In case of error, I remove the possibly half-copied file, log the error, and return a false value on lines 174 to 177. If the copy is successful, true is returned.
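The same sequence of steps can be tried against temporary directories. The following is a sketch under stated assumptions rather than the author's routine: copy_under is a hypothetical name, the paths are throwaway ones created with File::Temp, and, unlike _copy, it saves the error text before calling unlink so that the warning reports the copy failure and not the result of the cleanup.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Path qw(mkpath);
use File::Spec;
use File::Temp qw(tempdir);

# Copy $source below $dest_root, recreating its directory layout.
# Returns the destination path on success and nothing on failure.
sub copy_under {
    my ($source, $dest_root) = @_;

    my $dest = File::Spec->canonpath(File::Spec->catfile($dest_root, $source));
    my ($vol, $dirs) = File::Spec->splitpath($dest);
    mkpath([File::Spec->canonpath(File::Spec->catpath($vol, $dirs, ''))]);

    unless (copy($source, $dest)) {
        my $error = "$!";    # remember the reason before cleaning up
        unlink $dest;        # remove a possibly half-copied file
        warn "Copy error: $error\n";
        return;
    }
    return $dest;
}

# Throwaway source "library" and destination "CD image".
my $library  = tempdir(CLEANUP => 1);
my $cd_image = tempdir(CLEANUP => 1);

my $artist_dir = File::Spec->catdir($library, 'Artist');
mkpath([$artist_dir]);
my $source = File::Spec->catfile($artist_dir, 'song.mp3');
open my $fh, '>', $source or die "Cannot create $source: $!\n";
binmode $fh;
print {$fh} "fake mp3 payload";
close $fh or die "Cannot write $source: $!\n";

my $dest = copy_under($source, $cd_image);
print defined $dest ? "copied to $dest\n" : "copy failed\n";
```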
mp3dump: Simple Reporting
In order to support a simple mechanism for backup, restore, reporting, and general manipulation of the database, I wrote mp3dump. I will omit the explanation of the beginning of the script, as it has a lot in common with mp3cat.
138 if ($opt_r)
139 {
140     my $csv = new Text::CSV_XS;
141     while (my $line = <>)
142     {
143         last unless $csv->parse($line);
144         my @col = $csv->fields;
145         my $song = {};
146         for my $k (@keys)
147         {
148             $song->{$k} = shift @col;
149         }
150         if (! exists $db{$song->{md5}} or $opt_F)
151         {
152             $db{$song->{md5}} = $song;
153         }
154     }
155 }
The first interesting bit of code, the restore operation, appears at lines 138 to 155. Here, I read lines from STDIN or from files specified in the command line, using the diamond operator at line 141, and then use the parse method from Text::CSV_XS at line 143 to obtain the columns from a comma-separated file. A $song hash reference is populated with the columns found at lines 146 to 149, and it is added to the database at lines 150 to 153 if it is missing or if the -F option forces it.
156 else
157 {
158     for my $song (values %db)
159     {
160         if ($opt_c)
161         {
162             no warnings;
163             my $csv = new Text::CSV_XS { always_quote => 1 };
164             $csv->combine(map { $song->{$_} } @keys);
165             print $csv->string, "\n";
166         }
167         elsif ($opt_l)
168         {
169             if (@keys)
170             {
171                 no warnings;
172                 print join(', ', map { "$_=$song->{$_}" } @keys), "\n";
173             }
174             else
175             {
176                 print join(', ', map { "$_=$song->{$_}" } keys %$song), "\n";
177             }
178         }
179     }
180 }
All the other command-line options specify a report to be generated from the database, so they appear grouped together in an else statement between lines 156 and 180.
When -c is specified in the command line, the code on lines 162 to 165 generates a comma-separated report. The no warnings at line 162 is necessary to prevent warnings when a given song does not contain a column specified in @keys. The -l option triggers a simpler report, friendlier for grep, which is generated at lines 169 to 177. As you see, managing a database tied to a hash is a trivial task in Perl.
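The grep-friendly report format is simple enough to sketch with a plain array in place of the tied database. The example below uses invented records and a hypothetical helper, report_line. As a variation on the no warnings approach, it prints a '?' placeholder for any column a song lacks, so no warning is raised in the first place.

```perl
use strict;
use warnings;

my @keys  = qw(vol artist name);
my @songs = (
    { vol => 'mp3-cd-008', artist => 'Some Artist', name => 'Some Song' },
    { vol => 'mp3-cd-012', artist => 'Other Artist' },    # no name stored
);

# One "key=value, key=value" line per song; a missing column becomes '?'.
sub report_line {
    my ($song, @columns) = @_;
    return join ', ',
        map { "$_=" . (defined $song->{$_} ? $song->{$_} : '?') } @columns;
}

my @lines = map { report_line($_, @keys) } @songs;
print "$_\n" for @lines;
```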
Using the Tools
As an example, here are the commands I used the last time I created an incremental backup of my CD collection:
bash-2.05a$ cp mp3db mp3db.backup
bash-2.05a$ ./mp3cat -c /Volumes/MP3_014 -s -V mp3-cd-014 ~/Music
The cp statement makes a simple backup of the database. I always do this before starting, just in case the CD-ROM burning fails or something simply goes wrong. This saves me from having to alter the database in a more complex way.
The invocation of ./mp3cat requests that songs not in the database be copied (-c) to a directory below /Volumes/MP3_014, that new songs have their information stored in the database (-s), and that a volume name of mp3-cd-014 (-V) be used. The songs to be processed are in subdirectories of ~/Music, which is where my MP3 library resides. A few minutes later, the command terminates and I happily burn my CD-ROM. If this were to fail, I could simply issue the command:
bash-2.05a$ cp mp3db.backup mp3db
to restore my old version of the database, and start again.
Let's say that I want to load a copy of my database in a spreadsheet program to do some fancy formatting. I could do so with a command such as this:
bash-2.05a$ ./mp3dump -c > my_music.txt
I could then simply import the CSV file. This file also happens to be a backup of the database, which could be easily restored by a command such as:
bash-2.05a$ ./mp3dump -r -d restored_db my_music.txt
Another useful trick is restoring songs from my library. The following command shows how I would go about finding out where certain songs I want are backed up. Note that it would be a very good idea to install agrep alongside these tools, for tasks such as this.
bash-2.05a$ ./mp3dump -C vol,artist -l | egrep -i 'enya' | cut -f1 '-d, ' | sort | uniq -c
  21 vol=mp3-cd-008
   2 vol=mp3-cd-012
As you can easily see, these tools are very simple yet powerful enough to handle the task. I hope this discussion of these tools gives you a valuable start in using these techniques.
TPJ
Listing 1
1 #!/usr/bin/perl
2
3 # This is free software restricted by the same terms as Perl itself.
4 # (c) 2003 Luis E. Muñoz, All rights reserved.
5
87 use Fcntl;
88 use strict;
89 use warnings;
90 use IO::File;
91 use MP3::Tag;
92 use Storable;
93 use File::Find;
94 use File::Spec;
95 use File::Copy;
96 use File::Path;
97 use Pod::Usage;
98 use DB_File;
99 use Getopt::Std;
100 use Digest::MD5;
101
102 use MLDBM qw(DB_File Storable);
103
104 use vars qw($opt_c $opt_d $opt_F $opt_h $opt_l $opt_n $opt_s $opt_v $opt_V);
105
106 our $VERSION = do { my @r = (q$Revision: 1.6 $ =~ /\d+/g); sprintf " %d."."%03d" x $#r, @r };
107
108 getopts('c:d:FhlnsvV:');
109
110 if ($opt_h)
111 {
112     pod2usage
113     {
114         -verbose => 2,
115         -exitval => 255,
116         -message => "\n*** This is $0 version $VERSION ***\n\n",
117     };
118 }
119
120 if (defined($opt_s) + defined($opt_n) + defined($opt_l) > 1)
121 {
122     pod2usage
123     {
124         -verbose => 1,
125         -exitval => 255,
126         -message => "Only one of -s, -l and -n can be specified at the same time",
127     }
128 }
129
130 $opt_d ||= './mp3db';
Listing 2
199     if ($opt_s)
200     {
201         # If the song is not in the DB or the
202         # -F option is given, store it
203
204         if (! exists $db{$song->{md5}} or $opt_F)
205         {
206             _copy($song) || die "Terminating due to copy failure\n";
207             print $song->{path}, " stored\n" if $opt_v;
208             $db{$song->{md5}} = $song;
209         }
210     }
211     elsif ($opt_n)
212     {
213         unless (exists $db{$song->{md5}})
214         {
215             _copy($song) || die "Terminating due to copy failure\n";
216             print $song->{path}, "\n";
217         }
218     }
219     elsif ($opt_l)
220     {
221         if (exists $db{$song->{md5}})
222         {
223             _copy($song) || die "Terminating due to copy failure\n";
224             print $song->{path}, "\n";
225         }
226     }
227     elsif ($opt_v)
228     {
229         print join(', ', map { "$_=$song->{$_}" } keys %$song), "\n";
230     }
231 }