brian has been a Perl user since 1994. He is founder of the first Perl Users Group, NY.pm, and Perl Mongers, the Perl advocacy organization. He has been teaching Perl through Stonehenge Consulting for the past five years, and has been a featured speaker at The Perl Conference, Perl University, YAPC, COMDEX, and Builder.com. Contact brian at [email protected].
Lately, I've had to do a bit more work on my e-mail inbox to separate the good messages from spam. With some pipelines, some Perl, and a little bit of procmail, I get most of the work done although I have to keep trying new things to keep up with the spammers' tricks.
Spam is getting to be a bigger and bigger problem and many people are saying that most of the e-mail they get is now spam. That's certainly true for me: I get over a hundred spam messages for every valid one.
The trick, so far, is filtering the messages to pick the few good messages out of the torrent of spam and there are plenty of tools to do this. I use a procmail tool called SpamBouncer by Catherine Hampton, and the Spam Assassin Perl package is very popular.
I filter the spam to a folder that I can check later and, when I do, I usually find that a message was incorrectly identified as spam. The sender may use a service provider that I identify as a spam relay, the message may contain an unfortunate combination of keywords (like "On my vacation, I was on the bank of this Nigerian river") or many other things that inadvertently trigger the spam. Each time I find one of these false positives, I add that address to the white list.
Before I start the other spam filters on my incoming mail, I check if the From address is in my local white list. I filter those directly into my inbox:
:0 f * ? formail -x"From:" | egrep -is -f whitelist.txt | /path/to/my/inbox
I seeded my white list with all of the addresses that I had answered in the past three months by using a quick and dirty command pipeline with three parts. Each part does a little bit of processing then hands off the result to the next, which does another little part. This comes from the UNIX philosophy to use a lot of small tools that do specific jobs rather than larger programs that try to do several things at once. I can string together small programs in many ways to do many different tasks:
grep ^From: answered-mail | perl -ple 'chomp; s/^From:\s+//; s/\s*\(.*?\)\s*//; s/.*<(.*@.*)>.*/$1/;' | sort -u
The grep pulls out the From: line in each message (and that picks up the original address in forwarded messages, too, but I decided to keep thoseif I trust the person who forwarded it, I should probably trust the original sender as well). Out of the grep command flows a list of From: lines that goes through the pipe operator and becomes the input of the perl command.
The perl one-liner looks a little too simple, and although I tried more complex Perl, I decided that simpler was better. I do a chomp and three substitutions to get rid of everything in the line but the e-mail address: one to get rid of the From: at the front of the line, and the other two to get rid of everything else but the e-mail address, which usually comes in one of three formats:
- [email protected]
- [email protected] (Barney Rubble)
- "Barney Rubble" <[email protected]>
The second substitution, s/\s*\(.*?\)\s*//, gets rid of the bits in the parentheses, including the whitespace around them, leaving the first and second examples as just the e-mail address. The last substitution, s/.*<(.*@.*)>.*/$1/, matches the bits in angle brackets and gets rid of everything else. There are some cases that get past these substitutions, but in the rare case that happens, I can fix it by hand quicker than I can write code that will handle every possible case.
Out of the Perl portion flows an unordered list of e-mail addresses in whatever order they appear in the input and probably with a lot of duplicates. The sort(1) command takes care of those. It puts the addresses in order and the -u switch removes duplicates.
This is where things get interesting because, even though I've seeded my file, I don't like the order that sort(1) spits out. The various sort functions (command-line and perl's) sort "ascii-betically." They don't know anything about what the data represents or why I care about it. The default sort functions simply compare the inputs character-by-character and, if I use sort(1), my white list is going to be sorted as meaningless strings.
The lines in my white list do have meaning, though. Each line is an e-mail address and, in each address, there is a user name and a domain name. Instead of sorting as meaningless strings, I want to order the addresses by their domain, and within each domain, by the user. For instance, I want all of the "stonehenge.com" addresses to show up before the "tpj.com," and within all the "stonehenge.com" addresses, I want "brian" to show up before "merlyn." That makes it easy for me to see all of the addresses in a particular domain, which I sometimes need if I want to quickly inspect the white list for all of the addresses in a particular domain.
This kind of sort is easy in Perl and it is another pipeline, sometimes called a "Schwartzian Transform." I start with an unsorted list of addresses and I want to end up with a list of unique addresses sorted by domain name. I am going to make my program "pipeline-able." I take input from either the files specified on the command line or standard input. At the end of processing, I send the result to standard output so I can pipe it into another command if I like:
#/usr/bin/perl $, = "\n"; print map { $_->[0] } sort { $a->[2] cmp $b->[2] or $a->[1] cmp $b->[1] } map { [ $_, split /@/ ] } keys %{{ map { chomp; s/\s*\(.*?\)\s*//; s/.*<(.*@.*)>.*/$1/; ( lc $_, 1 ) } grep { s/^From:\s+// } <> }};
There are pipelines within the program, too. I use several Perl list operators and they do the same thing that pipelines do: Take an input list, do something to it, and output another list. I can line these up just like I did with programs on the command line.
I find it easiest to read Perl pipelines from bottom to top (or right to left). Each part of the pipeline depends on the part that comes after it, so perl has to evaluate that part first.
Knowing that, I have two parts to my program: the chunk after the print() and the part after the keys(). The keys() portion gets the addresses and pulls out the unique ones. It is the same thing that I did on the command line, but all in Perl. The print() portion handles the ordering using a map-sort-map pipeline. Each one is a pipeline and the keys() pipeline flows into the one above it.
Inside the keys() pipeline, I have to do a bit of trickery. The keys() function takes a hash, so I need a hash. I know I want to avoid intermediate variables; thus, I start with %{ }. That dereferences a hash reference. Inside that dereference, I need a hash reference, so I use the anonymous hash constructor, {}. When I put those together, I get doubled curly braces %{{ }}.
Inside my doubled curly braces, I have a pipeline built from two list operators and the file input operator (aka, the diamond operator), <>. Without command-line arguments, the <> reads from standard input. It spits out lines of input that become the input for the grep(), which takes the place of the command-line program of the same name. The grep() only passes through lines that make its condition, s/^From:\s+//, return True. The s/// operator returns True if it was able to make the substitution, and I match lines beginning with From: at the same time that I take that string off of the line. The matching lines flow into the map(), which handles most of the stuff I did in my command-line version. At the end of the map(), I return a short list, the e-mail address, and 1, which I use as a key-value pair. All of these key-value pairs come out of the map(), and the anonymous hash constructor puts them together for the keys() function.
The keys() pulls off the unique e-mail address. If I had duplicate addresses in the input, I got rid of the extras when I used the address as a hash key. The keys() command turns the hash into a list I pass into the next major part: the Schwartzian Transform that will order the address.
The Schwartzian Transform has already been dissected at length elsewhere, so I'll skip a lengthy discussion. In short, the transform is a map-sort-map and something other than the original input to order the elements. In the first map(), I create an anonymous array of three elements: the original string in $_, and the user and domain portions of the address I get from splitting the original string. I pass those anonymous arrays to the sort, which compares the domain ($a->[2]) and, if those match, compares the user ($a->[1]). Once ordered, the last map() extracts from each anonymous array the first element, which is just the original sting and the address that I want.
The last step is a print(). I might want to use this program in another pipeline, so I output all of the addresses, one address between line. I previously set the Perl special variable $, which inserts its value between each element of the list I give to print(). In this case, I put a newline between each address and send them to standard output.
Now with this program, I can add new addresses to my white list with a much shorter command line. With a bit of output redirection (>>), I can add those directly to my white list:
% sort_addresses >> white_list.txt
And, I can even use it to reorder my white list.
% mv white_list.txt white_list.bak % sort_addresses white_list.bak > white_list.txt
Because I wrote my program as a pipeline, I can use it to do other things, too, such as list everyone who sent me mail in a particular month:
% sort_addresses read-mail-jun-2004
I can even use this program directly from my newsreader. I use PINE, which has an enable-unix-pipe-cmd option (although your system administrator may have turned this off). While I highlight a message in the index view or read a message, the | key leads to a prompt and, as long as my program is in my path, I can type just its name. PINE shows me the program output:
Pipe message 37 to : sort_addresses
I can also redirect that output directly to my white list:
Pipe message 37 to : sort_addresses >> white_list.txt
Now, I have a program that I created to update my white list so I could pick out the good e-mail addresses from the flood of spam but, since I wrote it to work in a pipeline, I can use it for other things also. Perl lets me do this because it has list operators that also act as pipelines. You may not want to make your entire program one long pipeline, but you can certainly use smaller versions of this example to streamline parts of your program.
References
procmail: http://www.procmail.org/
SpamBouncer: http://www.spambouncer.org/
PINE: http://www.washington.edu/pine/
Schwartzian Transform: http://www.stonehenge.com/merlyn/ UnixReview/col06.html
TPJ