The Postman's Sort
Robert Ramey
Robert Ramey is a computer systems developer based in Santa Barbara, CA. He can be reached at 3949 1/2 Foothill Rd., Santa Barbara, CA 93110, (805) 569-3793.
Introduction
This article describes a program that sorts an arbitrarily large number of records in less time than algorithms based on comparison sorting can. For many commonly encountered files, the time required will be strictly proportional to the number of records. This is not a toy program; it can sort on an arbitrary group of fields with arbitrary collating sequence on each field faster than any other program available.
Most computer professionals "know" that the time required to sort a file of N records must increase disproportionately with the number of records in the file. In fact, most will cite Donald E. Knuth's book The Art of Computer Programming, Volume 3: Sorting and Searching. Knuth's chapter on minimum-comparison sorting shows that a sorting method which compares records can do no fewer than Nlog2(N) comparisons. However, this analysis specifically excludes methods that do not use key comparison. In fact, other sections of the book allude to methods whose time is proportional to the number of records sorted.
Why, then, does it seem that every sort utility uses quicksort? Practical sort utilities have to be fast and have to be able to handle a wide variety of keys and key types. The methods described in Knuth's book that do not use comparison depend on the use of fixed-length keys. Also, although they might be faster for a sufficiently large number of records, they are slow for the numbers of records usually encountered. For example, to sort a group of records on a Social Security number using radix list sort would require 30 passes through the file!
There is a way to overcome these deficiencies. In fact, the post office uses it every day. While researching this article (after writing the program), I found that Knuth alluded to it and left it as an exercise for the reader.
A Generalized Distribution Sorting Algorithm
Upon receiving a huge bag of letters, a postal clerk distributes the letters into other bags by state. The bags are sent to the respective states, where another clerk distributes the letters into bags by city. The process continues until the bags are the size one person can carry and deliver. This is the basis for my sorting method, which I call the Postman's Sort (TM).
Suppose you are given a large file or list of records to be ordered alphabetically on a particular field. Make one pass through the file, adding the record read to one of 26 sublists according to the first letter in the field. The first sublist contains all the records with fields starting with the letter A, while the last contains all the records with fields starting with the letter Z. Now you have divided the problem down to 26 smaller subproblems. The next step is to sort all the records in the A sublist. If there are no records in the A sublist you can proceed to the B sublist. If the A sublist contains only one record, the record can be written to output. If the A sublist contains more than one record, the list must be sorted, then output. Only after the A list has been disposed of can you move on to each of the other sublists in sequence.
The records in each sublist will be written to the output in alphabetical order. What sorting algorithm should be used? Just like a real postman, use the Postman's Sort, applying the method to the second letter of the field. Continue the process to greater and greater depths until all the words starting with A have been written to the output. You can then proceed to deal with sublists B through Z in the same manner.
The above algorithm is implemented in the accompanying program. (Listing 1, Listing 2, Listing 3, Listing 4, Listing 5, Listing 6, and Listing 7 do not contain the entire program, only an interesting portion; the complete program is available on the code disk.) The function sort is passed a sublist. First, it determines how many records the sublist has. If there is only one, the record is written and the function returns. If the sublist is empty, it just returns. Otherwise it advances the pointer into the sort key and continues. plan and alloc reserve memory on the stack for the sublist pointers. dist distributes the input sublist among the newly created sublists. dsort calls sort for each of the newly created sublists.
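For readers without the code disk, here is a minimal, self-contained sketch of the recursive distribution just described. It is not the code from the listings: it assumes a single key of uppercase letters only, and every name in it (node, psort_sketch) is illustrative.

/* A minimal sketch of the basic algorithm; assumes keys contain
   only 'A'..'Z'. Not the program from the listings. */
#include <stdio.h>
#include <string.h>

struct node {                 /* one record on a linked sublist */
    struct node *next;
    char key[32];             /* sort key: uppercase letters only */
};

/* Sort list 'head' on the character at position 'depth' and write
   the records to stdout in order. */
static void psort_sketch(struct node *head, int depth)
{
    struct node *bins[27];    /* bins[0]: key exhausted; 1..26: 'A'..'Z' */
    struct node *tails[27];
    struct node *p, *nxt;
    int i;

    if (head == NULL)
        return;               /* empty sublist: nothing to do */
    if (head->next == NULL) { /* one record: write it and return */
        puts(head->key);
        return;
    }
    memset(bins, 0, sizeof bins);
    for (p = head; p != NULL; p = nxt) {   /* one distribution pass */
        int b = p->key[depth] ? p->key[depth] - 'A' + 1 : 0;
        nxt = p->next;
        p->next = NULL;
        if (bins[b] == NULL)
            bins[b] = p;      /* first record in this bin */
        else
            tails[b]->next = p;
        tails[b] = p;
    }
    for (p = bins[0]; p != NULL; p = p->next)
        puts(p->key);         /* exhausted keys are fully sorted */
    for (i = 1; i < 27; i++)  /* recurse on 'A'..'Z' in order */
        psort_sketch(bins[i], depth + 1);
}

The initial call is psort_sketch(list, 0). Because the distribution is stable and bin 0 (key exhausted) is written first, equal keys emerge in input order and shorter keys precede longer ones, as required.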
Reality Check
A practical sort utility must be able to sort on any fields in any sequence with any collating sequence. The command-line syntax permits specification of delimited fields, of the start of the key within each field, and of a collating sequence for each field. (See the sidebar.) Sorting proceeds according to each key field. The first character within a field whose collating value is zero terminates the field; the rest of the field is not considered in the sort. Sorting then continues on the remaining key fields. This means that A<0>C might precede A<0>B in the final output. To prevent this, make sure that every character that could appear within a field is assigned a nonzero collating value.
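The sidebar gives the actual option syntax; purely as an illustration of the data structure such options might compile into (the table shape and names here are my assumptions, not the program's), a range option like -k 'a'-'z' can be turned into a 256-entry table whose untouched entries stay zero and therefore terminate the field:

/* Sketch: compile key ranges such as -k 'a'-'z' into a collating
   table. Entries never assigned stay 0, and a zero collating value
   terminates the field. Names are illustrative, not the program's. */
static unsigned char collate[256];   /* zero-initialized */

static void add_key_range(unsigned char lo, unsigned char hi)
{
    static unsigned char next = 1;   /* next collating value to assign */
    unsigned int c;

    for (c = lo; c <= hi; c++)
        collate[c] = next++;
}

After add_key_range('a', 'z'), collate[ch] is the distribution value for ch; any character outside the range maps to zero and ends the key.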
To be really useful, the program must take into account that records can have varying numbers of fields, that fields can have varying numbers of characters, and that the command line may specify fields and/or key characters that a particular record does not contain. In general, keys corresponding to nonexistent fields are distributed with a collating value of zero. This turns out to be the most natural, convenient, and flexible of the possibilities considered, but it introduces some complications which require refinements of the algorithm.
After distribution is completed by dist, the newly created sublists are processed by dsort. Sublists corresponding to a collating value of zero are handled first. For these sublists, dsort advances the key pointer to the next key field before continuing. The remaining sublists are handled as previously described.
Speeding Things Up
Is this the best we can do? Suppose we wanted to sort 100 decks of shuffled playing cards by value and suit. We could first distribute by suit and then distribute each pile by value. But we probably wouldn't. What we would probably do is to arrange four rows and 13 columns. Each card would be dealt to the pile corresponding to its suit and value. All the cards could be sorted in one pass.
When sorting on the basis of a key field, one way to speed things up is to distribute among a larger number of sublists, just as the cards were dealt on suit and value at once. The sublists would include <0>, A, AA, AB...AZ, B, BA, and so forth. In this case information on 1 + 26(1 + 26) = 703 sublists must be maintained in memory, but the effect is to reduce by half the number of times each key has to be addressed. The basic principle is to sort on the basis of as many bytes at a time as possible. To include all eight-bit bytes in the collating sequence, you would need 64K sublists to sort on two bytes at a time. This is not very practical (at least on Intel-type processors, where the maximum-size data segment is 64KB). Fortunately, in the real world you usually know something about the fields you want to sort on. For example, a telephone directory is sorted on names, last name first, and, since these fields contain only alphabetic characters, it's possible to take two bytes at a time. Similarly, since Social Security numbers contain only decimal digits, you can take three or four digits at a time before things get out of hand.
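To make the sublist arithmetic concrete, here is one way to map the first two key characters onto the 703 sublists, again a sketch under the alphabetic-key assumption (bucket2 is an illustrative name, not a function from the program):

/* Sketch: map two key characters onto 1 + 26*(1 + 26) = 703 sublists.
   Collating values: 0 = end of field, 1..26 = 'A'..'Z'.
   Illustrative only. */
static int bucket2(const char *key)
{
    int v1 = key[0] ? key[0] - 'A' + 1 : 0;            /* first char  */
    int v2 = v1 ? (key[1] ? key[1] - 'A' + 1 : 0) : 0; /* second char */

    if (v1 == 0)
        return 0;                  /* <0>: key already exhausted    */
    return 1 + (v1 - 1) * 27 + v2; /* A, AA..AZ, B, BA..BZ, ..., ZZ */
}

Buckets run from 0 (<0>) through 702 (ZZ), so each pass consumes two key bytes instead of one.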
The function plan, called from sort, determines the best number of bytes to take, given the collating sequences defined for the key fields. It also sets up pointers to find the key fields. The program is designed to take bytes from more than one field in a single pass. This ensures that the maximum possible number of bytes is used to distribute each record every time it is addressed. To sort a file of records on a Social Security number in the third field, in the format 999-99-9999, you could use the command line
psort -k '0'-'9' -f 2 -c 0 -c 4 -c 7

The default setup for psort will allocate sublist space to handle three decimal digits. The first time a record is addressed, it will be distributed on the first three digits. The second time, a hyphen (-) will be encountered at the next key position, so all records with the same first three digits will be added to sublist 0. The third time, distribution will be on the middle two digits. Enough space will be allocated to distribute on three digits, but only a small portion will be used, as each field is terminated by the hyphen. The fourth distribution will be on the next three digits, and the fifth will be on the last digit and the field terminator. For the same sort, the best command line would be
psort -k '0'-'9' -f 2 -c 0-2 -c 4-5 -c 7-10

In this case psort would pass through the records only three times. In general, the more information supplied on the command line, the faster the sort is likely to run.
Storage Management
Sublists are implemented as linked lists of records, so moving records around is fast and simple. As records are read from the standard input, they are processed by the first distribution pass. If all the records don't fit in memory, a temporary disk file containing the sublists is created. Compiling with the Microsoft C v5.21 compact memory model (that's small code, large data), it turns out that, for an average file of less than 280MB with random alphabetic keys, a given record will be written to and read from the temporary file at most once. If the system is MS-DOS, other problems arise first.
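The constant-time record movement that the linked representation buys is worth a small illustration. A sketch, with an assumed node and sublist layout that is not necessarily the program's:

/* Sketch: linked sublists let whole groups of records be moved in
   constant time. Layout and names are illustrative only. */
struct node {
    struct node *next;
    /* ... record text ... */
};

struct sublist {
    struct node *head;
    struct node *tail;
};

/* Append all of src to dst without copying any records. */
static void splice(struct sublist *dst, struct sublist *src)
{
    if (src->head == NULL)
        return;                      /* nothing to move */
    if (dst->head == NULL)
        dst->head = src->head;
    else
        dst->tail->next = src->head;
    dst->tail = src->tail;
    src->head = src->tail = NULL;
}

However large the sublist, splicing it onto another costs a few pointer assignments; no record bytes are touched.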
As usual, a disproportionate part of the program is taken up with issues of memory management. I won't address that subject here, as it is peripheral to the content of the article.
Comparing Apples and Oranges
Below I offer an informal analysis of the speed of the Postman's Sort for some different types of files. Comparing the Postman's Sort with comparison sorting is somewhat difficult, since the methods are so different. As a basis for comparison, I count primitive operations: each time a record is addressed or manipulated in some way counts as one operation. If a primitive operation has some sort of inner loop, as a string comparison does, each cycle of the inner loop counts as a primitive operation.
For comparison sorts, the primitive operations are comparing two bytes and perhaps moving a record. For the Postman's Sort, the primitive operation is handling one byte of the key field. This is a rough approximation, but one that I think is worthwhile.
First, consider the case where a relatively short fixed-length key is used for the sort (e.g., a sort of accounting transactions by date in format mmdd). The number of possible dates is small enough that the job can be completed in one pass through the file, distributing records among 2x10x4x10 = 800 sublists. This is specified with the command line

-k '0'-'1' -c 0 -k '0'-'9' -c 1 -k '0'-'3' -c 3 -k '0'-'9'

In this case the number of primitive operations required by the Postman's Sort will equal the number of records to be sorted times four operations per record (one for each byte of the key). Note that this is an example of a sorting method which requires time proportional to the number of records in the file. And it's not just a toy problem.
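As a sketch of the single distribution pass this implies (illustrative code, not taken from the program), the four key bytes can be combined into one index into the 800 sublists at a cost of four byte operations per record:

/* Sketch: one distribution pass on an mmdd key. The four digits
   combine into a single index into 2*10*4*10 = 800 sublists.
   Illustrative only; assumes the key is well formed. */
static int bucket_mmdd(const char *k)
{
    int month = (k[0] - '0') * 10 + (k[1] - '0');  /* 0..19 */
    int day   = (k[2] - '0') * 10 + (k[3] - '0');  /* 0..39 */

    return month * 40 + day;                       /* 0..799 */
}

Each record is touched exactly once, which is where the 4N operation count below comes from.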
Using any comparison sorting method, the number of comparisons will be at least Nlog2(N). Since the number of records far exceeds the number of possible keys, the number of primitive operations per comparison will be close to four. Using three as a conservative estimate, the total number of primitive operations will be about 3Nlog2(N). The Postman's Sort will therefore be faster than the fastest comparison sort by the factor

3Nlog2(N) / 4N = (3/4)log2(N)

For 100,000 records, this works out to about 12: the fastest comparison sort will require 12 times more primitive operations than the Postman's Sort.
Next, consider the case of a relatively long variable-length key, for example, keys made of random alphabetic characters. Using the Postman's Sort and taking two characters at a time, the first pass through the file will distribute the records among 703 sublists. Each sublist will, in turn, be distributed among 703 sublists, and so on, until each sublist holds only one record. Hence each record will be addressed approximately log703(N) times, where N is the number of records in the file. Since each addressing handles two key bytes, the total number of primitive operations will be 2Nlog703(N).
Using a comparison sort, most of the comparisons are between keys that are close together. If there are fewer than 26 records in the file, most comparisons can be expected to terminate on the first byte. If there are fewer than 26x26 records, most comparisons will terminate by the second byte, and so on. This works out to about log26(N) primitive operations per comparison. The number of comparisons is at least Nlog2(N), so the total number of primitive operations will be Nlog2(N)log26(N).
For this file the Postman's Sort will be faster than the comparison sort by the factor

Nlog2(N)log26(N) / 2Nlog703(N) = log2(N)log26(N) / (2log703(N))

or about 16 times faster for a 100,000-record file.
Just How Fast Is It?
The above analysis is rather crude in that it makes a number of assumptions to simplify the calculations. It assumes that all types of primitive operations take the same amount of time to complete, and it doesn't allow for the fact that a practical program spends a lot of time in storage management overhead.
To form a more concrete comparison, I generated test files consisting of variable-length records of random alphabetic characters, with an average record length of 15 bytes. The files were 1,000, 10,000 and 100,000 records long (15KB, 150KB, and 1.5MB). For comparison purposes, I used the sort program from the MKS Toolkit from Mortice Kern Systems, assuming that it uses some sort of comparison sort. The results are summarized in Table 1.
My machine is a 12-MHz AT-compatible with two 28ms Seagate 251-1 disk drives. One drive was used for temporary files. The number of drives used greatly affected the time for a comparison sort of a large file, but it had little effect on the time required for the Postman's Sort.
The test shows that while the Postman's Sort isn't as fast as my theory predicts, it's still twice as fast as the next best thing.
Final Observations
One of the most gratifying features of the Postman's Sort is that the first records appear on the standard output before the input has even been completely read! In the test file, records that contain just a newline character are output as they are read in. When the last record has been read, the first non-null records appear almost immediately. On a multitasking system with pipes, much time can be saved, because the next process in the pipeline can proceed in parallel with the sorting process. So even though the sort itself is only twice as fast, jobs that use it may be improved much more.
© 1991 by Robert Ramey, all rights reserved.