C/C++

All Sorts of Sorts

By P.J. Plauger, June 07, 2007

I have already described my frustration at using parametric sort packages. And I can't say I've done any better than others with my own designs. The sort-key parser I wrote was nice, but it was hardly more flexible than the UNIX sort utility. Unless the locale file reader includes a C compiler, it is hard to provide adequate flexibility. And having written the odd C compiler over the years, I knew better than to try anything that ambitious.

Nevertheless, I did design and write several parametric versions of strcoll and strxfrm. Each worked for an interesting subset of cases. Each foundered on the shoals of cultural diversity. The problem was that I knew too much about what people wanted, but I didn't know enough about how to deliver.

What I knew was courtesy of IBM and the POSIX subcommittee on internationalization. Both organizations have studied collation rules used around the world. Both have a stake in helping computer users implement those rules. Let me give you a few examples.

The easiest collation rule to implement is one where strxfrm transforms each string unchanged. That makes strcoll equivalent to the old warhorse C function strcmp. The ordering is determined by the codes assigned to the execution character set. I chose that behavior for the C locale you get a program startup. I figured it was most culturally neutral. A rule almost as simple is to make letter comparisons case insensitive. You get this behavior by translating, say, all uppercase letters to lowercase. The mapping still produces one character for each character in the input string. That leaves you one step away from a dictionary sort. Such an ordering rule is case insensitive, but it also ignores any punctuation. In the extreme version of a dictionary sort, you discard all characters other than letters. Thus, "can't" and "cant" sort equal, and "cane" sorts before "can't" regardless of the values assigned punctuation codes. (I assume letters sort in alphabetical order.)

I've mentioned a peculiarity I discovered with Swedish telephone directories. Seems they treat I and J as equivalent in determining sort order. You can handle such rules by an obvious variation on the dictionary sort. Simply map equivalent characters to the same character in the transformed string.

Most languages other than English have characters with accent marks. Occasionally, these are encoded as a sequence of two characters -- an accent mark followed by the letter. That derives from the days of mechanical typewriters with "dead" keys for the accent marks. The carriage spaced only after you struck the letter following. The modern trend, however, is to define separate character codes for each combination of letter and accent. Display a binary file on your terminal screen, and you will probably see some of them. The ordering rules for accented characters (and other funny characters) strain the bounds of creativity. Some simply sort the same as their unaccented cousins. That's easy to deal with. Some sort as if they were some other combination of letters. (The German S-tset, which looks roughly like a Greek beta, often sorts as if it were ss.) That's still not bad.

Where life gets exciting is when you get to the disambiguating rules. Seems some cultures are happy to order accented characters the same as unaccented, provided the rest of the two strings differ. If they prove to be equal, one form of the letter precedes the other. That's somewhat messier to capture in a transformed string.

What you have to do is transform the string more than once. Take one pass over the string replacing each character with its sorting equivalent. Then tack a distinctive marker on the end. Then take a second pass over the string adding characters for each one that may have to be disambiguated.

Here is an example you can read. Let's say you want to order strings by dictionary order. That means ignoring case distinctions among letters and effectively discarding nonletters. But you also want distinct strings to compare unequal. Effectively, you want to take a second look at strings that sort equal by the dictionary rule. On the second pass, you want to perform a comparison of the original text. One way to do this is to:

Change all upper-case letters to lower-case and discard all the nonletters
Append a period to this string
Append the original string following the period. Thus, the string "Can't Happen" becomes "canthappen. Can't Happen". It will sort very close to "can't happen", but will not compare equal to it.

The reports I read go on for pages. Every time you think you know all possible sorting rules, some central European country throws you another breaking fast ball. It's fun reading if you're perverse.

My final solution was almost as extreme as allowing C code in locale files. I went back to the simplest programmable system that is also powerful, fast, and flexible. I coded strcoll and strxfrm in terms of a table-driven, finite-state machine. What you put in the locale file is the specifications for the code tables. Table 2, for example, is how you specify the modified dictionary sort.

collate[0,0]'-'	$0 $I $1
collate[0,1:$#]	$I $0
collate[0,'a':'z']$@	$0 $I $0
collate[0,'A':'Z']$@+'a'-'A"	$0 $I $0	collate[1,0:$#]$@	$0 $I $1

Table 2: Specifications for dictionary sort.

Again, I don't pretend this is readable. I won't even try to interpret it for you at present.

All I will say for now is that this approach has proved fruitful. I have been able to encode a wide variety of comparison rules with this (albeit cryptic) scheme. It is satisfying terse, so as not to built up a locale file. And I think it is even reasonably efficient.

Previous 1 2 3 4

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

C/C++

All Sorts of Sorts

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

C/C++

All Sorts of Sorts

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

C/C++ Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content