Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

C/C++

All Sorts of Sorts


I have already described my frustration at using parametric sort packages. And I can't say I've done any better than others with my own designs. The sort-key parser I wrote was nice, but it was hardly more flexible than the UNIX sort utility. Unless the locale file reader includes a C compiler, it is hard to provide adequate flexibility. And having written the odd C compiler over the years, I knew better than to try anything that ambitious.

Nevertheless, I did design and write several parametric versions of strcoll and strxfrm. Each worked for an interesting subset of cases. Each foundered on the shoals of cultural diversity. The problem was that I knew too much about what people wanted, but I didn't know enough about how to deliver.

What I knew was courtesy of IBM and the POSIX subcommittee on internationalization. Both organizations have studied collation rules used around the world. Both have a stake in helping computer users implement those rules. Let me give you a few examples.

The easiest collation rule to implement is one where strxfrm transforms each string unchanged. That makes strcoll equivalent to the old warhorse C function strcmp. The ordering is determined by the codes assigned to the execution character set. I chose that behavior for the C locale you get a program startup. I figured it was most culturally neutral. A rule almost as simple is to make letter comparisons case insensitive. You get this behavior by translating, say, all uppercase letters to lowercase. The mapping still produces one character for each character in the input string. That leaves you one step away from a dictionary sort. Such an ordering rule is case insensitive, but it also ignores any punctuation. In the extreme version of a dictionary sort, you discard all characters other than letters. Thus, "can't" and "cant" sort equal, and "cane" sorts before "can't" regardless of the values assigned punctuation codes. (I assume letters sort in alphabetical order.)

I've mentioned a peculiarity I discovered with Swedish telephone directories. Seems they treat I and J as equivalent in determining sort order. You can handle such rules by an obvious variation on the dictionary sort. Simply map equivalent characters to the same character in the transformed string.

Most languages other than English have characters with accent marks. Occasionally, these are encoded as a sequence of two characters -- an accent mark followed by the letter. That derives from the days of mechanical typewriters with "dead" keys for the accent marks. The carriage spaced only after you struck the letter following. The modern trend, however, is to define separate character codes for each combination of letter and accent. Display a binary file on your terminal screen, and you will probably see some of them. The ordering rules for accented characters (and other funny characters) strain the bounds of creativity. Some simply sort the same as their unaccented cousins. That's easy to deal with. Some sort as if they were some other combination of letters. (The German S-tset, which looks roughly like a Greek beta, often sorts as if it were ss.) That's still not bad.

Where life gets exciting is when you get to the disambiguating rules. Seems some cultures are happy to order accented characters the same as unaccented, provided the rest of the two strings differ. If they prove to be equal, one form of the letter precedes the other. That's somewhat messier to capture in a transformed string.

What you have to do is transform the string more than once. Take one pass over the string replacing each character with its sorting equivalent. Then tack a distinctive marker on the end. Then take a second pass over the string adding characters for each one that may have to be disambiguated.

Here is an example you can read. Let's say you want to order strings by dictionary order. That means ignoring case distinctions among letters and effectively discarding nonletters. But you also want distinct strings to compare unequal. Effectively, you want to take a second look at strings that sort equal by the dictionary rule. On the second pass, you want to perform a comparison of the original text. One way to do this is to:

  • Change all upper-case letters to lower-case and discard all the nonletters
  • Append a period to this string
  • Append the original string following the period. Thus, the string "Can't Happen" becomes "canthappen. Can't Happen". It will sort very close to "can't happen", but will not compare equal to it.

The reports I read go on for pages. Every time you think you know all possible sorting rules, some central European country throws you another breaking fast ball. It's fun reading if you're perverse.

My final solution was almost as extreme as allowing C code in locale files. I went back to the simplest programmable system that is also powerful, fast, and flexible. I coded strcoll and strxfrm in terms of a table-driven, finite-state machine. What you put in the locale file is the specifications for the code tables. Table 2, for example, is how you specify the modified dictionary sort.

collate[0,0]'-' $0 $I $1
collate[0,1:$#] $I $0
collate[0,'a':'z']$@ $0 $I $0
collate[0,'A':'Z']$@+'a'-'A" $0 $I $0 collate[1,0:$#]$@ $0 $I $1
Table 2: Specifications for dictionary sort.

Again, I don't pretend this is readable. I won't even try to interpret it for you at present.

All I will say for now is that this approach has proved fruitful. I have been able to encode a wide variety of comparison rules with this (albeit cryptic) scheme. It is satisfying terse, so as not to built up a locale file. And I think it is even reasonably efficient.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.