Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Database

Algorithm Alley


Oct00: Algorithm Alley

David is a programmer/analyst at the UCLA Center for Digital Innovation and runs FileJockey Software. He can be reached at http://www.FileJockeySoftware.com/ or [email protected].


There are many occasions when character strings, such as filenames, might contain sequence numbers. For example, images in a web page may be numbered since it is not always worthwhile to assign meaningful names. When such strings are sorted, they look wrong if they increase, while the numbers in these strings sometimes decrease. For example, the sequence Img1, Img10, Img2, ..., Img9 would look right and be easier to read if it were arranged as Img1, Img2, Img3, ..., Img10.

A program to sort strings combines a sorting algorithm such as QuickSort() with an ordering function such as C's stricmp(). When stricmp() or a similar ordering function is selected, the result is called "alphabetic sorting" because the program compares one character at a time from each string using the computer's alphabet.

In contrast, an alphanumeric-ordering function treats digits in strings as parts of numbers, and orders these numbers properly. Then numbers within strings increase when the strings increase. The benefits of ordering in this way include ease of reading sorted strings and it being easier for a person to find the string containing a specified number or the highest number. In this article, I'll present a simple alphanumeric-ordering function and describe another function suitable for comparing long filenames.

Workarounds

There are two ways of formatting numbers within strings so that increasing alphabetically ordered strings have increasing numbers -- add leading zeros or leading spaces. These methods consist of deciding upon a maximum number of digits and prepending numbers that have fewer digits with either zeros or spaces.

How do you decide how many digits to use? If you guess wrong, you either type unnecessary character(s) for each string, have to rename all previous strings once you use a number with more than the selected number of digits, or accept some strings not being positioned properly. In addition, these methods cannot be used with read-only strings, may reduce readability, and are a nuisance to apply manually.

Simple Alphanumeric Ordering

Listings One and Two are source code for simple alphanumeric ordering. These files are written in C and include the instructions needed for C++ compilation. If you compile for an operating system other than Windows, you may need to remove the call to _strnicoll() (this function provides locale-based caseless comparisons). In many cases, you will not notice a difference in the sorting order between using this function and subtracting lowercase characters.

To recognize digits within strings, I use the isdigit() function. Although this function works in a static library, I found that it fails when included in a DLL. If you use isdigit() in a DLL, test it to see if it works properly. If not, substitute something like '0' <= c && c <= '9'. Instead of being defined as specified, isdigit() is defined in <ctype.h> using a bit mask. As a result, it may be slightly faster than using the previous pair of tests.

The function SimpleANCompare() in Listing Two loops through a pair of strings based on their respective character positions. If both strings have a digit at their current position, then this function uses atoi() to compare the numbers within these strings. By design, atoi() reads characters until it encounters a character that is not part of a number. In contrast, Java's Integer.parseInt() requires a substring that contains only digits (other than a minus sign for the first character).

If the numbers in both strings are equal, then SimpleANCompare() moves past the digits and returns to the top of the loop. If not, this function returns an integer that is useful to a sorting algorithm. By convention, this integer is negative if parameter one is less than parameter two, zero if they are equal, and positive if parameter one is greater than parameter two.

If at least one string does not have a digit at its current position, then the current characters are compared alphabetically using either _strnicoll() or subtracting the lowercase equivalents of these characters. You can turn locale sorting on or off by calling SetSNSLocaleSort() with an integer that is nonzero or zero, respectively.

SimpleANCompare() works when the first digits of whole numbers to be compared start at the current position in each string and these numbers are less than 231 or 2,147,483,648 (so that atoi() provides valid results). Notice that since this function moves past equal numbers, it can handle strings that contain several numbers in which the first ones are equal. As a result, it can compare strings that contain multiple dots between numbers, such as IP addresses. Figure 1 shows the results of simple alphanumeric sorting applied to selected filenames.

Full Alphanumeric Ordering

Win32 began allowing filenames with up to 255 characters starting with Windows 95. These filenames have several features that affect how to order them alphanumerically. To begin with, as a result of being long, such filenames may contain numbers greater than 231-1. Comparing such numbers cannot be done with atoi().

When you copy a file and paste it into its current folder, Windows will make a file whose name starts with "Copy of." If you copy it again, the new filename will begin with "Copy (2) of." Subsequent copies will have increasing numbers in their parentheses. Handling a set of such filenames requires allowing for numbers in parentheses and dealing with the missing number.

In addition to allowing more characters in filenames, Windows now allows multiple dots in filenames. With this change, filenames could contain fractional (or decimal) numbers. An example is "Item 3.879.txt." At first it seems that the logic mentioned earlier for multiple-dot numbers would work here too. However, the fractional part should be sorted alphabetically so that 3.879 is less than 3.9.

Figure 2 shows how simple alphanumeric sorting works on selected filenames, and Figure 3 shows how full alphanumeric sorting arranges these filenames. Notice that the full version addresses all of the problems previously mentioned. (You can download the SortNice demo program from http://www .hotfiles.com/?0011FB.)

Windows Programming Attempts

While attempting to incorporate alphanumeric ordering into Windows components, I found that the ListBox control did not provide useful WM_COMPAREITEM messages. (This may have been fixed by now.) In contrast, the ListCtl component does allow for substituting a comparison function to replace alphabetic ordering. As a result, I use this control in my SortNice application.

Using the Spy++ tool that comes with Visual C++, I determined that the filename-list window of the standard File Open/Save dialog box also uses a ListCtl component. However, I have not found a way to replace the ordering function used by this dialog box with my function.

Pursuing this matter further may not be worthwhile since incorporating alphanumeric ordering into some File Open/Save dialog boxes but not others would present an inconsistent experience to users. On the other hand, programs that show a list of files within an archive or sort other alphanumeric text would benefit from using alphanumeric ordering.

Conclusion

In many situations, the simple alphanumeric-ordering function presented can properly arrange strings containing numbers. Such strings include sequentially numbered filenames and IP addresses. When this function is not adequate, the full version may be needed. As shown here, this version is suited for Windows long filenames. Due to its generality, it should also work for filenames allowed by other operating systems. To provide a consistent experience to users, full alphanumeric ordering should be incorporated at the operating-system level and provided as an API function.

DDJ

Listing One

// SNSimpleLib.h -- Alphanumeric Ordering
#ifndef SNSIMPLELIB_H
#define SNSIMPLELIB_H
#ifdef __cplusplus
extern "C" {
#endif

int  SimpleANCompare (const char *pszName1, const char *pszName2);
void SetSNSLocaleSort (int bLocaleSort);

#ifdef __cplusplus
}
#endif
#endif  // SNSIMPLELIB_H

Back to Article

Listing Two

// SNSimpleLib.c -- Alphanumeric Ordering
#include <ctype.h>      // for isdigit, tolower
#include <stdlib.h>     // for atoi
#include <string.h>     // for _strnicoll (Windows)
#include "SNSimpleLib.h"

static int g_bLocaleSort = 1;

// Action:  Simple alphanumeric comparisons.
// Notes:   Works when there is a digit in the same starting place in
//    each filename. Limited to numbers smaller than 2^31, around 2 billion.
int SimpleANCompare (const char *pszName1, const char *pszName2) {
    register const char *pszN1 = pszName1, *pszN2 = pszName2;
    int nTest;
    while (*pszN1 != '\0' && *pszN2 != '\0') {
        if (isdigit (*pszN1) && isdigit (*pszN2)) {
            if ((nTest = atoi (pszN1) - atoi (pszN2)))
                return nTest;
            // Moves past the numbers
            while (isdigit (*pszN1)) pszN1++;
            while (isdigit (*pszN2)) pszN2++;
        }
        else {
            if (g_bLocaleSort)    // locale caseless match
                nTest = _strnicoll (pszN1, pszN2, 1);           
            else                  // caseless match
                nTest = tolower (*pszN1) - tolower (*pszN2);
            if (nTest)
                return nTest;
            else {
                pszN1++;  pszN2++;
            }
        }
    }
    return (*pszN1 - *pszN2);   // one string has ended
}   // SimpleANCompare

// Action:  Turns locale sorting on or off.
// Note:    This functionality may need to be removed for non-Windows OSes.
void SetSNSLocaleSort (int bLocaleSort) {
    g_bLocaleSort = bLocaleSort;
}

Back to Article


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.