Text Editors: Algorithms and Architectures

From Stallman's GnuEmacs to Microsoft's Word, text editors are one of the most taken-for-granted, yet most often used, applications around. When done right, however, the choice of core algorithms and how they're implemented in the overall architecture can make the difference between a good editor and a great one.


April 01, 1993
URL:http://drdobbs.com/architecture-and-design/text-editors-algorithms-and-architecture/184408975

This article discusses algorithms for implementing text editors. You may perhaps think that such a subject is passe: "Why do I need to know about this? My nifty new programming language (or class library or application framework) lets me write a text-editor application in just 12 lines of code." That may be true--and if your application's text-editing requirements are fully satisfied by the built-in edit widget that came with your programming environment--then party on, dude.

But, in the Windows environment at least, most text-editing chores ultimately fall on the shoulders of the built-in Edit control, which is poorly equipped to handle tasks like tabbing, multiple fonts and point sizes, and large files. Those 12 lines of code using your whizzy class library are often just the overhead for invoking a class that ultimately invokes the built-in Edit control.

Ironically, the genesis for this article was a conversation with the designer of an award-winning app framework, in which I was asked to suggest articles on text-editor algorithms that could be of help in creating a more powerful text-edit class for his tool. It turns out there isn't much written on the subject, despite the dozens of public-domain editors available in source-code form. Instead, there is an oral history of text-editor algorithms, passed down from master to novice as part of a rite of passage.

The Simplest Case

Let's consider the simplest example of text editing. Listing One consists of one routine, EditLine(), which allows the user to edit a single line of text on the screen. Chances are you've written a routine resembling this. The sample spreadsheet program in Borland's Turbo C++ package, for example, has a similar routine (and, in fact, this example is an adaptation of that routine).

Listing One


/* EditLine() -- The simplest text editing routine */
void EditLine(char* buffer, int max_length, int curr_row)
{
    int c, str_length  = strlen(buffer), curr_column = str_length,
           insert_mode = TRUE;
    ChangeCursorShape(insert_mode);
    do
    {   vt_ClearLineAt(curr_row,0);
    vt_OutputStringAt(buffer, curr_row,0);
    vt_SetCursorPositionAt(curr_row,curr_column);
    switch(c = vt_GetKeystroke())   /* dispatch on user's keystroke */
    {
         /*-------- keystrokes that terminate the editing session-----*/
         case ESCAPE_KEY:
         case ENTER_KEY:     break;
        /*--------- keystrokes that merely change the cursor position---*/

       case HOME_KEY:      curr_column = 0;                break;
        case END_KEY:       curr_column = str_length;           break;
            case LEFT_KEY:      if (curr_column > 0) curr_column--; break;
        case RIGHT_KEY:     if (curr_column < str_length) curr_column++;
                break;
        case INSERT_KEY:    insert_mode = !insert_mode;
                ChangeCursorShape(insert_mode);
                break;
        /*------ keystrokes that alter the contents of the buffer----*/
        case BACKSPACE_KEY: if (curr_column > 0)
                {
                    movmem( &buffer[curr_column],  /*source*/
                        &buffer[curr_column-1], /*dest*/
                        str_length - curr_column + 1);
                    curr_column--;
                    str_length--;
                }
                break;
         case DELETE_KEY:   if (curr_column < str_length)
                {
                    movmem( &buffer[curr_column+1], /*source*/
                    &buffer[curr_column],       /*dest*/
                    str_length - curr_column);
                    str_length--;
                 }
                 break;
         default:        if (((c >= ' ') && (c <= '~')) &&
                        (str_length < max_length))
                 {
                    if (insert_mode)
                    {
                        movmem(
                            &buffer[curr_column],
                        &buffer[curr_column + 1],
                        str_length - curr_column + 1);
                        str_length++;
                    }
                    else if (curr_column >= str_length)
                        str_length++;
                    buffer[curr_column] = c;
                    curr_column++;
                  }
                  break;
         }
         buffer[str_length] = '\0';
    }

    while ((c != ENTER_KEY) && (c != ESCAPE_KEY));
}

EditLine() accomplishes, in scaled-down form, many of the basic tasks that required of a full-fledged text editor:

This sequence repeats endlessly until the Return key is pressed, providing us with the most primitive example of an event-driven program.

In looking at the code, you'll notice that this routine does a lot of unnecessary work (which, in the case of small amounts of text, does not impact the user). For example, when only the cursor is moved and no text has changed, the entire line of text is cleared and redrawn. Further, the same happens in cases where only one character has changed. Likewise, when a character is deleted, it's only necessary to display the text on the right-hand side of the cursor. Not printed here, but included with the electronic version of the listings (see "Availability," page 5), is an improved EditLine2() routine, which contains these and other obvious optimizations.

Scaling Up to Spaghetti

Although a full-fledged word processor or commercial desktop-publishing program is in some sense a beefed-up version of EditLine(), it isn't a trivial matter to scale up this routine to a real program. Time and again, novice programmers have followed the garden path of extrapolating from this simple function into a full-screen text editor via incremental improvements. Sometimes the result is eminently usable, but just as often the program collapses under the accumulated weight of haphazard changes. Those who persist end up rediscovering many of the architectural principles and algorithms described here.

Because text editors aren't traditional subjects for computer-science textbooks, written materials on implementing them is scarce. By themselves, the algorithms used in text editors aren't complex or challenging in a formal mathematical sense. Rather, the challenge is more pragmatic in nature--to make a disparate collection of simple algorithms work together in a coherent manner. Still, I did run across one excellent resource, Craig Finseth's The Craft of Text Editing: Emacs for the Modern World (Springer-Verlag, 1991). Finseth is coauthor of several Emacs-type editors, including Mince, FinalWord, and Freyja. I also found a similar, but less complete, discussion in the electronic archives of the Internet news group comp.editors. The files editech.[1-4].Z contain short but illuminating comments by Joseph Allen and Stephen Trier on how to implement a text editor.

However, none of these authors place text-editor algorithms in the larger context of interactive graphics programs. Implementing a text editor is a subclass of a more general problem--the same problem faced by implementors of "draw" programs and other graphical editing tools. Whether manipulating textual or graphical objects, the fundamental goal is the same: to maintain a consistent mapping between a data model that exists in an application-specific data space, and a visual representation of that model that exists in a graphics coordinate space. The well-known Smalltalk Model/View/Controller paradigm is an equivalent way of expressing this relationship.

Implementation decisions fall into the following categories:

For each category, there are various choices, each reflecting a trade-off between performance, resource consumption, and ease of implementation.

Over the years, I've written a number of editors for DOS, Sun, Windows, and Macintosh platforms and have worked with the code for a number of high-end electronic publishing systems. I've also gone over the code for a number of publicly available editors, including Richard Stallman's GnuEmacs, Craig Finseth's Freyja, Dan Lawrence's Micro-Emacs, Jonathan Payne's Jove, and on the DOS platform, Marc Adler's ME, Al Stevens's MemoPad, and Fook Eng's Chi. These implementations span a range of functionality and implementation strategies, from the monumental 1.5 million lines of code in Interleaf 5.0 and the 150,000 lines of Lisp and C code that constitute GnuEmacs, to the medium-weight 25,000 lines of MicroEmacs and 14,000 lines of MemoPad, to the modest 4000 lines of Chi. Each implementation is a unique mix of choices and trade-offs.

The topic of text editors is among the most deeply personal of subjects for many programmers, giving rise to many a flame war on Usenet, CompuServe, and BBS systems. It's no surprise that one of the Internet newsgroups is called alt.religion.emacs. The following survey of techniques tries to be as nondenominational as possible.

Conceptual View of Data

Perhaps the most basic design decision is what conceptual view of text data to present to the user. Most programming editors regard a text file as a one-dimensional array of ASCII characters that gets mapped to a two-dimensional grid of screen positions. An exception is Chi, which treats text as a one-dimensional array of lines, each of which is a one-dimensional array of characters. This means that lines do not wrap at the right edge of the screen, making for a simple and fast implementation--but one that's cumbersome to use when typing narrative text (as opposed to source code).

To maintain the view of text as an uninterrupted stream of characters, you can use a number of data structures, such as linked-list structures, the buffer-gap structure, and virtual-memory blocks.

Plain ASCII text is fine for editing program source code, but other uses require additional attributes to be associated with the text stream. These attributes control the way text is formatted or presented on the screen: typeface, point size, justification, and so on.

One implementation choice is to embed these presentation attributes into the text stream, thereby mixing formatting commands with text content. This approach is used in many older-generation typesetting machines, back before WYSIWYG. In general, this is also the approach that the future version of GnuEmacs will use. Version 19 of GnuEmacs, to be released later this year, will view a file as a stream of first-class Lisp objects that can represent text characters, formatting commands, events such as mouse-clicks, or any arbitrary Lisp function.

A different way to deal with text attributes is to maintain parallel streams--one of content, the other of presentation attributes related to content. This is used in Microsoft Word. Likewise, on the Macintosh, the environment provides an equivalent to the Windows Edit control called TextEdit that allows for "runs" or sequences of text styles, a "text style" being defined as a particular combination of font and point-size attributes. Text styles are implemented as an array of style records that point to locations in the text stream where a particular "run" begins. There's an elaborate figure on page 15-38 of Inside Macintosh Volume VI (Addison-Wesley, 1992) that shows how style records are maintained in relation to the text. In System 7.1, the WorldScript facility generalizes the notion of runs even further.

As you step up from ASCII text editors and simple word-processors to more sophisticated document-processing and desktop-publishing programs, text can no longer be regarded as a one-dimensional array of characters (even when associated with a parallel stream of text attributes). In document processors such as FrameMaker, Interleaf, or Ventura Publisher, the text content has its own elaborate structure -- words, sentences, paragraphs, subsections, chapters, appendices, and volumes. This is known as a "semantic" or "logical" structure, in contrast to the geometrical or visual structure of the presentation. Document-processing programs have the most complex task of all graphical editors, to maintain a consistent mapping between two tree-like structures: the semantic hierarchy of text content and the geometrical hierarchy of pages, columns, frames, and lines. Dealing with this complexity in an interactive, optimized manner is what leads to million-line-plus programs.

The Machine Representation of Text

In the EditLine() example, text is represented in the machine as a single array of ASCII-encoded bytes. Inserting or deleting a character merely requires calling movmem() to shift every byte by one memory location. Depending on the CPU, this brute-force method can work for even medium-to-large amounts of text. At some point, of course, this profligate expense of machine cycles becomes unworkable. Then the "buffer-gap" approach comes into play.

The buffer-gap approach divides the single array of characters into two parts, separated by a movable gap. The gap is an internal construct, not visible to the user. From the user's point of view, text remains in an unbroken stream. As the user navigates over the text, moving the cursor from one character to the next, the system updates the corresponding pointer in the text data structure, skipping over the buffer gap as necessary. When the user enters in a bunch of text, the system shifts the gap over to the point of insertion, then shrinks the gap by one character for each keystroke. This method avoids most of the shuffling and reshuffling of text required by the earlier approach.

Implementing a buffer-gap manager is not difficult, but requires attention to detail to avoid fencepost errors. As Finseth points out, three coordinate systems are in play at the same time. In the user coordinate system, location 0 corresponds to the position before the first character of the text. Note that coordinates label the positions between the characters, rather than the characters themselves. This is similar to a 2-D graphical coordinate system, such as that used by QuickDraw or Windows GDI, which labels the positions between pixels rather than the pixels themselves.

Second, there's the buffer-gap coordinate system, which is the same as the user coordinate system, except that the continuum is broken up by the variable-length gap. The third system is the storage coordinate system, which labels the memory locations where characters are stored (rather than the positions between them) and is the one used by pointers to memory. If you don't scrupulously maintain the distinction between these three coordinate systems, you'll be plagued by an ongoing cascade of fencepost errors. Fortunately, the code available electronically contains bufgap.c, a module that implements all the basic functions for managing a buffer gap--inserting and deleting text, moving the gap, expanding the gap, searching the buffer for a particular string, and so on. Ecerpts from buf_gap.c are shown in Listing Two. This code is heavily based on an example posted by Joseph Allen to the Internet on September 10, 1989. The module is not a stand-alone program, but assumes other modules for input, command dispatching, redisplay, memory allocation, and screen output.

Listing Two

/***************************************************************************
Excerpts from BUF_GAP.C--buffer gap manager module. Derived by Ray Valdes from
code by Joseph H. Allen, who wrote in his post to the comp.editors newsgroup
on 9/10/89: "Do whatever you like with this, just leave my name on it."
***************************************************************************/

private unsigned sizeofBuffer;    /* The size of theBuffer */
private char*    thePoint;        /* The point */
private char*    theBuffer;       /* The buffer */
private char*    theEndOfBuffer;  /* First character not in buffer */
private char*    theStartOfGap;   /* Beginning of theStartOfGap */
private char*    theEndOfGap;     /* First character not in theStartOfGap */
private bool     isBufferChanged; /* Set when file has been changed */

#define SIZEOF_GAP_INCREMENT   16384  /* Amount that the buffgap grows by */

/****************************************************************/
public bool
bg_InitializeModule(void)
{   sizeofBuffer = SIZEOF_GAP_INCREMENT;
    theBuffer = (char* ) mem_AllocMem(sizeofBuffer);
    if(!theBuffer) return FALSE;
    thePoint      = theBuffer;
    theStartOfGap = theBuffer;
    theEndOfGap   = theBuffer + SIZEOF_GAP_INCREMENT;
    theEndOfBuffer= theEndOfGap;
    return TRUE;
}
/****************************************************************/
public void
bg_ExpandBuffer(unsigned amount)
{
   if( (theEndOfBuffer + amount - theBuffer) > sizeofBuffer)
    {   char* old = theBuffer;
        sizeofBuffer = theEndOfBuffer + amount
                 + SIZEOF_GAP_INCREMENT - theBuffer;
        theBuffer = (char* ) mem_ReallocMem(theBuffer, sizeofBuffer);
        if(!theBuffer) ProgramError("ReallocMem failed!");
        thePoint       += theBuffer - old;
        theEndOfBuffer += theBuffer - old;
        theStartOfGap  += theBuffer - old;
        theEndOfGap    += theBuffer - old;
    }
}
/****************************************************************/
public void
bg_MoveGapToPoint(void)
{
    if(thePoint==theStartOfGap) return;
    if(thePoint==theEndOfGap)  { thePoint = theStartOfGap; return; }
    /*else*/
    if(thePoint < theStartOfGap)
    {   bg_MoveBytes( theEndOfGap - (theStartOfGap-thePoint),
                      thePoint, theStartOfGap - thePoint);
        theEndOfGap   = theEndOfGap-(theStartOfGap-thePoint);
        theStartOfGap = thePoint;
    }
    else
    {   bg_MoveBytes(theStartOfGap,theEndOfGap,thePoint-theEndOfGap);
        theStartOfGap += thePoint-theEndOfGap;
        theEndOfGap    = thePoint;
        thePoint       = theStartOfGap;
    }
}
/****************************************************************/
public void
bg_ExpandGap(unsigned size)
{   if(size > bg_SizeofGap())
    {
        size += SIZEOF_GAP_INCREMENT;
        bg_ExpandBuffer(size);
        bg_MoveBytes(theEndOfGap+size,
                     theEndOfGap, theEndOfBuffer - theEndOfGap);
        theEndOfGap    += size;
        theEndOfBuffer += size;
    }
}
/****************************************************************/
public bool
bg_FindNextNewline(void)
{   while(((thePoint==theStartOfGap) ? (thePoint=theEndOfGap) : (thePoint))
            != theEndOfBuffer)
    {
        if(*thePoint==NEWLINE_CH) return TRUE;
        else thePoint++;
    }
    return FALSE;
}
/****************************************************************/
public void
bg_InsertStringAtPoint(char* string, unsigned size)
{   bg_MoveGapToPoint();
    if(size > bg_SizeofGap())
        bg_ExpandGap(size);
    bg_MoveBytes(theStartOfGap,string,size);
    theStartOfGap += size;
    isBufferChanged = TRUE;
}
/****************************************************************/
public bool
bg_CompareString(char* string, unsigned size)
{   char* x;
    if(thePoint==theStartOfGap) thePoint=theEndOfGap;
    if(    (theStartOfGap > thePoint )
        && (theStartOfGap < thePoint + size )
        && (theStartOfGap != theEndOfGap) )
    {
         if(bg_CompareString(string,theStartOfGap-thePoint)) return TRUE;
         else
         {
             x = thePoint;
             thePoint = theEndOfGap;
             if(bg_CompareString(  string + (theStartOfGap-x)
                                 , size - (theStartOfGap-x)))
                  { thePoint=x;  return TRUE;  }
             else { thePoint=x;  return FALSE; }
         }
    }
    else
    {
        x = thePoint;
        do { if(*(x++) != *(string++)) return TRUE; } while(--size);
        return FALSE;
    }
}
/****************************************************************/
public bool              /*this routine assumes file is already open*/
bg_InsertFile(FILE* file)
{   unsigned amount;
    long file_size = filelength(fileno(file));

    if(file_size==0L)      return TRUE;
    if(file_size > 32767L) return FALSE;

    isBufferChanged = TRUE;

    bg_MoveGapToPoint();
    bg_ExpandGap((int)file_size);

    amount = fread(theStartOfGap, 1, file_size, file);
    if(amount != file_size)
    {
        ProgramError("I/O Error on reading file.");
        return FALSE;
    }
    theStartOfGap += amount;
    return TRUE;
}

In contrast to the buffer-gap approach, some text editors use a doubly linked list of lines to represent a text file. As with the buffer-gap, the lower levels of the program shield the higher levels from dealing with the particulars of the machine representation. As regions of text are deleted or inserted, the lines in the linked list are split or joined as necessary. This is used by many text editors written in Lisp or Lisp-like languages. Although written in C, Jove also uses this approach, as does Bill Joy's vi. A variant was used by the Xerox Bravo editor, which kept the text file in unbroken form at the memory location where it was initially loaded; as changes occurred during an editing session, pointers were inserted into the original text to link in inserted text or to skip over deleted text. Upon finishing the session, the complete file was written out by traversing the resulting tangle of pointers.

A limitation of all the above methods is that they don't work well with files too large to fit into physical memory. Even on systems with hardware-supported virtual memory, performance can still be poor. To deal with the size limitation and the lack of hardware virtual-memory support, many PC-based text editors, such as Mince and Epsilon, implement their own virtual-memory scheme in software. In this approach, a file is broken up into fixed-size pages, whose size is a multiple of the physical disk-block size. A working set of pages--as many as will fit--is loaded into RAM, with the remainder kept in a swap file on disk. Pages on disk are swapped in as needed; pages in RAM are swapped out on a least-recently used basis. This software-only, virtual-memory mechanism exists at a lower level than the rest of the editor routines. As with the buffer-gap and linked-list approaches, the low-level coordinate system is rendered invisible to the higher-level routines that do inserts, deletes, and searches. In this case, the low-level coordinate system requires a tuple (a page number and an offset within the page) to identify a location in the virtual text stream.

All these schemes need to be modified in situations where text isn't represented by 8-bit ASCII encoding. For example, Windows/NT uses 16-bit Unicode internally for all text. Such an adaptation would be straightforward. More complex is where the document consists of a stream of variable-length objects or has a tree-like semantic structure. Programs such as Aldus PageMaker use their own object-oriented database manager to store and retrieve these structures.

Incremental Redisplay Algorithms

Given a bunch of stored text, how is the visual representation of it derived? Again, there's a spread from simple to complex. In the simplest case, where there's no word-wrap and no typographic formatting, the algorithm for generating a screen display is trivial: Clear the screen, find the text position that corresponds to the top of the screen, then step through lines of text in the buffer one at a time, outputting each one until you reach the bottom. This algorithm is the 2-D analogue of the one in EditLine(). Unfortunately, it is basically useless for an interactive program, due to intolerable flashing that results from the constant regeneration of the entire screen display. However, a version of the algorithm can be used for generating the initial screen display at program startup.

The next step taken by many authors is to transform this trivial display-generation routine into one for incremental redisplay. The changes are analogous to going from EditLine() to EditLine2(). Instead of doing the entire task each and every time, only the necessary work is done. A number of data structures come into play to avoid constant recalculations and unnecessary output to the screen.

Before deciding which data structures can be used in the time vs. space trade-off, there is a basic decision to make: How extensible should the editor be? Because the choice of editor is so deeply felt, many implementors try to make their editor as customizable as possible. The decision has profound architectural implications, because intermixing command processing with redisplay routines (as shown in Editline2) often renders the editor customizable only by the original author.

To allow for greater extensibility, many authors choose a bipartite architecture in which the low-level routines, including the redisplay, are implemented in a language such as C or assembler, and the high-level commands are implemented in a simple interpreted extension language that can only access the "inner editor" via an API. The extension language of the "outer editor" is often Lisp-like (in the case of GnuEmacs, Brief, Sine, ZMacs, and others), but can also resemble C (in the case of Epsilon, CBrief, ME), Forth (in the case of Final Word), Basic (Microsoft Word), or even Awk (Sage). The user-visible commands (such as "delete a word" or "move forward one sentence") are implemented in the extension language, and cannot directly access the redisplay structures except where allowed by the API.

In this case, the redisplay algorithm needs to be more intelligent--able both to generate a new image from first principles as well as optimally update the existing image using hints left by low-level buffer-management routines.

In mapping text data to screen display, a common technique is to keep a small array of records that maintain the correspondence between text-stream locations and screen coordinates. This text-to-screen map is updated as changes are made to the text buffer. The buffer-management routines can invalidate part or all of the mapping structure, depending on the extent of changes to the text content. The map gets rebuilt as necessary during the editing session and is discarded upon termination. While this approach works well for plain ASCII text editors, it becomes inadequate for more sophisticated word-processing and electronic-publishing programs.

In addition to the text-to-screen map, another useful data structure is one that optimizes output to the physical screen. In the early days of text editors, character-mode terminals were connected via low-bandwith lines to a host computer (as was the case for both time-shared systems and CP/M microcomputers) and much attention was paid to minimizing data transfer to the screen. For timeshared systems, the situation was complicated by the wide variety of terminals that could be connected to the host computer. Some programs dealt with this by keeping track of two screen images: a virtual image that represented the desired display and a second image that represented the currently visible terminal screen. One program went to heroic lengths to minimize the bytes transferred over the communication line, by using a sophisticated dynamic-programming algorithm that calculated the optimal sequence of commands to update the screen, tuned to the particular brand of terminal device. However, users found it somewhat disconcerting, because portions of the screen would jump up and down seemingly haphazardly as the system moved snippets of text around to piece together a complete display, exploiting the terminal's built-in commands for scrolling, insertion, and deletion. The source code to this module was prefaced with the following comment: "This routine is rather complicated. If you read this code and think you understand it, you are very wrong."

Since then, both PCs and workstations have memory-mapped displays, so communications bandwidth is no longer the problem it was. At the same time, the formatting process has become much more complicated. Instead of monospaced fonts and simple word wrap, formatting has become more like typographic composition. A simple linebreak routine has now become a research project. For example, Donald Knuth, in TeX: The Program (Addison-Wesley, 1986) considers his line-breaking algorithm the "most interesting algorithm of TeX," and devotes 40 pages of his book, as well as a journal article, to describing it. Knuth writes: "The line-breaking problem can be regarded as a special case of the problem of computing the shortest path in an acyclic network."

TeX's typographic capabilities are indeed impressive, and Knuth's boxes-and-glue metaphor for page-level formatting is elegant. Nevertheless, the task of formatting in TeX is eased by the fact that it's a noninteractive batch program. This explains why an interactive electronic-publishing program with similar capabilities is an order of magnitude larger than the 60,000 lines of TeX.

Delving into such complexity is beyond the scope of this article, but some general observations can be made. As the formatting requirements become more complex, deriving the text-to-screen map is no longer cheap. The mapping structures are therefore no longer blithely discarded once they have been computed, but are instead kept around on a permanent basis, stored on disk along with the text content (vastly increasing resource requirements). To reduce the lag between the time a keystroke is entered and the time the screen gets updated, some systems take advantage of multithreading, if it is available. For example, in the OS/2 version of PageMaker, the formatting process is a separate thread from the input process. This dynamic thread architecture is the logical extension of the static module decomposition used by singly threaded implementations.

Conclusion

There are so many degrees of freedom in implementing text editors, it's no wonder that there are so many instances, each one unique. As a concrete illustration of this discussion, the electronic version of the listings includes a multifont, mouse-aware text editor I wrote some time ago for Windows. I hope the context presented here will make the 4000 lines of C code in my implementation more understandable.


Ray is DDJ's senior technical editor. He can be contacted at [email protected].

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.