Dec00: The Windows 2000 Content Index

Big disks and complex pathnames make content indexing a must

Bartosz runs a distributed software company, Reliable Software, and teaches computer graphics at the DigiPen Institute of Technology. He's the author of C++ in Action: From Hello World to Windows Programming (Addison-Wesley, 2000). He can be contacted at [email protected].


It's ironic that under both Windows 98 and Windows NT you can search the Internet faster than your local disk. The Windows Find Files dialog (Figure 1), accessible from the Start menu, offers limited search capabilities and works by physically scanning all files every time you make a query.

If you enter a query into one of the Internet search engines, however, you don't scan all the web sites in the world. That would take forever. What you are using is a content index -- a database filled with data from an enormous number of web sites. This information is organized for very quick access.

Almost 10 years ago, I was given the task of creating a content index for Microsoft. The Internet was still a little-known academic phenomenon -- there was no Yahoo or Alta Vista. Microsoft was interested in adding a local content index to the operating system as part of its "Information at your Fingertips" strategy. This strategy is now almost forgotten, as the Internet has come to dominate the industry. But the need to provide better access to documents stored on local disks is even more pressing than it was a decade ago. Disks are getting exponentially bigger and it's becoming impossible to remember all the names and paths of your documents. The hierarchical file system was a great invention, but it's bursting at the seams.

It took much longer than anticipated, but 10 years later the Microsoft Content Index was finally integrated into the operating system as part of Windows 2000. In this article, I will describe the requirement set, some of the design principles, and some technical details (most of them patented) of the Microsoft Content Index. Technical information is also described in Automatic Text Processing, by Gerard Salton (Addison-Wesley, 1989) and Microsoft patents (see Patent #5,685,003, Method and system for automatically indexing data in a document using a fresh index table; Patent #5,890,147, Scope testing of documents in a search engine using document to folder mapping; Patent #5,926,807, Method and system for effectively representing query results in a limited amount of memory; and Patent #5,832,479, Method for compressing full text indexes with document identifiers and location offsets).

The Requirements Phase

I started designing the content index together with Kyle Peltonen in 1991. (Some preliminary design work had been done by Brian Berkowitz, who then moved to OFS, the late Object-based File System. Among other things, Brian came up with a compression algorithm for indexes, which was later patented in his name; see Patent #5,832,479.) The basic set of requirements was typical of a content index. There had to be a way of scanning disk files to extract words. Those words, together with the information about where they occurred, had to be inserted into an alphabetized index. A user query would then be resolved by looking up words in this index and returning the information about their location. Simple enough.

It's the more detailed requirements that make the design of an indexing service more challenging. For instance, what kind of occurrence information is needed for each word? Is it enough to store the list of documents where the word was found? Or should you also store the list of positions of the word inside each document? We had to opt for the latter if we wanted to be able to resolve more complex queries, such as phrase or proximity queries. If you want to look for all occurrences of the phrase "content index," you must first obtain the list of documents containing the word "content" and another list of documents containing the word "index." You then logically "and" these two lists to obtain the list of documents containing both "content" and "index." But now you have to prune this list by comparing the positions of these two words. Every time the position of "content" is one less than the position of "index," you've found the phrase "content index"; see Figure 2. As you can see, the position of a word is best encoded as its ordinal number within the document, rather than its byte offset from the beginning of the file.
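
To make the position-matching step concrete, here is a minimal sketch (in C++, with names of my own invention, not the product's) of how the sorted position lists of "content" and "index" within a single document can be intersected with an offset of one to find the phrase:

```cpp
#include <cstdint>
#include <vector>

// Given the sorted word positions of "content" and "index" within one
// document, return the positions where "content" immediately precedes
// "index" -- that is, the occurrences of the phrase "content index".
std::vector<uint32_t> PhraseMatches(const std::vector<uint32_t>& contentPos,
                                    const std::vector<uint32_t>& indexPos)
{
    std::vector<uint32_t> matches;
    auto a = contentPos.begin();
    auto b = indexPos.begin();
    while (a != contentPos.end() && b != indexPos.end())
    {
        if (*a + 1 == *b)           // adjacent: phrase found
        {
            matches.push_back(*a);
            ++a;
            ++b;
        }
        else if (*a + 1 < *b)       // "content" lags behind; advance it
        {
            ++a;
        }
        else                        // "index" lags behind; advance it
        {
            ++b;
        }
    }
    return matches;
}
```

The document-level "and" described above narrows the candidate set first; this per-document pass then confirms or rejects each candidate.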

Storing offsets of words makes the index larger, and we had another requirement to fulfill. The size of the index shouldn't be more than 20 percent of the total size of indexed files and, except for extreme situations, should be on the order of 10 percent. That means, for instance, that if you have 1 GB of files on your disk, the content index will most likely fit in a 100-MB file. That's why we needed a really good, custom-made compression scheme. By the way, as we confirmed later, index compression actually speeds up queries. It might sound paradoxical, because decompression during queries requires additional processor cycles. However, it turns out that this time is made up by having to read fewer disk sectors to obtain the same amount of information.
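
The article doesn't spell out the compression scheme (it is covered by the patents cited above), but a standard way to shrink posting lists, shown here purely for illustration and not necessarily what the product does, is to delta-encode the ascending document numbers and word positions and store each delta in as few bytes as it needs:

```cpp
#include <cstdint>
#include <vector>

// Variable-byte encoding of one value: emit 7 bits at a time, with the high
// bit of each byte meaning "more bytes follow". Small values take one byte.
void AppendVarByte(std::vector<uint8_t>& out, uint32_t value)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<uint8_t>(value) | 0x80);
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// Delta-encode an ascending list (document numbers or word positions):
// consecutive entries are close together, so the deltas are small and the
// variable-byte codes are short.
std::vector<uint8_t> CompressPostings(const std::vector<uint32_t>& ascending)
{
    std::vector<uint8_t> out;
    uint32_t previous = 0;
    for (uint32_t value : ascending)
    {
        AppendVarByte(out, value - previous);
        previous = value;
    }
    return out;
}
```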

That brings us to another requirement -- speed. We had to make sure that all our data structures were organized in such a way that query resolution would hit as few disk sectors as possible. We also had to design some of the larger in-memory data structures in such a way as not to cause disk thrashing when they didn't fit into physical memory. Since we wanted to allow multiple queries at the same time (especially useful when you put the index on a network server), we decided to create a separate thread of execution for each query. That, of course, led to a whole slew of synchronization issues.

If this weren't enough, we also decided that our content index must be kept up-to-date at all times. The actual requirement was that after some file had been added, edited, or deleted by the user, the change had to be quickly reflected in the content index. In practice, this meant that if you issued a query a second or two after editing a document, the answer should correspond to the new contents of the file. This might seem like a pretty insane requirement and, to my knowledge, nobody else had implemented it. On the other hand, this is the kind of feature that average users would expect, especially since the existing Find Files box always guaranteed reasonably up-to-date results. We were not only forced to have our "robot thread" constantly on the alert to immediately (re-)index files every time they were saved, but also to be able to override previous information that had been entered into the content index. In particular, after a file has been deleted, its contents should somehow disappear from the content index.
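
A minimal sketch of the kind of watching the "robot thread" has to do is shown below; it uses the Win32 change-notification API, the directory path is made up, and the real service of course has to determine exactly which documents changed and queue them for re-indexing:

```cpp
#include <windows.h>

// Simplified watcher loop. FindFirstChangeNotification only signals *that*
// something changed under the directory; a real indexing service must work
// out *which* documents changed and hand them to the indexing pipeline.
void WatchAndReindex()
{
    HANDLE change = FindFirstChangeNotificationW(
        L"C:\\Documents",                       // hypothetical indexed scope
        TRUE,                                   // watch subdirectories too
        FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_LAST_WRITE);
    if (change == INVALID_HANDLE_VALUE)
        return;

    for (;;)
    {
        if (WaitForSingleObject(change, INFINITE) != WAIT_OBJECT_0)
            break;
        // ...determine the affected files and queue them for re-indexing...
        if (!FindNextChangeNotification(change))
            break;
    }
    FindCloseChangeNotification(change);
}
```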

Files that are present on people's disks are rarely pure-text (Notepad-compatible) documents. You might also expect Word documents, Excel spreadsheets, Lotus Notes, and the like. Each of them has a specific internal format and encoding. To index various types of documents, we had to be able to use pluggable filters -- programs that understand specific formats and translate them into pure text. To enable other software vendors to write their own application-specific filters, Microsoft has defined and made available the "filter interface" (http://msdn.microsoft.com/library/psdk/indexsrv/ixrefint_9sfm.htm).
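
The real mechanism is the COM-based filter interface documented at the URL above; the following is only a stripped-down C++ illustration of the plug-in idea, one class per file format, each emitting plain text for the indexing engine to consume:

```cpp
#include <fstream>
#include <string>

// Illustration only -- not the actual filter interface. Each file format gets
// its own implementation that hides the native encoding and emits plain text.
class ITextFilter
{
public:
    virtual ~ITextFilter() {}
    // Returns the next chunk of extracted text; an empty string means "done".
    virtual std::string GetNextChunk() = 0;
};

// Trivial filter for plain-text files. A Word or Excel filter would parse the
// native binary format and emit only the human-readable text.
class PlainTextFilter : public ITextFilter
{
public:
    explicit PlainTextFilter(const std::string& path) : in_(path) {}

    std::string GetNextChunk() override
    {
        std::string line;
        return std::getline(in_, line) ? line : std::string();
    }

private:
    std::ifstream in_;
};
```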

But Windows is used all over the world, so the content index must be able to work not only with the English language, but also with Russian, Chinese, Japanese, and so on. Russian is easy, as long as we store all words in Unicode. To do case-insensitive queries, we decided to normalize the case of all the letters in every word. For this to work in Russian or French, we had to use a language-dependent normalizer. Japanese, on the other hand, presented a whole new challenge. It turns out that the Japanese don't put spaces between words. A big part of indexing is being able to split text into words. So we had to make our word breaker language dependent. Finally, to save space, we made the decision not to index the 100 most common words, such as the, of, and the like. This so-called "noise list" had to be customized for each language separately.
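
As a rough illustration of the word breaker and noise list, here is a naive, English-only sketch (the noise words shown are examples, not the actual list); a real word breaker and case normalizer are language dependent, and a Japanese breaker cannot rely on spaces at all:

```cpp
#include <cctype>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Naive word breaker: split on whitespace, normalize case, drop noise words.
std::vector<std::string> BreakWords(const std::string& text)
{
    static const std::set<std::string> noise = { "the", "of", "and", "a", "to" };
    std::vector<std::string> words;
    std::istringstream in(text);
    std::string word;
    while (in >> word)
    {
        for (char& c : word)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        if (noise.find(word) == noise.end())
            words.push_back(word);
    }
    return words;
}
```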

The list of requirements goes on and on, and I don't remember half of them. Actually, those requirements that were known from the start were reasonably easy to implement. It's the unexpected external changes that caused most of our problems and resulted in such a prolonged development cycle.

The most important requirement was that the content index had to be extremely reliable and robust. It seems like an obvious thing to ask of any kind of software, especially if it has the functionality of a database. But apparently it's not so obvious -- witness the frequent failures of the Windows 98 registry. We knew from the beginning that our content index must be able to survive any type of external failure that doesn't corrupt the file system, including sudden power failures. The index should remain uncorrupted, and information should not be lost. In particular, if a file has been edited and the power failure interrupts the updating of the index, that file should be scanned again after the computer has been restarted. Building that degree of robustness into a system requires religious use of transactions at all levels.

The Design Phase

One of the first design problems we faced was how to make quick updates after (re-) scanning a file without having to rewrite the whole index. Because an index is a compressed, alphabetized list of words (with attached occurrence information), adding new data about a file would almost certainly result in a multitude of updates scattered all over the index. The new or edited file will most likely contain words from all parts of the alphabet. Also, such updates would have to temporarily take the content index out of circulation. All concurrent queries would have to be suspended and then restarted again. This doesn't look like a sensible approach.

A better approach is to allow multiple concurrent indexes, all of them capable of taking part in the query resolution process. A list of query hits from one index could be combined with the list of hits from another index, and so on. How does this help us with our problem? When a file (or a group of files) is saved, we can quickly scan it and produce a miniature index. Once this index is made available to queries, hits from it will be incorporated into their results. Such a quickie index is called a "wordlist" and it's kept in memory with virtually no compression.
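
In outline, a wordlist is nothing more than an in-memory map from word to occurrences, and query resolution simply combines the hits from every live index, wordlists included. The data shapes below are illustrative, not the product's internals:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One occurrence of a word: which document, and the word's ordinal position
// within that document.
struct Occurrence
{
    uint32_t document;
    uint32_t position;
};

// A "wordlist": a small, uncompressed, in-memory index built from a handful
// of freshly (re-)scanned files.
using WordList = std::map<std::string, std::vector<Occurrence>>;

// Resolve one word against every live index and combine the hits.
std::vector<Occurrence> Lookup(const std::string& word,
                               const std::vector<const WordList*>& liveIndexes)
{
    std::vector<Occurrence> hits;
    for (const WordList* index : liveIndexes)
    {
        auto it = index->find(word);
        if (it != index->end())
            hits.insert(hits.end(), it->second.begin(), it->second.end());
    }
    return hits;
}
```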

So far, so good, but now we face another problem. We know how to add new data to the content index, but how can we remove old data? After a file has been edited, we want the queries to reflect only the new version of the file, not the old one. If the file contained a certain word, and now it doesn't, we don't want to show it as a positive hit to the user. File deletion poses a similar problem -- we don't want users to see hits from a non-existing file.

This turned out to be one of those problems with a solution that resulted in a software patent. The trick is to keep a "fresh list" of entries and store the information about which of the many indexes contains the freshest data for a given file. When we get a hit from a particular index, we check the document with the hit against our fresh table. If that index turns out not to be the freshest one for this document, we discard the hit. For a deleted document, we create a dummy fresh-list entry that doesn't match any index.
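
Here is a sketch of the fresh test, with invented names: each document maps to the identifier of the index that holds its freshest data, and a hit is kept only if it comes from that index. A deleted document maps to a sentinel that matches no real index, so all of its old hits are discarded:

```cpp
#include <cstdint>
#include <map>

// Sentinel "index identifier" assigned to deleted documents; it matches no
// real index, so every hit against such a document is discarded.
const uint32_t kDeletedSentinel = 0xFFFFFFFF;

// Fresh table: document identifier -> identifier of the index holding the
// freshest data for that document.
using FreshTable = std::map<uint32_t, uint32_t>;

bool KeepHit(const FreshTable& fresh, uint32_t document, uint32_t hitIndexId)
{
    auto it = fresh.find(document);
    if (it == fresh.end())
        return true;                    // document was never re-indexed or deleted
    return it->second == hitIndexId;    // keep hits from the freshest index only
}
```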

Wordlists are quick to create and quick to query because they are kept in memory. On the other hand, because they are in memory, they don't survive a computer shutdown. So, at some point, we have to convert wordlists into persistent, on-disk indexes. We usually wait until we have a bunch of wordlists and merge them into a more permanent (compressed) index. In the same way, when we have too many persistent indexes, we merge them into a master index. If you don't make any changes to your documents for about a day, all your indexes will end up in a single, large master index. What makes things interesting is that all this activity must be performed in the background, without interfering with the ongoing queries.
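
Because every index is already sorted by word, several of them can be combined in a single sequential pass, much like merging sorted files. The sketch below uses a std::map as a stand-in for the sorted (and, on disk, compressed) structure; it only shows the combining of postings per word, not the background scheduling or the compression:

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// word -> occurrence data (document numbers and positions, flattened here).
using Index = std::map<std::string, std::vector<uint32_t>>;

// Merge several sorted indexes into one larger index. std::map keeps the
// result sorted by word, mirroring the sequential merge of on-disk indexes.
Index Merge(const std::vector<const Index*>& sources)
{
    Index merged;
    for (const Index* source : sources)
    {
        for (const auto& entry : *source)
        {
            std::vector<uint32_t>& target = merged[entry.first];
            target.insert(target.end(), entry.second.begin(), entry.second.end());
        }
    }
    return merged;
}
```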

A big part of the content index is related to resource management. Every query increments the reference count of every index it uses, including wordlists. As long as an index has a nonzero reference count, it can't be deleted. In fact, an index can only be modified during its creation. At all other times, the access to indexes is read-only, up to the moment when they are deleted. So, after merging wordlists or other indexes into one larger index, the source indexes are marked for deletion and no new queries may use them. However, as long as some older queries still hold a reference count on them (they are still retrieving hits from them), they cannot be physically removed. Once all the ongoing queries die out, the old indexes disappear.
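
A modern C++ sketch of this lifetime scheme might use std::shared_ptr as the reference count (the original long predates standard smart pointers); a query snapshots the set of live indexes, and indexes retired by a merge linger until the last in-flight query lets go of them:

```cpp
#include <algorithm>
#include <memory>
#include <mutex>
#include <vector>

struct Index { /* postings, storage handles, ... */ };

class IndexSet
{
public:
    // A query takes a snapshot of the live indexes and holds the references
    // for as long as it keeps retrieving hits.
    std::vector<std::shared_ptr<Index>> Snapshot()
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return live_;
    }

    // After a merge, the result replaces its sources. New queries see only
    // the result; each source is destroyed when the last older query that
    // still references it finishes.
    void ReplaceWithMerged(const std::vector<std::shared_ptr<Index>>& sources,
                           std::shared_ptr<Index> merged)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        for (const std::shared_ptr<Index>& source : sources)
            live_.erase(std::remove(live_.begin(), live_.end(), source),
                        live_.end());
        live_.push_back(std::move(merged));
    }

private:
    std::mutex mutex_;
    std::vector<std::shared_ptr<Index>> live_;
};
```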

This is all true, except for the largest index -- the master. When there's time for a master merge (that is, when all indexes are to be merged together into the master), we must do something more involved. Remember that the master index might be as large as 10 percent of your disk. Were we to grow its duplicate during the master merge, the demand on disk space would double. That was definitely not acceptable. We had to invent an algorithm that would merge the master index in place. Moreover, we couldn't afford to stop all querying during the master merge. So a partially merged master index should be as queriable as ever.

For instance, suppose that the master index has already been merged with 10 other indexes in the span ranging from A through F. At this point, the user may make a query for the phrase "content index." The first word of this query, content, will be resolved using the new part of the master index. However, the data for the second word, index, will be retrieved from the old part of the master index, together with the 10 other indexes. This data must then be combined seamlessly into an absolutely correct result of the query.
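
One way to picture the routing (a sketch with invented names, not the actual implementation) is a split key that tracks how far the in-place merge has progressed; words at or below it are served from the new part of the master alone, and words above it from the old master plus the indexes still being merged:

```cpp
#include <string>
#include <vector>

struct IndexHandle { /* ... */ };

// Snapshot of an in-progress master merge.
struct MergeState
{
    std::string splitKey;                     // last word the merge has written
    IndexHandle* newMaster;                   // partially built master index
    IndexHandle* oldMaster;                   // original master index
    std::vector<IndexHandle*> beingMerged;    // the other source indexes
};

// Decide which indexes must be consulted to resolve one word of a query.
std::vector<IndexHandle*> SourcesFor(const std::string& word,
                                     const MergeState& merge)
{
    if (word <= merge.splitKey)               // e.g. "content" when merged through F
        return { merge.newMaster };

    std::vector<IndexHandle*> sources = { merge.oldMaster };   // e.g. "index"
    sources.insert(sources.end(),
                   merge.beingMerged.begin(), merge.beingMerged.end());
    return sources;
}
```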

The Implementation Phase

From the beginning of the project, we had no doubt that we'd be programming in C++, employing the more advanced features of the language whenever possible. One of them was the use of C++ exceptions. There was, however, one little hitch -- the Microsoft compiler didn't support this feature. We could use the C-style structured exceptions, but we had no support for stack unwinding or construction rollback. We bravely decided to create an elaborate system of macros to implement the C++ exception model. Some of that technology percolated to the languages group, and for some time, it was an important part of MFC.

The use of exceptions went hand-in-hand with the development of the resource-management methodology. (I started working on resource management around 1990 and described some of the techniques in the articles "Resource Management in C++," Journal of Object-Oriented Programming, March/April 1997, and "Strong Pointers and Resource Management in C++," C++ Report, September 1998 and February 1999.) The content index was the only component of Cairo that had no memory leaks throughout the development. Resource management also helped us ensure the robustness of thread synchronization and deadlock avoidance.

It is standard procedure at Microsoft to try to reuse ready-made components to build new applications. One such component that seemed like a perfect store for our indexes was Microsoft's own database engine, code-named "JET." (This was an example of cross pollination between the systems division, the content index, and the applications division, JET. Another one, which actually panned out, was the application of our technology in the Microsoft Office Find Fast engine.) It soon became obvious that there was a big "impedance mismatch" between the content index and the JET database. In particular, the content index was supposed to be small, fast, and reliable. At that time, JET was anything but that. We managed to convince our managers to dump JET and build our own storage layer from scratch. This turned out to be a very good decision.

Another reason for getting away from JET was the misbegotten decision to move the content index to the kernel of the operating system and tie it exclusively to the new OFS filesystem. I must confess that as soon as I became the development lead for the content index, I fought this decision as much as I could. We forked the development into two streams -- one was kernel-mode, OFS-based, and the other was user-mode, filesystem independent. How we managed to defend this subterfuge from the management is a separate story. Suffice it to say that this decision saved the life of the content index when the big ax fell on Cairo, and on OFS in particular.

The Selling Phase

Most Cairo developers had no idea what was going on behind the scenes. In December 1993, with great fanfare, Cairo was announced at the Win32 Professional Developers' Conference as the next Windows NT (see "Microsoft's New Object File System To Be Enterprisewide Foundation Of Cairo," by Eamonn Sullivan and Larry J. Seltzer, PC Week, December 20, 1993). That was the official declaration of war between the two groups inside Microsoft. Windows NT was the brainchild of Dave Cutler. Jim Allchin had Cairo under his command and was making big inroads into the NT kernel. OFS was supposed to make NTFS obsolete. Cairo's distributed filesystem was eating at the networking model of NT. The content index was invading the kernel mode. The showdown was imminent.

Suddenly, Allchin made a masterful gambit. He sacrificed Cairo by giving complete control over it to Cutler. In exchange, he got control of the operating-systems group, thus becoming Cutler's boss. Cutler didn't waste any time. He started dismantling one Cairo team after another. Gary Kimura, one of the architects of NTFS, became the manager of OFS. Soon afterwards OFS was canceled, even though it was very near completion.

Axing the content index wasn't that easy, though. First of all, we had a user-mode version that was independent of the filesystem. Second, we created a content index of all NT sources, and it quickly became indispensable in the development of NT. You have to keep in mind that kernel developers had no access to higher level tools, such as code browsers or even visual debuggers. All they had was command-line compilers, makefiles, and kernel debuggers. With the content index, they were suddenly able to find all the uses of a variable or a function, jump directly to a definition of a structure, and so on. And finally, from the technical point of view (and this is not just my opinion), the content index was amazingly well written, reliable, and robust. The problem was that nobody knew what to do with this piece of technology.

After I left Microsoft in 1995, the content index group moved away from operating systems and joined the Internet group. Soon afterwards the content index, renamed "Index Server," was shipped as part of the IIS (Internet Information Server). It was used as a search engine for indexing web sites. The whole Microsoft web site has been indexed using our content index. (You might have noticed that it's possible to formulate quite complex queries -- Boolean, proximity, vector, and so on -- when searching the Microsoft web site.)

Finally, after five years of banishment from NT, the content index has been introduced into the distribution of the Windows 2000 operating system. Its functionality is still virtually identical to what was available five years ago. Unfortunately, so is the user interface. You might have Windows 2000 running on your machine and have no idea that the content index is there. If you look closely at the Find Files dialog, under Search Options, you will notice a link to Indexing Service (see Figure 3). If you follow this link, you'll be asked if you want to enable the indexing service (do it!).

Even after enabling indexing and letting the content index scan your files, it is not at all obvious how to perform queries. You'll have to look hard to find any help on that topic. Figure 4 is an excerpt from the Windows 2000 help file that explains how to find the query form from inside the Control Panel. It could hardly have been made more obscure.

The query form is constructed as a reasonably simple dynamic HTML page that can access the content index through a Basic script. Figure 5 shows the form with the results of the search for "content index." Once the content index is up and running, this query takes virtually no time.

What makes the content index even less useful for developers is that Microsoft has not provided a filter for C or C++ files. So if you don't specifically enable the indexing of files with unknown extensions (and you should), you won't see your sources among query results.

Conclusion

Despite the horrible user interface, once you know how to set up your catalogs and query forms, you'll find the content index truly indispensable. You'll be able to instantly find files with contents you can only vaguely describe. Or, alternatively, you'll be able to issue very precise queries that pinpoint a particular function definition in your code or a declaration of some data type in windows headers. (Quite recently, Oracle announced its own revolutionary substitute for a filesystem -- a database of files called Oracle iFS, Internet File System. One of the main advantages of storing files in a database is the possibility of finding them by their contents. What I find ironic is that the functionality considered revolutionary by Oracle has been so casually swept under the rug in Windows 2000.)

Since the content index was designed to be extensible, programmers are free to enhance its capabilities by writing their own filters for specific file types. A custom filter may not only output the text of the file, but also create a set of properties, or database fields that can be queried independently or together with the contents. The content index may be configured and accessed from inside other applications to allow for fast searches. Finally, using dynamic HTML, you can create custom query forms for particular tasks.

DDJ

