Chris Newcombe is a senior engineer with 20 years of experience. He designed and wrote a large part of the back-end server architecture for Steam.
This article describes some of the lessons that my coworkers and I learned while building Steam, a broadband platform for direct software delivery and content management, which is owned by Valve Software. Poor system or component architecture can cripple even the best database engine. We chose Windows 2000 Server as the back-end OS, partly because of our development team's experience with that platform, and partly because of I/O completion ports. This Win32 API provides a highly efficient, event-based asynchronous I/O mechanism, driving a small pool of worker threads instead of a thread-per-connection. Such a design can serve tens of thousands of connections with very little context switching. High-latency operations such as database access and cryptography do mean that you need a larger worker pool than for just a simple web server, but it is still an extremely efficient concurrency scheme.
We recognized from the start that Steam [1] would need a wide variety of server functionality. For that reason, we first implemented a generic server framework that let us quickly deploy new server behaviors in the form of pluggable connection-handler state machines. This architecture has resulted in huge wins in system flexibility and allowed rapid development by our small server engineering team (two engineers). Steam is built on top of Berkeley DB, a high-performance transactional database engine from Sleepycat Software [2]. In this article, I describe the main engineering lessons that we learned while building Steam.
To begin with, Berkeley DB imposes no structure on stored data, so the application is free to do whatever it likes. For instance, your application can be SQL-like, where each database record is a set of typed fields (a "row") and each field belongs to a conceptual "column." Column types may be hardwired into the application or described in another table (metadata). Alternatively, the application can use object-oriented persistence, in which each database record is a serialized language object, perhaps formatted according to ASN.1, XDR, YAML, and so on. A third option is to use native XML, where each database record is an XML document. Finally, a fourth option is an object-oriented/data hybrid model, where each database record is a serialized data container. The corresponding language objects have no functionality other than data access/navigation and serialization.
Option 1 results in a lot of SQL-like query patterns, which usually must access multiple tables (joins and subqueries). These typically require a lot of indexes to be even remotely efficient. In fact, for any real application, you would probably need an automatic index creation and maintenance system. You'd need a scheme to impose typed columns on Berkeley DB's raw binary records, and you would probably need the capability to add/drop columns. By that point, you would probably be better off using a real SQL layer, as reinventing SQL is not an efficient use of your time. (If you're bent on using SQL, take a look at [3].)
Persistence for real C++ objects (Option 2) is decidedly nontrivial. The code for the object's behavior cannot be stored, so it doesn't offer true OO persistence anyway. Unless you are using a language with a class-reflection mechanism, you'll need serialization methods or a source-code preprocessor to generate the serialization code. You also need to decide whether to serialize all data members of live objects, even members that are purely transient at runtime.
You'll also have to decide how to deal with interobject pointers and references, a nasty situation at best. Do you serialize all interconnected objects at once or just individual objects? If you choose the former, you'll probably have to do a topological sort on the network so you can allocate serialized IDs to use as references. You also have to decide whether and how to support cycles in the dependency graph. You must then work out how to index it all. If you serialize single objects, you'll need to maintain some secondary directory system for objects to find each other. There are various established solutions to these problems, but they're often restrictive, complex, inefficient, or all three. Also, the object definitions in the source code can and do change frequently, so there is a high-maintenance serialization/mapping layer to deal with. You'll also need intelligent versioning and upgrade functions to handle old objects. In general, this technique causes the database to become closely coupled to the code. Since one major reason for SQL's great success is the fact that it breaks code/database coupling almost completely, reversing that trend might not be a great idea.
Option 3 (Native XML) has been growing in popularity due to the push to make everything on earth a web service, and the ubiquity of XML in gluing the planet back together again. As luck would have it, Sleepycat developed a native-XML layer on top of Berkeley DB. This tool provides an interface to store and retrieve arbitrary XML documents and perform generalized database queries using the standard XPath and XQuery languages.
The downsides include: significantly larger database records (XML is extremely verbose); the overhead of parsing a lot of XML in both the database XML layer and the application; the overhead of maintaining indexes on many parts of the stored documents (fortunately, the application may declare only the indexes it thinks it needs); and the overhead of parsing, compiling, and executing XPath queries instead of using direct record-access APIs. A cost-based query optimizer is included to reduce the impact. You can also cache frequently used queries in parsed and compiled form, so you pay the overhead only once.
The upside is a large degree of flexibility, especially for broad or poorly defined data models. Another major advantage is rapid prototyping; we really wish this feature had been available when we were prototyping Steam.
We settled on the fourth option, the OO/data hybrid, with good results. As used in Steam, OO/data hybrid means defining C++ objects that are purely data containers (for instance, Steam has a CAccountRecord class). These objects have no true OO behavior or responsibilities beyond serialization, integrity checks, data-access methods, and optional XML export/import (which is just a different form of serialization, of course). The application often declares real behavior objects that somewhat parallel these record objects (Steam also has a CAccount class), but the data containers are much simpler and change much less frequently than the behavior objects.
There is no need to handle interobject pointers and references because you are not attempting to persist live objects. You still get to design a data model that is substantially decoupled from the code that uses it, and you can decide explicitly how different entities refer to each other (usually via numeric indexes or strings). You can then build exactly the set of database indexes that you need.
This option constitutes a middle ground between OO-persistence and complete code/data separation. It works well for us. It does result in some tedious work maintaining the integrity-check methods and XML parsers, but that is almost all boilerplate code that could be generated from some fairly simple data-description language, which we hope to have time to write one day.
We implemented a generic base class, called CMultiFieldBlob, for these data containers. (BLOB is short for "Binary Large Object.") This class manages a single chunk of malloced data as a set of variable-length, arbitrary (typeless) key-record pairs. It maintains a fast index of these fields and handles memory management when fields grow and shrink (some spare capacity is kept at the end of the malloced image to reduce the frequency of realloc when data is inserted). The application's various Record classes simply derive from CMultiFieldBlob and add enums or constant strings for field names, plus the appropriate data-access methods.
A critical feature of CMultiFieldBlob is that it can be composed hierarchically: any field in a CMultiFieldBlob can itself be another serialized CMultiFieldBlob image. All child blobs live inside the top-level blob's malloced image, so no extra heap allocations or data copying are required (very important for a server, to reduce heap contention). The resulting arbitrary tree can be navigated quickly and easily by "opening" CMultiFieldBlob objects on fields. If a field is inserted or modified in a way that causes images for other "open" blobs to be moved in memory, then those "open" blob objects are automatically notified. Add some utility methods to compress or encrypt levels of the tree, and you have a versatile container for arbitrary data.
The result is blindingly fast because there is almost no serialization or deserialization processing to do; you can shove a copy of the malloced data into Berkeley DB any time you like. When retrieving data, Berkeley DB hands you a malloced record that is immediately useable as a CMultiFieldBlob (record) image. As soon as you've indexed the top level, you can start reading and manipulating it. You don't index the full tree of fields, just the levels you are working on at any given moment.
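To make the idea concrete, here is a minimal, flat sketch of such a container. This is not Valve's actual CMultiFieldBlob: nesting, spare capacity, and the open-blob notification mechanism are all omitted, and the class name, field layout, and methods below are invented for illustration. It does show the key property that the serialized form and the in-memory image are one and the same buffer.

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Flat, simplified sketch of a CMultiFieldBlob-style container. All fields
// live in one contiguous byte image ([keyLen][key][valLen][val]...), so the
// whole record can be handed to the database, or adopted back from it,
// without per-field serialization.
class MultiFieldBlob {
public:
    MultiFieldBlob() = default;

    // Adopt a serialized image (e.g., a record read back from the database).
    explicit MultiFieldBlob(std::vector<char> image) : image_(std::move(image)) {
        Reindex();
    }

    void SetField(const std::string& key, const std::string& value) {
        RemoveField(key);  // naive policy: drop old copy, append new, reindex
        AppendU32(static_cast<std::uint32_t>(key.size()));
        image_.insert(image_.end(), key.begin(), key.end());
        AppendU32(static_cast<std::uint32_t>(value.size()));
        image_.insert(image_.end(), value.begin(), value.end());
        Reindex();
    }

    std::string GetField(const std::string& key) const {
        auto it = index_.find(key);
        if (it == index_.end()) throw std::runtime_error("no such field: " + key);
        return std::string(image_.data() + it->second.first, it->second.second);
    }

    bool HasField(const std::string& key) const { return index_.count(key) != 0; }

    // The serialized form *is* the in-memory image: zero-cost "serialization."
    const std::vector<char>& Image() const { return image_; }

private:
    std::uint32_t ReadU32(std::size_t pos) const {
        std::uint32_t v;
        std::memcpy(&v, image_.data() + pos, sizeof v);
        return v;
    }
    void AppendU32(std::uint32_t v) {
        const char* p = reinterpret_cast<const char*>(&v);
        image_.insert(image_.end(), p, p + sizeof v);
    }
    void RemoveField(const std::string& key) {
        std::size_t pos = 0;
        while (pos + 8 <= image_.size()) {
            std::uint32_t klen = ReadU32(pos);
            std::string k(image_.data() + pos + 4, klen);
            std::uint32_t vlen = ReadU32(pos + 4 + klen);
            std::size_t total = 8 + klen + vlen;
            if (k == key) {
                image_.erase(image_.begin() + pos, image_.begin() + pos + total);
                return;
            }
            pos += total;
        }
    }
    void Reindex() {  // rebuild the fast (offset, length) index of all values
        index_.clear();
        std::size_t pos = 0;
        while (pos + 8 <= image_.size()) {
            std::uint32_t klen = ReadU32(pos);
            std::string k(image_.data() + pos + 4, klen);
            std::uint32_t vlen = ReadU32(pos + 4 + klen);
            index_[k] = std::make_pair(pos + 8 + klen, vlen);
            pos += 8 + klen + vlen;
        }
    }

    std::vector<char> image_;
    std::map<std::string, std::pair<std::size_t, std::uint32_t>> index_;
};
```

A record round-trips through "the database" simply by copying `Image()` out and constructing a new blob from the bytes that come back.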
For important data, you can implement a ValidateFields() method that is called before serialization and after extraction from the database. This validation method recurses down through all fields checking field names, sizes, and value ranges. It provides a fast integrity check at the gateway to the database. You can also do an automatic version upgrade of old objects (adding missing fields with default values) at the same time. The only missing feature is cross-database reference checking (SQL's integrity constraints). The ValidateFields() method could perform database accesses to do this, but at major performance cost and with potential cyclic recursion. Instead, Steam detects and handles cross-database integrity violations in the application logic. However, these should never happen (and we've had none so far) because of Berkeley DB's transactional guarantees.
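The validate-at-the-gateway pattern can be sketched as follows. The record type, its invariants, the version-upgrade rule, and the std::map "store" are all stand-ins invented for this example; only the shape of the idea (check before every put, check and upgrade after every get) follows the article.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical record with its own integrity check and version upgrade.
struct AccountRecord {
    std::string name;
    int schemaVersion = 1;
    int credits = 0;

    // Throws if the record violates its invariants; also upgrades old
    // versions by filling in defaults (here: version 0 lacked 'credits').
    void ValidateFields() {
        if (schemaVersion == 0) { credits = 0; schemaVersion = 1; }
        if (name.empty()) throw std::runtime_error("empty account name");
        if (credits < 0) throw std::runtime_error("negative credits");
    }
};

// Mock store that runs the check at both gateways to the "database."
class AccountStore {
public:
    void Put(const std::string& key, AccountRecord rec) {
        rec.ValidateFields();            // check before serialization
        rows_[key] = rec;
    }
    AccountRecord Get(const std::string& key) {
        AccountRecord rec = rows_.at(key);
        rec.ValidateFields();            // check (and upgrade) after extraction
        return rec;
    }
private:
    std::map<std::string, AccountRecord> rows_;
};
```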
The absence of a persistence/mapping layer makes transactions extremely efficient. You can read, arbitrarily manipulate, and write an entire CAccountRecord with just one database get/put operation and a single page lock. Chunking data into units like this therefore minimizes page-lock contention and deadlocks in the database.
Steam chooses to split account information into separate CAccountRecord and CAccountUsageRecord objects, each of which lives in a separate physical database file. The former holds important data that rarely changes, such as billing information, passwords, and purchased subscriptions. The latter stores transient history used to prevent fraud and assist technical support in diagnosing service problems. Many operations (login, for example) only modify CAccountUsageRecord, so we still minimize deadlocks while keeping the important CAccountRecord database relatively stable. Another advantage is that objects of type CAccountUsageRecord are typically much larger than objects of type CAccountRecord because the former accumulate history information. If necessary, the CAccountUsageRecord information can be purged to reclaim disk space without affecting users.
The hierarchical nature of record objects is a natural mapping structure for XML export/import to other systems, and to our regression test framework. The fact that XML is human readable is also a big help in the early days before you have written all your tools. The support for XML in many scripting languages, code libraries, and off-the-shelf tools makes it a good choice for passing around chunks of structured data. However, we only use it as a data-exchange format. Inside the database, everything is serialized as tight, raw binary.
(If you are using XML in C++, check out Andy Dent's expatpp wrapper around James Clark's famously fast expat parser [4]. The best feature of this wrapper is the expatppNesting class. This feature lets you define completely separate parser classes for child records and then invoke those parsers to handle the appropriate nested elements of large documents. We have found this feature to be essential to managing complex record hierarchies. In fact, we fixed a minor bug and added some optimizations to expatpp to make this feature even better. You can get our version of the code from [5].)
Indexes
Indexes are simply additional databases that happen to map some useful key to primary keys of another table (an e-mail address mapped to an account name, for instance). You add code to update any index tables whenever you insert or remove elements from the main table, as part of the same transaction to ensure that consistency is not violated. Berkeley DB includes a Db::associate API to establish such relationships and automatically update the index when you modify the primary database. However, sometimes it isn't possible to use this API. For example, Steam incorporates a many-to-one mapping operation from e-mail addresses to account records that Db::associate cannot handle.
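The index-maintenance idea can be mocked as below. The two std::map tables, the method names, and the e-mail-to-accounts schema are all illustrative rather than Steam's actual layout; in real code both writes would be wrapped in one Berkeley DB transaction so a crash between them could never leave the index inconsistent with the primary table.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Mock of an index kept consistent with its primary table by updating both
// in the same logical operation. The index is "just another database" that
// maps a useful key (e-mail) back to primary keys (account names).
struct AccountTables {
    std::map<std::string, std::string> primary;            // account -> email
    std::map<std::string, std::set<std::string>> emailIdx; // email -> accounts

    void InsertAccount(const std::string& account, const std::string& email) {
        if (primary.count(account)) throw std::runtime_error("account exists");
        primary[account] = email;        // primary first...
        emailIdx[email].insert(account); // ...then the index, same transaction
    }

    void RemoveAccount(const std::string& account) {
        auto it = primary.find(account);
        if (it == primary.end()) return;
        emailIdx[it->second].erase(account);
        primary.erase(it);
    }
};
```

A lookup by e-mail then reads the index table to find primary keys, and fetches the records from the primary table.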
To minimize deadlocks, you should try to access databases in the same order (primary, index1, index2, and so on) every time. Berkeley DB has flexible mechanisms to handle deadlocks automatically; all the application sees is that occasionally transactions fail with a "selected to resolve deadlock" return code/exception, and the application can usually retry the transaction and succeed. But you still want to minimize this.
Wrappers
The Berkeley DB API is direct, clean, and simple, but Steam does not actually use it directly. Steam is written in C++, and there are huge benefits to using thin, templatized wrapper objects around the native API, including:
- Compile-time type safety of serialized objects. The Berkeley DB API is great for storing and retrieving typeless data. However, if you have chosen to serialize C++ objects, it can be important to have the same degree of type safety with database operations as any other object operation.
- Convenient configuration. Berkeley DB offers a staggering degree of flexibility via its many configuration options. Many of these have restrictions on when they can be used (during initialization, during live operation, and so on). The interfaces for these are well designed and the documentation is excellent, but you still have the problem of managing it all in your code. It is hard to keep track of all the configuration parameters if they are scattered throughout your application. It is better to gather all configuration code into a few controlled places.
- This is an ideal use of policy classes, as described in Andrei Alexandrescu's Modern C++ Design. This technique encourages you to put a general interface on configuration and related issues such as error message logging. You then plug a configuration class into your wrapper classes. Databases with different configuration classes are different types, which also aids type safety.
- Apply the Singleton pattern to database and environment wrappers. Databases are obvious Singletons so this is common sense, but there are other benefits to doing this.
- C++ exceptions for elegant, robust error handling. Our team had many years of experience developing in C. We were initially wary of exceptions, having read dire stories of problems with this language feature. So we did careful research before deciding to adopt this technique. The result has been another massive win, both for robustness and productivity. Error-handling code now almost writes itself because it is mostly absent (hidden in stack-unwinding and object destructors). In fact, the combination of exceptions, Design by Contract, and automated regression testing has resulted in the largest step-function in project quality and team productivity that I have ever seen.
The choice of error-handling mechanism may seem irrelevant to database issues, but it is not. Databases apply transactions to data, and C++ exceptions can be thought of as applying transaction principles to code, from the statement level up to the entire application. When the code in question is performing database transactions, the two concepts cooperate beautifully to allow extremely elegant, efficient, and correct code.
Before adopting exceptions, you should read Herb Sutter's Exceptional C++, ensure that you know the STL, and find a good reference-counted smart pointer class (check out shared_ptr [6]). You should also read about and download the terrific ScopeGuard class [7], which is practically essential. The article describing ScopeGuard is a great introduction to considering code and error handling as transactions. Also see Herb Sutter's article on ACID programming [8].
We found that the following wrapper scheme works well:
- Write a thin CDatabaseEnvironment wrapper and templatize it over a CDbEnvConfigPolicy class. Make both of the classes Singletons. Also, you may have some utility code that manages database environments without caring about their template parameters (for example, the communications infrastructure for replication). You want to be able to pass a base-class pointer to such generic code, so have both of these templates inherit from nontemplate, abstract base classes (interfaces).
- Write a thin CDatabase wrapper and templatize it over a CDbConfigPolicy. Also, templatize it over the Key and Record classes that the database will contain. This prevents you from accidentally serializing the wrong type of record object into a database or from misinterpreting the data extracted from the database. For those "gotta cast, I know what I'm doing" moments, you can use the low-level native Berkeley DB API (possibly with your CMultiFieldBlob-like base container). The fact that you are doing so will justifiably draw special attention to that code.
- The Key and Record classes can be any objects that implement a simple serialization interface. In Steam, that interface looks like Example 1. In addition to CMultiFieldBlob, we have other helper classes, such as CSingleFieldKeyOrRecord<typename T>, that implement this interface. This class can be used directly with all POD types (for example, unsigned long for numeric keys/records) and with non-POD classes via template specialization. For example, Steam has:
typedef CSingleFieldKeyOrRecord< std::string > CAccountKey;
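Example 1 is not reproduced in this excerpt, so the following is only a guess at the shape of such a helper: a template that serializes POD types as their raw bytes, with a specialization for std::string, as the article suggests. All names and signatures here are assumptions, not Steam's actual interface.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical single-field key/record helper implementing a simple
// Serialize/Deserialize interface.
template <typename T>
class SingleFieldKeyOrRecord {
public:
    SingleFieldKeyOrRecord() = default;
    explicit SingleFieldKeyOrRecord(const T& v) : value_(v) {}

    // POD types serialize as their raw bytes.
    std::vector<char> Serialize() const {
        const char* p = reinterpret_cast<const char*>(&value_);
        return std::vector<char>(p, p + sizeof(T));
    }
    void Deserialize(const std::vector<char>& buf) {
        std::memcpy(&value_, buf.data(), sizeof(T));
    }
    const T& Value() const { return value_; }

private:
    T value_{};
};

// Non-POD types get a template specialization, as the article suggests
// for std::string.
template <>
class SingleFieldKeyOrRecord<std::string> {
public:
    SingleFieldKeyOrRecord() = default;
    explicit SingleFieldKeyOrRecord(std::string v) : value_(std::move(v)) {}

    std::vector<char> Serialize() const { return {value_.begin(), value_.end()}; }
    void Deserialize(const std::vector<char>& buf) { value_.assign(buf.begin(), buf.end()); }
    const std::string& Value() const { return value_; }

private:
    std::string value_;
};

typedef SingleFieldKeyOrRecord<std::string> AccountKey;
```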
- Write a thin CCursor wrapper and templatize it over the particular instance of the CDatabase<Key, Record, DbConfigPolicy> class with which the cursor is used. The cursor's interface can then refer to the correct key and record classes via nested typedefs declared in the database wrapper class. This again ensures that you cannot extract or store the wrong type of data. This wrapper is also critical for exception safety.
- Write a thin CTransaction wrapper and templatize it over the database environment wrapper. This prevents transactions from being misapplied to operations on databases that live in different environments. Berkeley DB detects this, but why wait for a runtime error when you can prevent it at compile time? This is another wrapper that is critical for exception safety.
The result of using wrappers is that application code can and should look as clean as Listing 1. Such code is correct and safe in the presence of any possible problem without requiring any specific error-handling code.
Listing 2 presents suggested partial interfaces for the CDatabaseEnvironment, CDatabase, ConfigPolicy, and CCursor wrappers. Listing 3 is an implementation of CTransaction. Listing 4 provides example declarations for real environments and databases.
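The listings are not reproduced in this excerpt, but the core type-safety idea of templatizing a wrapper over a config policy can be sketched in miniature. The class names, policy methods, and parameter values below are invented; the point is that all configuration lives in one named policy class, and that differently configured databases become distinct C++ types the compiler can tell apart.

```cpp
#include <string>

// Hypothetical config policies: each gathers one database's configuration
// into a single controlled place.
struct AccountDbConfig {
    static const char* Filename() { return "accounts.db"; }
    static unsigned CacheSizeBytes() { return 8u * 1024 * 1024; }
};

struct UsageDbConfig {
    static const char* Filename() { return "usage.db"; }
    static unsigned CacheSizeBytes() { return 1u * 1024 * 1024; }
};

// Thin wrapper templatized over its config policy. Database<AccountDbConfig>
// and Database<UsageDbConfig> are different types, so passing one where the
// other is expected is a compile-time error.
template <typename ConfigPolicy>
class Database {
public:
    Database() : filename_(ConfigPolicy::Filename()),
                 cacheSize_(ConfigPolicy::CacheSizeBytes()) {}
    const std::string& Filename() const { return filename_; }
    unsigned CacheSize() const { return cacheSize_; }

private:
    std::string filename_;
    unsigned cacheSize_;
};
```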
Berkeley DB requires that only a single thread close a database environment handle. Opening a database environment is trickier as it often involves running a "recovery" operation to bring the environment into a consistent state in case of a previous application crash. If recovery is necessary, it may only be performed by one thread in one process, and no other processes or threads may open or attempt to open the environment while this happens.
The easiest way to meet the first requirement is to open the databases before any worker threads are created and only close them after all worker threads have exited; Steam does this. Alternatively, you could protect your entire Instance() method with a static mutex. (Note that the famous "double-checked lock" pattern for the Instance() function has been proven to be unsafe on some architectures unless explicit memory barriers are added.) The two major problems here are:
- Where and when do you initialize the mutex? The mutex is basically another Singleton.
- Databases are highly concurrent, but a mutex on every Instance() call causes a lot of unnecessary serialization of access. This problem can be somewhat mitigated by the next recommendation.
We found it extremely useful to have the Instance() accessor method return a thread-safe, reference-counted pointer instead of the usual C++ reference. This solution allows fine control over closing all databases and environments when you choose, rather than during atexit() processing after your main() function has exited. To do this, let your Instance() method take an optional behavior flag, bRelinquishPrivateReference. If this flag is set, reset the private static reference owned by the Singleton. The flag argument should default to false, so that normal code doesn't need to worry about it.
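A minimal sketch of this relinquishable Singleton, using std::shared_ptr as the thread-safe reference-counted pointer: the flag name follows the article, but the Database stand-in class and everything else is invented for illustration.

```cpp
#include <memory>
#include <mutex>

// Stand-in for a database wrapper whose destructor "closes" it.
class Database {
public:
    bool IsOpen() const { return open_; }

private:
    bool open_ = true;
};

class DatabaseSingleton {
public:
    // Returns a reference-counted pointer. With bRelinquishPrivateReference
    // set, the Singleton drops its own reference, so the database closes as
    // soon as the last caller releases its pointer -- under the program's
    // control, not during atexit() processing.
    static std::shared_ptr<Database> Instance(bool bRelinquishPrivateReference = false) {
        static std::mutex mtx;  // guards creation and relinquishing
        std::lock_guard<std::mutex> lock(mtx);
        static std::shared_ptr<Database> instance;
        if (bRelinquishPrivateReference) {
            std::shared_ptr<Database> last = instance;
            instance.reset();   // drop the private static reference
            return last;
        }
        if (!instance) instance = std::make_shared<Database>();
        return instance;
    }
};
```

Note that std::shared_ptr gives exactly the concurrency guarantee described in the next paragraph: the count is maintained atomically, so separate threads may safely hold their own copies of the pointer.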
(A thread-safe, reference-counted pointer uses atomic instructions (InterlockedIncrement() on Win32) to maintain the reference count. In this way, multiple threads may all refer to the same allocated object via their own private copies of the reference-counted pointer. Individual smart pointer objects are not safe for unrestrained access by multiple threads, but taking and destroying copies of them in separate threads is safe and efficient: no mutex is required.)
The easiest way to meet the second requirement (exclusive access for the recovery process) is to use a global mutex, perhaps named after the absolute path of the database environment directory. Alternatively, you could use a coordinator process that spawns all child processes after recovery has finished. To keep the process architecture as simple as possible, Steam uses the global mutex approach. All applications that could run recovery on the Berkeley DB environment must use the same mutex, and that is achieved automatically when they all use the same wrapper classes.
Be warned that Berkeley DB's C++ API does not make the best use of exception safety. Two of the fundamental objects exposed, DbTxn (transaction handle) and Dbc (cursor handle), are unsafe and potentially dangerous in the presence of exceptions. Fortunately, this problem is easy to rectify by adding a tiny piece of functionality to the application-level wrappers just mentioned. The transaction wrapper must simply ensure that abort is called in the wrapper destructor if neither commit nor abort has yet been called. A similar wrapper for cursors ensures that close is called if the application has not explicitly done so. This makes transactions and cursors safe for stack unwinding.
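The transaction half of that tiny piece of functionality looks roughly like this. MockTxn is a stand-in for DbTxn (the names Commit/Abort on the wrapper are invented); the one behavior that matters is the destructor's abort-if-not-finalized check, which makes stack unwinding roll the transaction back.

```cpp
#include <stdexcept>

// Stand-in for Berkeley DB's DbTxn handle, recording what was called on it.
struct MockTxn {
    bool committed = false;
    bool aborted = false;
    void commit() { committed = true; }
    void abort()  { aborted = true; }
};

// RAII transaction wrapper: if neither Commit nor Abort has been called by
// the time the wrapper is destroyed (e.g., during stack unwinding after an
// exception), the destructor aborts the transaction.
class Transaction {
public:
    explicit Transaction(MockTxn& txn) : txn_(txn) {}
    ~Transaction() {
        if (!finalized_) txn_.abort();
    }
    void Commit() { txn_.commit(); finalized_ = true; }
    void Abort()  { txn_.abort();  finalized_ = true; }

private:
    MockTxn& txn_;
    bool finalized_ = false;
};
```

A cursor wrapper follows the same pattern with close() in place of abort().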
The transaction and cursor wrappers do not hold references to the databases that they are operating on. This behavior is purely for performance optimization and is generally safe because the Singletons keep the databases open unless explicitly relinquished. To make things completely safe, CCursor should keep a reference-counted pointer to the CDatabase it is navigating, and CTransaction should keep a vector of reference-counted pointers to the CDatabases that it is accessing (whether access is direct or through a cursor). However, these precautions would incur a lot of overhead for very little practical safety.
(The code listings are somewhat simplified for space. In debug mode, the full wrappers cooperate to maintain a count of open cursors in a transaction, as finalizing a transaction with open cursors is illegal and indicates an application bug. Berkeley DB catches such problems and gives a fatal error, but you get better diagnostics with asserts.)
The wrappers described here do not cover all possible cases. When using nested transactions in such a way that the parent may be committed or aborted before the child (or prepared, if it is a distributed transaction), the child is forbidden from calling abort or commit itself, as the child's state was implicitly decided by finalizing its parent. Therefore, the destructor of the child transaction wrapper must somehow detect if its parent(s) have been finalized. Berkeley DB provides no API to detect this condition. To solve this, the wrapper objects would need bidirectional pointers between parents and children so that the act of parent finalization can be propagated to all related children. However, most applications are unlikely to need out-of-order finalization of parents and children. The current wrappers used in Steam keep count of unresolved child transactions and simply forbid out-of-order finalization with asserts.
Return Codes & Exceptions
The Berkeley DB C++ API throws exceptions for most, but not all, error conditions. Some return codes are not converted into exceptions, as they are considered to be valid results. The two most obvious examples are "key does not exist" during a database read and "key already exists" during a database write.
This is a reasonable API, but it leaves the application still having to remember to test return codes to ensure correct behavior. Fortunately, an application usually has some contextual expectation of whether a key/record already exists. By adding a slightly richer interface to make this expectation explicit, you can arrange for exceptions to be thrown in all unexpected cases. This improves code clarity and eliminates any errors due to forgetting to test return codes.
For this reason, the CDatabase::GetRecordBelongingToKey() method returns a new object by value, constructed from the record read from the database. (Almost all compilers implement the Return Value Optimization to eliminate the extra copy operation that this appears to require.) This means that an exception is the only way to signal failure, so we now throw an exception for "no such key," or if the record was found but failed to pass its Validate() check; see Example 2(a).
In reality, almost all transactional operations are done via cursors to avoid the cost of multiple searches for get and put phases and of acquiring page locks twice. So the cursor wrapper has similar methods called MoveToKeyAndGetRecord(), MoveToPositionAndGetRecord(), and so on.
We found that this single idiom (expecting the record to exist) suits almost all application code, at least when considering the database as a set of serialized objects. To support cases where the record might legitimately not exist, we add another interface to the wrapper that does not throw this particular exception; see Example 2(b). The name of this method makes it very unlikely that you'll forget to check the return value, and there are only two possibilities, so no error-prone switch statement is needed. Again, the programmer's intent is much clearer. The cursor wrapper has similar interfaces: AttemptMoveToKeyAndGetRecord(), and so on.
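Since Examples 2(a) and 2(b) are not reproduced in this excerpt, here is a mock of the two read idioms side by side. The std::map "database" and the record-as-string simplification are stand-ins; the method names follow the article's convention.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Mock database showing the two read idioms: the "expect it to exist" call
// throws on absence, while the Attempt... call reports absence via a bool
// whose name makes it hard to ignore.
class MockDatabase {
public:
    void Put(const std::string& key, const std::string& rec) { rows_[key] = rec; }

    // Caller expects the record to exist; absence is an exceptional error.
    // Returned by value, relying on the Return Value Optimization.
    std::string GetRecordBelongingToKey(const std::string& key) const {
        auto it = rows_.find(key);
        if (it == rows_.end())
            throw std::runtime_error("no such key: " + key);
        return it->second;
    }

    // Caller allows absence; only two outcomes, so no error-prone switch.
    bool AttemptGetRecordBelongingToKey(const std::string& key,
                                        std::string& rec) const {
        auto it = rows_.find(key);
        if (it == rows_.end()) return false;
        rec = it->second;
        return true;
    }

private:
    std::map<std::string, std::string> rows_;
};
```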
The SwapNoThrow() method is a vital tool for exception safety (see the discussion of assignment operators in Sutter's Exceptional C++), but the method is also useful for moving object contents between placeholder objects in different scopes. The aforementioned receiver could have been an std::auto_ptr<CAccountRecord>, but that would require an extra heap allocation for the CAccountRecord. Steam's CMultiFieldBlob class has a fast default constructor (does no heap allocation), so the SwapNoThrow() idiom is an efficient, elegant choice when you expect to find a record but want to use a default only if the record is not found.
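The SwapNoThrow() idiom can be sketched like this. The Record class, the lookup function, and its hard-coded "hit" key are all hypothetical; the point is the combination of a cheap default constructor for the placeholder and a non-throwing swap to move contents out of the lookup's temporary without a copy or a heap allocation.

```cpp
#include <string>
#include <utility>

// Hypothetical record with a fast default constructor and a non-throwing
// swap, in the spirit of CMultiFieldBlob.
class Record {
public:
    Record() = default;                          // cheap: no real payload yet
    explicit Record(std::string data) : data_(std::move(data)) {}
    void SwapNoThrow(Record& other) noexcept { data_.swap(other.data_); }
    const std::string& Data() const { return data_; }

private:
    std::string data_;
};

// Hypothetical lookup: fills 'receiver' only when the key is present,
// leaving the caller's default contents untouched on a miss.
inline bool AttemptLookup(const std::string& key, Record& receiver) {
    if (key != "known") return false;            // stand-in for a database miss
    Record temp(std::string("payload"));         // built in the lookup's scope
    receiver.SwapNoThrow(temp);                  // move contents out, no throw
    return true;
}
```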
Berkeley DB's Native DbException Class
Berkeley DB's C++ API throws instances of its own exception class DbException, which unfortunately is not derived from std::exception. This is apparently done for compiler portability reasons, but it doesn't sit well with most application exception handling; you have to add specific catch clauses for this new type. Our wrapper classes catch all DbExceptions and rethrow them as types that are derived from std::exception.
Another reason to do this if you're using Visual C++ 6.0 is that the compiler has a code-generation bug that causes thrown exception objects to be destroyed twice if they are caught by catch (...) and rethrown. DbException causes a crash if it is destroyed twice (hardly Sleepycat's fault!), but our own exception base class was written to be safe for double destruction to work around this bug.
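The catch-and-rethrow translation at the wrapper boundary looks roughly like this. MockDbException stands in for the real DbException (deliberately not derived from std::exception), and DatabaseError and the operation function are invented names for illustration.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Stand-in for Berkeley DB's DbException: note it is deliberately NOT
// derived from std::exception, mirroring the real class.
class MockDbException {
public:
    explicit MockDbException(std::string what) : what_(std::move(what)) {}
    const std::string& what_str() const { return what_; }

private:
    std::string what_;
};

// Application-level exception type that ordinary catch clauses can see.
class DatabaseError : public std::runtime_error {
public:
    using std::runtime_error::runtime_error;
};

// Wrapper-layer call: translate the foreign exception type at the boundary,
// so the rest of the application only ever handles std::exception.
inline void WrappedDbOperation(bool fail) {
    try {
        if (fail) throw MockDbException("page fault in database");
    } catch (const MockDbException& e) {
        throw DatabaseError(e.what_str());
    }
}
```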
At this point, we have wrapped most of the core database API. Unfortunately, the Berkeley DB C++ API is providing little benefit now; in fact, the catch/rethrow business is very inconvenient and inefficient. So you might consider implementing your wrappers in terms of the native C API instead. The Berkeley DB C++ API is just a thin layer on top of the C API that throws exceptions for some, but not all, error codes. Your wrapper classes can do the same.
References
- [1] http://www.steampowered.com/.
- [2] http://www.sleepycat.com/.
- [3] http://otl.sourceforge.net/.
- [4] http://www.oofile.com.au/xml/expatpp.html.
- [5] http://www.valvesoftware.com/legal.htm.
- [6] http://www.boost.org/.
- [7] http://www.cuj.com/documents/s=8000/cujcexp1812alexandr/.
- [8] http://www.gotw.ca/gotw/061.htm.