Dennis is a professor of computer science at New York University and a DDJ contributing editor. He can be contacted at [email protected]. Philippe is an assistant professor at the University of Copenhagen and can be contacted at [email protected]. They are also the authors of Database Tuning: Principles, Experiments, and Troubleshooting Techniques (Morgan Kaufmann, 2002).
Consider the following puzzle. Say you have two databases, X and Y. Heretofore, each has been an island of data serving its own department. For example, X records inventory data to be used for restocking various warehouses and Y contains sales data saying what should be shipped where. Now you want to tie the databases together to improve shipping. In particular, you want certain data from X to be postprocessed in Y shortly after it enters X. You are allowed to add or change transactions in X and Y. However, neither department wants to allow distributed transactions that use two-phase commit. So any process that moves data from X to Y consists of transactions, each of which applies to X alone or to Y alone. You want to avoid losing data from X and you want to avoid double processing the data on Y, even when there may be failures of X or Y or both. How do you do it?
Whether or not you are ready to solve this puzzle, you will recognize the situation: islands of data need bridges between them. It is relatively easy to build a solution that usually works when little data goes through. It is much harder to solve the problem in a fault-tolerant way and at high speed. The same holds for the apparently simpler problem of connecting a programming language to a single database. In fact, we'll discuss that form of smooth talking first.
Tuning the Application Interface
In our book Database Tuning: Principles, Experiments, and Troubleshooting Techniques (Morgan Kaufmann, 2002; ISBN 1558607536), we discuss a set of principles backed by experiments for tuning the interface between databases and application programs. Here we focus on the recurrent problems, beginning with the mapping of objects in the application program to tables in the database.
Object-oriented encapsulation lets the implementation of one class be modified without affecting the rest of the application, thus contributing greatly to code maintenance. Encapsulation is sometimes interpreted as "the specification is all that counts." That interpretation can lead to horrible performance.
The problem begins with the fact that the most natural object-oriented design on top of relational databases is to make records (or sometimes fields) into objects. Fetching one object then translates to a fetch of a record or field. So far, so good.
But then, the temptation is to build bulk fetches from fetches on little objects (the "encapsulation imperative"). The net result is a proliferation of small queries instead of one large one.
Consider, for example, a system that delivers and stores documents. Each document type (a report on a customer account, for instance) is produced according to a certain schedule that may differ from one document type to another. Authorization information relates document types to users. This gives a pair of tables of the form:
authorized(user, documenttype)
documentinstance(id, documenttype, documentdate)
When users log in, the system should say which document instances they can see. This can easily be done with the join:
select documentinstance.id, documentinstance.documentdate
from documentinstance, authorized
where documentinstance.documenttype = authorized.documenttype
and authorized.user = <input user name>
But if each document type is an object and each document instance is another object, then you may be tempted to write this code:
// One query to fetch the user's document types...
Authorized authdocs = new Authorized();
authdocs.init(<input user name>);
// ...then one more query per document type inside the loop.
for (Enumeration e = authdocs.elements(); e.hasMoreElements();)
{
    DocInstance doc = new DocInstance();
    doc.init(e.nextElement());
    doc.print();
}
This program first issues one query to find all the document types for the given user (within the init method of the Authorized class):
select authorized.documenttype
from authorized
where authorized.user = <input user name>
and then for each such type t issues the query (within the init method of the DocInstance class):
select documentinstance.id, documentinstance.documentdate
from documentinstance
where documentinstance.documenttype = t
This is much slower than the previous SQL formulation: the join is effectively performed in the application rather than in the database server, and each document type costs an extra round trip to the database.
Our point is not that object orientation is bad. Encapsulation contributes to maintainability. No, the point is that programmers should keep their minds open to the possibility that accessing a bulk object (a collection of documents, for example) should be done directly, rather than by forming the member objects individually and grouping them into a bulk object on the application side. Figure 1 illustrates the performance penalty of looping over small queries rather than getting all necessary data at once. This graph compares two programs that obtain 2000 records from a large table (lineitem from TPC-H). The loop program submits 200 queries to obtain this data, while the no loop program submits only one query and thus displays much better performance.
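For concreteness, here is a minimal JDBC sketch of the set-oriented alternative; the method name and connection handling are placeholders, while the table and column names come from the example above. The application sends the join once and iterates over a single result set, so the database server does the join and only one query crosses the interface.

import java.sql.*;

public class AuthorizedDocs {
    // Print every document instance visible to one user, with a single query.
    public static void printDocsFor(Connection conn, String userName) throws SQLException {
        String query =
            "select documentinstance.id, documentinstance.documentdate " +
            "from documentinstance, authorized " +
            "where documentinstance.documenttype = authorized.documenttype " +
            "and authorized.user = ?";
        try (PreparedStatement ps = conn.prepareStatement(query)) {
            ps.setString(1, userName);                 // bind the input user name
            try (ResultSet rs = ps.executeQuery()) {   // one round trip for the query
                while (rs.next()) {
                    System.out.println(rs.getString("id") + " " + rs.getString("documentdate"));
                }
            }
        }
    }
}

Binding the user name as a parameter also lets many systems reuse the query plan across calls.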
Object orientation is not the only programming habit that can lead to bad performance. Programmers who have grown up with programming language loops find a familiar idiom in cursors. Unfortunately, the performance of cursors is horrible in almost all systems. One of us (Shasha) once had the experience of rewriting an eight-hour query having nested cursors into a cursor-free query that took 15 seconds. Figure 2 illustrates a less dramatic cursor penalty with a simple experiment. In this case, the experiment consists of retrieving 200,000 rows from the table Employee (each record is 56 bytes) using either a set-oriented formulation (SQL) or a cursor that iterates over the table contents (cursor). Using the cursor, records are transmitted from the database server to the application one at a time. The query takes a few seconds with SQL and more than an hour using a cursor. This experiment was run on SQL Server 2000 on Windows 2000.
The moral of these experiments is that you should retrieve all the data you need in one fell swoop to avoid many round trips across the application-to-database interface. It is important, however, not to retrieve more than is necessary. Even though it is easier to type select * from employee than select lastname, firstname from employee, the database usually has to do more work for the former than the latter.
There are two reasons to retrieve only needed columns, one obvious and the other less so, as illustrated in Figure 3:
- The obvious reason is that retrieving an unneeded column causes unnecessary data to be transferred.
- A subtle reason is that retrieving an unneeded column may prevent a query from being answered within (being covered by) the index. For example, if there is a dense composite index on last name/first name, then a query that asks for all first names of people whose last name is "Codd" can be answered by the index alone.
Figure 3 compares the retrieval of one-fourth of the columns with the retrieval of all columns using select *. We performed this experiment in two situations: without indexes and with a nonclustering index covering the retrieved columns.
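The effect is easy to reproduce. Below is a minimal JDBC sketch, assuming the employee table and lastname/firstname columns mentioned above; the index name and connection setup are illustrative. The first query touches only columns that are in the composite index, so most systems can answer it from the index alone; the select * version forces a visit to the underlying rows as well.

import java.sql.*;

public class CoveringIndexDemo {
    public static void run(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // Composite index on (lastname, firstname); the name is illustrative.
            stmt.executeUpdate(
                "create index emp_name_idx on employee (lastname, firstname)");

            // Covered query: every column it touches is in the index.
            try (ResultSet rs = stmt.executeQuery(
                    "select firstname from employee where lastname = 'Codd'")) {
                while (rs.next()) System.out.println(rs.getString("firstname"));
            }

            // Not covered: select * must also fetch the full rows.
            try (ResultSet rs = stmt.executeQuery(
                    "select * from employee where lastname = 'Codd'")) {
                while (rs.next()) { /* consume the wider rows */ }
            }
        }
    }
}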
To this point, we've focused on retrieving data. Rapidly inserting data requires understanding the sources of overhead of putting a record into the database:
- Excessive number of round trips across the database interface. This occurs if the batch size of inserts is too small. In fact, for batch sizes of up to 100,000 rows, increasing the batch size improves performance on some systems. In Figure 4, for instance, we used the BULK INSERT command to load 600,500 tuples into the lineitem relation on SQL Server 2000 on Windows 2000, varying the number of tuples loaded in each batch. The graph shows that throughput increases steadily until the batch size reaches 100,000 tuples, after which there seems to be no further gain. This suggests that a satisfactory trade-off can be found between performance (the larger the batch the better, up to a certain point) and the amount of data that has to be reloaded if a problem occurs while loading a batch (the smaller the batch the better). A JDBC sketch of the same batching idea at the application level appears after this list.
- Updating all the indexes on the table. As Figure 5 shows, even a single index can hurt insert performance because of the extra work each insertion must do to maintain it. Here, we insert 100,000 records in the table Order(ordernum, itemnum, quantity, purchaser, vendor) and measure throughput with and without a nonclustered index defined on the ordernum attribute. The presence of the index significantly reduces performance.
- Layers of software within a database system can get in the way. Database systems provide bulk-loading tools that achieve high performance by bypassing some of the database layers (mostly having to do with transactional recovery) that would be traversed if single-row INSERT statements were used, as in Figure 6. SQL*Loader, for instance, is a tool that bulk loads data into Oracle databases. It can be configured to bypass the query engine of the database server (using the direct path option). Figure 6 shows the performance benefits obtained by bypassing the SQL engine (conventional usage of SQL*Loader with a commit every 100 records) and the storage manager (direct path option of SQL*Loader), compared to the performance of plain inserts (using one thread and a commit after each insertion). These results were obtained by inserting 600,500 tuples into the lineitem relation.
- The SQL Server BULK INSERT command and SQL*Loader let users define the number of rows or the number of kilobytes per batch. The minimum of the two is used to determine how many rows are loaded in each batch. There is a trade-off between the performance gained by minimizing the transaction overhead in the omitted layers and the work that has to be redone in case a failure occurs.
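If the rows come from an application rather than a load file, the same batching principle applies at the interface level. Here is a minimal JDBC sketch, assuming a table like the Order table above (renamed orders here to avoid the SQL keyword) and an illustrative batch size; whether a batch really travels in one round trip depends on the driver, but grouping the inserts and committing once per batch removes most of the per-row overhead, and the smaller the batch, the less work is lost if a batch fails.

import java.sql.*;
import java.util.List;

public class BatchInsert {
    // Illustrative batch size; tune it per system, as in Figure 4.
    static final int BATCH_SIZE = 10000;

    public static void load(Connection conn, List<Object[]> rows) throws SQLException {
        conn.setAutoCommit(false);    // commit per batch, not per row
        String sql = "insert into orders (ordernum, itemnum, quantity) values (?, ?, ?)";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            int inBatch = 0;
            for (Object[] r : rows) {
                ps.setInt(1, (Integer) r[0]);
                ps.setInt(2, (Integer) r[1]);
                ps.setInt(3, (Integer) r[2]);
                ps.addBatch();                       // buffered locally, no round trip yet
                if (++inBatch == BATCH_SIZE) {
                    ps.executeBatch();               // ship the whole batch at once
                    conn.commit();                   // one commit per batch
                    inBatch = 0;
                }
            }
            if (inBatch > 0) { ps.executeBatch(); conn.commit(); }
        }
    }
}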
Smooth Talking Between Databases
There are several ways to get data from collections of databases and to move data among those collections, each with its own sweet spot.
A "federated data warehouse" offers a single-system view of a set of independent databases. Physically, the federated data warehouse consists of the data on each database plus one more server the "federator," which does the necessary final joins and may hold data itself. IBM's DataJoiner (http://www.ibm.com/software/data/datajoiner/), a prototypical example of a data federator, attempts to ensure that the sources do as much work as possible and return as little data as possible for additional processing at the federated server. Updates in federated data warehouses take place on one database at a time without any guarantees of synchronization.
Our puzzle requires the transfer of data between one database and the other, then the "exactly once" processing of that data. So federated data warehouses are inappropriate to solve the puzzle.
Two-phase commit is the standard academic recommendation for implementing synchronized updates across many sites. As its name suggests, the protocol works in two phases. In the first phase of committing a transaction T, a coordinator site asks the various data servers who have done work for T to put their updates on their local "stable storage" (normally some form of disk). If any data server doesn't respond or indicates failure, then in the second phase the coordinator tells all of the remaining servers to drop their updates, based on the principle of "atomicity": all of a transaction's updates should be committed or none should be.
On the other hand, if all data servers respond positively to this first request saying they have precommitted, then in the second phase the coordinator (which is normally a data server itself) tells each server to commit.
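To make the two phases concrete, here is a schematic sketch of the coordinator's side of the protocol. It is plain Java, not a real XA/JTA implementation; the Participant interface is hypothetical, and writing the decision to the coordinator's own stable storage (essential for recovery) is reduced to a comment.

import java.util.List;

// What each data server must expose to the coordinator.
interface Participant {
    boolean prepare();   // phase 1: force updates to local stable storage, vote yes/no
    void commit();       // phase 2: make the updates permanent
    void abort();        // phase 2: drop the updates
}

class Coordinator {
    boolean runTwoPhaseCommit(List<Participant> servers) {
        // Phase 1: ask every server to precommit.
        boolean allYes = true;
        for (Participant p : servers) {
            try {
                if (!p.prepare()) { allYes = false; break; }
            } catch (RuntimeException noResponse) {
                allYes = false;          // no response counts as a "no" vote
                break;
            }
        }
        // The real protocol forces the commit/abort decision to the
        // coordinator's stable storage here, before starting phase 2.
        // Phase 2: tell every server the outcome.
        for (Participant p : servers) {
            if (allYes) p.commit(); else p.abort();
        }
        return allYes;
    }
}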
This protocol is subject to "blocking": if the coordinator fails, the servers don't know whether to commit or abort the transaction. Even if a new coordinator is elected, the remaining servers still won't know what to do if the failed coordinator was itself one of the servers. Thus, the failure of one site may cause another one to hold its locks indefinitely. Blocking scares many people.
An additional practical problem is that many database applications fail to provide an interface to the first, precommitting phase of two-phase commit. So, for our puzzle, two-phase commit is rejected.
A popular alternative to two-phase commit is a replication server. Using a replication server to put the primary's (X's) updates on the backup (Y's) database entails performing a full dump of the database state to the backup site nightly. During the day, all modifications performed on the primary are sent to the backup (typically, as SQL operations) after they are committed on the primary. With a replication server, the backup is within a few seconds of being up to date. Further, a replication server allows decision-support queries to be answered by the backup machine. Finally, the backup machine can become the primary in a matter of seconds. Replication servers can be a huge administrative headache, however, requiring administrators to keep table schemas in synchrony, ensure that triggers are disabled at the backup site, and be ready when the network connection breaks. Also, the backup is "warm," not "hot," so some committed transactions can be lost.
An emerging alternative to a replication server is to rely on the features of storage area networks (SANs). For instance, EMC Corp. (http://www.emc.com/) offers SANs that do replication for all or part of a storage device. The replication may go from device A to B for some disk blocks and from B to A for other disk blocks. The replication can be synchronous or asynchronous and can be toggled on/off. When replication is turned off and then turned on, resynchronization brings the backup up to date with respect to the primary. Other replication options are possible. For example, an online transaction storage device can be replicated to a data warehouse storage device every night and then turned off during the day, both to avoid overhead on the primary and to present a consistent image of the database to warehouse users. The great advantage of SANs is that the database administration required for replication is greatly reduced, since the database management system need not even be aware of the replication. This gain may overshadow the hardware costs.
In deciding whether a SAN is appropriate, the following rule of thumb is helpful: If the items to be replicated are small and widely dispersed, then the SAN approach is less attractive because it replicates entire blocks. On the other hand, databases that are rewritten sequentially and in bulk are good targets for SANs. So, SANs may or may not work for our puzzle.
Finally, we present an approach to the puzzle that is used in airplanes and other safety-critical applications: a replicated state machine. As Figure 7 illustrates, both databases start from the same state, receive operations in the same order, and perform them one at a time, so they end up in the same state. (The operations should be deterministic; that is, they depend neither on time nor on random events, but only on the data they read.) The virtue of such a solution is that no commit-time coordination is necessary. The vice is that no concurrency is allowed either, unless you can preanalyze transaction texts to see if they have conflicts. So that won't work for our puzzle either.
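Though it doesn't fit the puzzle, the technique itself is simple. Here is a minimal sketch, assuming each operation is a deterministic SQL statement drawn from a single ordered log; where the log comes from and how the connections are obtained are left out.

import java.sql.*;
import java.util.List;

public class ReplicatedStateMachine {
    // Apply the same ordered, deterministic operations to both databases.
    public static void apply(List<String> orderedOps, Connection dbX, Connection dbY)
            throws SQLException {
        for (String op : orderedOps) {
            // Each operation runs as a local transaction on each database;
            // determinism means both databases end up in the same state.
            try (Statement sx = dbX.createStatement()) { sx.executeUpdate(op); }
            try (Statement sy = dbY.createStatement()) { sy.executeUpdate(op); }
        }
        // A production version would also record, per database, how far into
        // the log it has gotten, so a restarted replica can catch up.
    }
}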
Solution to the Puzzle
Here's one solution to the puzzle. Establish a staging table on X that contains data to be postprocessed on Y. On X, any transaction that adds data to be used on Y puts data in the staging table S, perhaps via a trigger. When a row is added to S by such a transaction, it is marked Xonly. On Y, there is an input table I. The data is moved and processed in carefully partitioned transactional chunks.
As shown in Figure 8, a staging table S is created on database X and an input table I is created on database Y. A trigger puts data in table S whenever a transaction adds data on X to be used on Y; this data is marked Xonly. Four connection supertransactions are then issued repeatedly to enable postprocessing on Y:
- M1, a single transaction, marks the Xonly data in S as Yunprocessed.
- M2 reads the Yunprocessed data from table S on X in a first transaction, then writes it to table I on Y in a second transaction.
- M3, a single transaction, processes a data item in I and marks it processed in I.
- M4 issues two transactions: the first deletes the Yunprocessed items from table S, and the second deletes the processed items from I.
Note that supertransactions M1 through M4 are executed in order, while the trigger might put data in S at any point in time.
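To make the chunking concrete, here is a minimal JDBC sketch of the mover supertransaction M2, assuming hypothetical schemas S(id, payload, state) on X and I(id, payload, state) on Y; the column names and state values are illustrative, and M1, M3, and M4 would be written the same way from the descriptions above. The key point is that each transaction touches exactly one database.

import java.sql.*;
import java.util.*;

public class Mover {
    // M2: move Yunprocessed rows from staging table S on X to input table I on Y.
    public static void m2(Connection x, Connection y) throws SQLException {
        x.setAutoCommit(false);
        y.setAutoCommit(false);

        // Transaction 1 (on X only): read the rows marked Yunprocessed.
        List<Object[]> moved = new ArrayList<>();
        try (Statement sx = x.createStatement();
             ResultSet rs = sx.executeQuery(
                 "select id, payload from S where state = 'Yunprocessed'")) {
            while (rs.next()) moved.add(new Object[]{ rs.getInt(1), rs.getString(2) });
        }
        x.commit();   // first transaction ends; it touched only X

        // Transaction 2 (on Y only): write those rows into I.
        try (PreparedStatement py = y.prepareStatement(
                 "insert into I (id, payload, state) values (?, ?, 'unprocessed')")) {
            for (Object[] row : moved) {
                py.setInt(1, (Integer) row[0]);
                py.setString(2, (String) row[1]);
                py.executeUpdate();
            }
        }
        y.commit();   // second transaction ends; it touched only Y
        // If Y fails before this commit, the rows are still marked Yunprocessed
        // in S, and the postrecovery step M5 reinserts them (see below).
    }
}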
Consider various failure scenarios. If Y fails, there may be rows marked Yunprocessed in S that have not made it to Y; a postrecovery process M5 can simply insert into I the rows that are marked Yunprocessed in S but are not in I. Another possible problem is that rows marked processed in I have already been deleted from S on X but not yet from table I on Y. This is harmless because M4 will take care of those rows the next time around. What cannot happen is that rows are marked processed without actually having been processed, because M3 is a single transaction within database Y. Nor can data be processed twice, because deleting the processed rows from I is the second transaction of M4, by which point that work appears in no other table. If X fails, data simply stops arriving in the staging table, but the state markings remain consistent because all processes that change both X and Y change X first.
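The postrecovery step M5 can be written the same way, again with no distributed transaction. Under the same hypothetical schemas, a sketch:

import java.sql.*;
import java.util.*;

public class PostRecover {
    // M5: after Y recovers, reinsert into I any rows that are marked
    // Yunprocessed in S but never made it into I.
    public static void m5(Connection x, Connection y) throws SQLException {
        // Read the candidate rows from S on X.
        Map<Integer, String> candidates = new LinkedHashMap<>();
        try (Statement sx = x.createStatement();
             ResultSet rs = sx.executeQuery(
                 "select id, payload from S where state = 'Yunprocessed'")) {
            while (rs.next()) candidates.put(rs.getInt(1), rs.getString(2));
        }

        // Drop the ones already present in I on Y (in practice, restrict
        // this query to the candidate ids), then insert the rest on Y.
        try (Statement sy = y.createStatement();
             ResultSet rs = sy.executeQuery("select id from I")) {
            while (rs.next()) candidates.remove(rs.getInt(1));
        }
        try (PreparedStatement py = y.prepareStatement(
                 "insert into I (id, payload, state) values (?, ?, 'unprocessed')")) {
            for (Map.Entry<Integer, String> e : candidates.entrySet()) {
                py.setInt(1, e.getKey());
                py.setString(2, e.getValue());
                py.executeUpdate();
            }
        }
    }
}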
Conclusion
One of the major principles of tuning any system for speed is to avoid repeated startup costs. This holds for application-to-database communication as well: You want to move as much data as you can between your application and database system with each call.
Interdatabase communication obeys the same rule, but fault tolerance doesn't come as easily because it no longer rests on a single system's transactional guarantees. Data to be sent for much later processing should use a SAN if you can afford one. Data for immediate processing can use replication servers, if up-to-the-second reliability is not essential. Data that must enjoy transactional guarantees can use two-phase commit if your client isn't afraid of it or a staging table if your client is. A replicated state machine works well if your database can do without concurrency control (usually a home-grown database or main memory one).
Database tuning requires an understanding of physics (how disks work), engineering (it's bad to cross interfaces unnecessarily), logic (how to back out of failures), and common sense (knowing what is important to optimize). Often, a few simple experiments can tell you a lot about how to make your system run faster. To help you test the various components of a database system, we've developed a set of SQL scripts that exercise everything from the performance and correctness of the system's concurrency control methods (most are not correct in default mode), to the performance benefits of various indexes and the effectiveness of rewriting. Many of the scripts work against standard benchmarks (TPC-H) and can be scaled to whichever size is most relevant to your application. You can run the scripts from your favorite SQL client or use a tool (based on ODBC and OCI) we provide so that you can run the scripts from multiple threads in parallel and measure response time. The scripts are available at http://www.mkp.com/dbtune/. We encourage you to run and modify our experiments and see for yourself the performance implications of different tuning options on your system. Get down and be smooth.
DDJ