Arch is a principal software engineer at Intel and a member of ECMA TC39/TG3, the group responsible for drafting the ECMA/ISO Standards for the .NET Common Language Infrastructure (CLI). He can be contacted at [email protected]. This article is strictly his interpretation of the Standards.
With the advent of multithreaded and multicore chips, multiprocessor computers are becoming widespread, and not just for servers (I'm writing this article on a dual-processor personal computer, for instance). While it might seem that multithreaded code written for uniprocessors should port to multiprocessors without change, this is not the case. One reason is that modern multiprocessors introduce the notion of memory consistency, in which different threads can see updates to memory in different orders. Understanding the basics of memory consistency is essential to writing multithreaded code that works on both uniprocessors and multiprocessors.
Models of memory consistency vary with hardware to the point where programmers have a hard time keeping up with the latest fashion. In this article, I focus on the release-consistency model because it is both the standard for .NET (and possibly a future version of Java) and a lowest common denominator. Code written for release consistency should work across the major microprocessors, such as the Alpha, PowerPC, Sparc, and IA-32.
Sequential Consistency
In single-processor systems, concurrent threads execute via timeslicing. At any instant, one thread runs while other threads sleep. If there is a write to location X followed by a write to location Y, all running threads agree that the writes happened in that order, and all threads that slept through the writes agree that the writes happened simultaneously. No thread sees Y written before X.
This model is called "sequential consistency" because memory behaves as if there were a single sequential ordering of all instructions executed. It is the model most programmers intuitively assume. For example, consider the common problem of sending a message between two threads. Example 1 presents two shared variables: R is a flag that is presumed to be 0 before the threads start, and M is a message buffer. The sender, Example 1(a), writes a message to M and sets R. The receiver, Example 1(b), waits for R to be set and reads the message from M. Sequential memory consistency guarantees that when the receiver sees R become 1, the sender has completed all writes to M.
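Example 1 is referenced but not reproduced here; a minimal C# reconstruction of the idiom (illustrative names and values, not the original listing) looks like this:

```csharp
using System;

class MessageExample {
    // Shared state; R is presumed to be 0 before either thread starts.
    static int[] M = new int[4];    // message buffer
    static int R = 0;               // "ready" flag

    // Example 1(a): the sender writes the message, then raises the flag.
    static void Send() {
        M[0] = 33; M[1] = 42; M[2] = 55; M[3] = 66;
        R = 1;                      // signal that M is complete
    }

    // Example 1(b): the receiver waits for the flag, then reads the message.
    static void Receive() {
        while (R != 1) { /* busy-wait */ }
        Console.WriteLine(M[0] + M[1] + M[2] + M[3]);
    }
}
```

Under sequential consistency, the receiver is guaranteed to see the completed message once it sees R become 1.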
But sequential consistency is currently out of fashion. Modern processors sport relaxed memory consistency, in which receivers might see the sender's actions in a different order. For instance, receivers might see R become 1 before seeing all of the sender's writes to M. Because the reordering is counterintuitive and breaks the example, it might seem that system architects have gone daft. In fact, they are being clever, motivated by the usual cause of computer screwiness: efficiency. Sequential memory consistency severely hinders modern hardware and optimizing compilers.
The hardware inefficiency arises from the mismatch between processor and external bus speed. Processors maintain speed by caching data locally and minimizing accesses through the external bus to main memory. But sequential consistency requires that these caches agree on the order of memory updates, which requires interprocessor communication (snooping) across the slow external bus. Tricks such as speculative execution can ameliorate, but not eliminate, the overhead.
The compilation inefficiency can be even worse, and arises even if the multithreaded code is running on a uniprocessor. Again, consider the message-passing example in Example 1. If the compiler knows that M and R are distinct locations, it is free to rewrite the code as in Example 2. Clearly, such optimizations break the example. C/C++ programmers know, of course, that the solution is to declare both M and R as volatile, so that the compiler knows not to reorder accesses to them. Doing so is a heavy-handed solution because it effectively turns off compiler optimizations related to M and R everywhere, not just in the portion of the code concerned with transmitting M.
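Example 2 is likewise not reproduced here, but the kind of rewrite meant is easy to sketch (hypothetical transformed code, not the original listing):

```csharp
// Knowing that M and R are distinct, the compiler is free to reorder accesses.
static void SendRewritten() {
    R = 1;                          // flag raised before the message is written!
    M[0] = 33; M[1] = 42; M[2] = 55; M[3] = 66;
}

static void ReceiveRewritten() {
    int m0 = M[0], m1 = M[1], m2 = M[2], m3 = M[3];  // reads hoisted above the loop
    while (R != 1) { }              // R might even be cached in a register
    Console.WriteLine(m0 + m1 + m2 + m3);
}
```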
What you want is a contract between sender and receiver that enforces the necessary orderings, but no more. From the sender's viewpoint, the contract should guarantee that the receiver reads R (and sees it change to 1) before it begins to read M. From the receiver's viewpoint, the contract should guarantee that the sender writes R only after it finishes writing M. The contract should otherwise leave the hardware and compiler free to reorder or optimize operations on M.
Release Consistency
Release consistency is a form of relaxed consistency that lets you express the ideal contract. It introduces special operations to enforce partial orderings. At the hardware level, the operations are called "acquire" and "release."
- Acquire guarantees that no subsequent memory operation (on any location!) is "moved up" in time before the acquire.
- Release guarantees that no prior memory operation (again, on any location) is "moved down" in time after the release.
The guarantee is only unidirectional. A bidirectional guarantee is called a "memory fence."
In .NET, "acquire" and "release" are tied to special reads and writes, respectively called "volatile reads" and "volatile writes." Fences are supported in .NET by the method System.Threading.Thread.MemoryBarrier(), which is part of the ECMA Standard for .NET (but not in Microsoft's release as of this writing). Henceforth, when I say "volatile," I mean the .NET semantics, not the C/C++ semantics. In C/C++, declaring a location X as volatile affects only accesses to that location. In .NET, a similar declaration not only affects accesses to X, but also the ordering of those accesses relative to accesses to other locations.
The partial ordering guarantees of volatile reads/writes are mysterious until you see them in practice. Message passing is the key idiom. To get the message-passing example of Example 1 (diagrammed in Figure 1) right in .NET, you declare R (but not M) as volatile. Doing so guarantees the orderings you need without the overkill of sequential consistency.
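In C#, the fix is a single keyword. A sketch, reusing the shared M and R from the earlier reconstruction:

```csharp
static int[] M = new int[4];    // ordinary buffer; accesses remain free to be optimized
static volatile int R = 0;      // .NET volatile: writes are releases, reads are acquires

static void Send() {
    M[0] = 33; M[1] = 42; M[2] = 55; M[3] = 66;   // may be reordered among themselves
    R = 1;              // volatile write: the writes to M cannot move below this point
}

static void Receive() {
    while (R != 1) { }  // volatile read: the reads of M cannot move above this point
    Console.WriteLine(M[0] + M[1] + M[2] + M[3]);
}
```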
Figure 1 details why this works by showing three viewpoints of what happens. The sender's actions are in red; the receiver's in blue. The ordinary memory operations are denoted by dots; the special ones by diode symbols. Each diode points in the direction that memory operations may migrate. Migration in the other direction is prohibited.
The sender sees its own actions as totally ordered. It writes the elements of M sequentially, and then writes R. From its viewpoint, the receiver repeatedly reads R. When the receiver reads a 1 from R, it subsequently reads the elements of M. From the sender's viewpoint, the receiver reads the elements of M in no particular order.
The receiver sees a symmetrically different view. As far as the receiver is concerned, the sender's writes were unordered, except that the volatile write to R came after all the unordered writes to elements of M. The receiver sees its own actions as totally ordered.
The two views differ, but are consistent. Each thread is convinced that the other thread is indulging in an eccentric, but correct, ordering of memory operations. A hardened theorist would prove the correctness of the algorithm strictly from these relativistic views. But pragmatically, it's often easier to see correctness from a third mutual viewpoint, as on the right of Figure 1. This view incorporates the weakest promises made by both parties. It shows both the sender and receiver as performing unordered operations on M. But the sender's writes of M cannot be moved downward past the volatile write to R, and the receiver's reads of M cannot be moved upward past the volatile read of R. Thus, the diodes form a temporal fence (see the green box in Figure 1) that forces writing of M to precede reading of M.
This example illustrates the beauty of the release/acquire discipline over explicit memory fences: The synthesized fence is minimal, and is located exactly where you need it. If only explicit fences were available, two fences would be necessary: one executed by the sender between the writes to M and R, and one executed by the receiver between the reads from R and M.
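For contrast, here is a sketch of the fence-only version, with R left nonvolatile (it assumes, optimistically, that the compiler still re-reads R on every loop iteration; Thread.MemoryBarrier is in System.Threading):

```csharp
static void SendWithFences() {
    M[0] = 33; M[1] = 42; M[2] = 55; M[3] = 66;
    Thread.MemoryBarrier();  // sender's fence: between the writes to M and the write to R
    R = 1;
}

static void ReceiveWithFences() {
    while (R != 1) { /* assumes R is genuinely re-read each iteration */ }
    Thread.MemoryBarrier();  // receiver's fence: between the read of R and the reads of M
    Console.WriteLine(M[0] + M[1] + M[2] + M[3]);
}
```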
The diode orientations do not prevent the volatile write/read from being swapped. That ordering is implicit in the busy loop: If the volatile write/read of R were swapped, the read would find a 0, not a 1, and the receiver's loop would issue another volatile read.
This raises the more general point that volatile operations are not guaranteed to be sequentially consistent. The left side of Figure 2 illustrates two processes that each write to a location, and then read the location that the other one wrote. It is legal for the compiler or hardware to reorder the code so that both reads are performed before both writes. In other words, nothing prevents the diodes from migrating past each other in the diagram, so that the events look like the right side of Figure 2. It is a one-way transformation, however; the reverse transformation is blocked by the diodes' orientations. The degenerate left side of Figure 2 also demonstrates why a memory fence cannot be built by simply pairing a volatile read and volatile write: Operations before and after such an attempted fence might migrate to the inside of the fence and occur in arbitrary order. The earlier fence in the messaging example required not only volatile writes and reads, but also that the receiver looped on the read.
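In code, the scenario on the left of Figure 2 corresponds to something like this (illustrative names):

```csharp
static volatile int X = 0, Y = 0;

static void Thread1() {
    X = 1;                  // volatile write (release)
    int r1 = Y;             // volatile read (acquire): may still return 0
    Console.WriteLine(r1);
}

static void Thread2() {
    Y = 1;
    int r2 = X;             // may also return 0: nothing stops a read from
    Console.WriteLine(r2);  // migrating above its own thread's preceding write
}
// Legal outcome under release consistency: both threads print 0.
```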
The message-passing idiom appears frequently in multithreaded code, though you might not always think of it as message passing. For example, data structures frequently have multiple readers and writers. If one thread writes to the structure, and another thread reads it, they have passed a message. So when writing such code, be sure to think about what the equivalent of the "Ready" flag is. It is not necessarily a simple flag, but could be a counter or pointer that is updated. The tip-off is that it should be the last thing written by the sender and first thing read by the receiver.
Locks
Most programs use higher level synchronization constructs than raw reads and writes. The .NET model of release consistency associates these constructs with implied volatile operations. In particular, acquisition of a lock implies a volatile read, and release of a lock implies a volatile write. These implied volatile operations usually permit omission of explicit volatile declarations, but not always. For example, Listing One (available electronically; see "Resource Center," page 5) implements a shared mailbox as a linked list. All updates to the list are performed under a lock, so that at any time, only one thread is updating the list. Illustrating why no volatile declarations are necessary, Figure 3 shows the mutual view of the events involved in transmitting a single object. Each acquisition/release of a lock forms a diode cage that holds the necessary operations within. By definition of a lock, the cages cannot overlap. Thus, all observers, including sender/receiver, agree that updates to the linked list are done in an appropriate order.
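Since Listing One appears only electronically, here is a sketch in its spirit (my reconstruction, not the original code):

```csharp
// Hypothetical lock-based mailbox as a linked list, in the spirit of Listing One.
class Letter {
    public string Content;       // written by the sender before posting
    public Letter Next;          // link, maintained only while holding the lock
}

class Mailbox {
    private Letter head;                          // front of the linked list
    private readonly object gate = new object();  // guards all list updates

    public void Put(Letter letter) {
        lock (gate) {            // acquiring the lock implies a volatile read
            letter.Next = head;
            head = letter;
        }                        // releasing the lock implies a volatile write
    }

    public Letter Get() {
        lock (gate) {
            Letter letter = head;
            if (letter != null) head = letter.Next;
            return letter;
        }
    }
}
```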
Like a trap, the diode cage keeps things in, not out. For example, the receiver could see the sender's unordered write of letter.Content as slipping into the cage. In the example, there is no harm in this because even if the write gets into the cage, it cannot get out the other side. The next section describes an important idiom requiring more care.
Double Check
"Double check" is an idiom that demonstrates the power of the .NET definition of volatile. Double check is an optimization of lazy initialization. Example 3 shows lazy initialization of a reference Z, and the double-check optimization. In the plain version, a lock is held while the value of Z is checked and possibly initialized. But locks can be slow, and the lock is not needed once Z is initialized. The double-check idiom first checks Z without locking. If Z is not initialized, it then acquires the lock and checks Z again before possibly initializing it. The second check is mandatory because another processor might initialize Z between the time of the first check and acquisition of the lock.
The message is "Z is initialized." The thread that initializes Z is the sender; all other threads are receivers. Figure 4 shows the sequence of operations that transmit the message. The "unordered writes to fields of a Foo" are those performed when "new Foo" executes. This points out a counterintuitive feature of release consistency that must be kept in mind: Unordered reads and writes inside a method call may appear to other threads as occurring outside the call!
The cage created by the lock makes it tempting, but alas incorrect, to leave out the volatile declaration of Z. Figure 5 diagrams the problem. If Z is not volatile, then the construction of Foo and write to Z appear unordered to the receiver. Or, from the sender's viewpoint, the receiver might read fields of the object before reading Z. The root problem is that the handshake through a nonvolatile Z occurs between unordered memory operations. This example underscores the point that whenever a message is sent between threads, the sender should finish the message with a volatile write (possibly implied by releasing a lock), and the receiver should begin the message with a volatile read (possibly implied by acquiring a lock).
Adventurous souls who are also rigorously careful can sometimes bend this rule a bit. If there is a volatile write (possibly implied by a lock) at the end of the constructor for new Foo(), then there's no need to declare Z as volatile.
Java has a relaxed memory model that breaks the double-check idiom because it does not give enough control over the order of memory operations. See "The 'Double-Checked Locking is Broken' Declaration" (http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html) for an analysis of the problem, and how Java might be fixed by adopting the acquire/release interpretation of volatile.
Atomic Operations
Clever programmers sometimes avoid locking by using atomic operations. Atomic operations behave as if they happen instantaneously. Typically, they are directly implemented by processors, or take just a few instructions, and so can be much faster than locks in principle. Whether this is true in practice depends upon many factors, including contention and how well locks are implemented compared to atomic operations. The .NET class System.Threading.Interlocked provides the atomic operations in Table 1.
Listing Two (also available electronically) is a reimplementation of the mailbox example using CompareExchange. Note the use of method Thread.MemoryBarrier. Though not in Microsoft's original .NET run time, it will be in the next release and is part of the ECMA Standard for .NET. It is a bidirectional fence that is crucial to portability because, according to the ECMA Standard, the atomic operations do not perform volatile operations; that is, they do not have acquire/release semantics. Full fences, and not just acquire/release, are required here to prevent the reads/writes of the field Next from migrating downward or the atomic operations from migrating upward. Either could entail disaster. Declaring the field Next as volatile is not a valid alternative: It's not the last location written by the sender or read by the receiver. If you draw a diagram, you see that the diode for Next will point the wrong way.
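Listing Two is also available only electronically; a sketch in its spirit (my reconstruction, reusing the Letter class from the earlier mailbox sketch) shows where the fences go:

```csharp
using System.Threading;

class LockFreeMailbox {
    private Letter head;   // updated only through atomic operations

    public void Put(Letter letter) {
        Letter oldHead;
        do {
            oldHead = head;
            letter.Next = oldHead;     // ordinary write to Next
            Thread.MemoryBarrier();    // full fence: the write to Next cannot migrate
                                       // down past the CAS, nor the CAS migrate up
        } while (Interlocked.CompareExchange(ref head, letter, oldHead) != oldHead);
    }

    public Letter TakeAll() {
        // Detach the entire list atomically, then fence before touching Next fields.
        Letter list = Interlocked.Exchange(ref head, null);
        Thread.MemoryBarrier();        // reads of Next cannot migrate up past the
        return list;                   // atomic operation
    }
}
```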
Conclusion
Release consistency enables portability because commercial processors are at least as strict as it is about ordering memory operations. Intel's Itanium supports acquire/release in its full glory. Alpha, PowerPC, Sparc, and IA-32 provide "fence" instructions that prevent migration of reads and/or writes over the fence. Often, multiple flavors of fences are offered for programmers who really want to nitpick over which orderings must be enforced.
Unless you are working in assembly language, you need a compiler that plays along. For .NET, the acquire/release semantics of volatile locations are standard. Otherwise, you are stuck using processor-specific assembly code. Even on Itanium, where C/C++ compilers imbue volatile loads/stores with acquire/release semantics, you are still stuck because the optimizer might move nonvolatile loads/stores across volatile loads/stores. In most cases, the best approach is simply to use higher level synchronization primitives such as locks, which typically invoke the processor-specific instructions where necessary. As long as your code makes no assumptions stronger than the diode rules for locking, it should run correctly on most platforms.
Acknowledgments
Rick Hudson, Gururaj Nagendra, John Pieper, and Arch G. Robison commented on earlier drafts.
DDJ