Application-Level Abstractions for Lock-Free Data Sharing

No one ever said that programming for concurrency would be easy


December 21, 2007
URL:http://drdobbs.com/cpp/application-level-abstractions-for-lock-/205200452

Bill is a former Senior Technical Staff Member with IBM who has worked on assignments ranging from software development for a 7090 Emulator for early 360 mainframes to software architectures for mass store systems for NASA. He can be contacted at [email protected].


Concurrent programming is hard. The key concern for application developers is data sharing. Design, analysis, code, verification, test, debug, performance, and portability are all major issues, but most significant of all are concerns for program correctness.

Basic approaches to sharing data across concurrently running tasks have traditionally been based either on locking of data elements, to provide single-task access to them, or else on transactions that incorporate such locks. (Here I use the term "task" to encompass different forms of asynchronous processing, such as threads, processes, or even executions on remote processors.)

An alternative to the use of locks is lock-free data sharing, otherwise known as "wait-free data sharing." Here, applications provide a set of potential updates to change state data from one externally consistent view to another consistent view. In other words, there is no inconsistent data that is visible to allow race conditions. When the updates are completed, an attempt is made to install either all or none of them. If this fails because other parallel task have changed input assumptions, the application processing must be retried. Of course, nothing is for free here, and there is a trade-off between never having to wait, and having to later repeat the update processing.

Lock-free processing has many advantages but is fundamentally more difficult to implement. In this article, I outline these advantages and propose application-level approaches that avoid these difficulties. In fact, these higher level approaches appear to be even easier to use than lock-based ones, both because they provide explicit specification of requirements for reasoning about correctness and because they avoid complex interactions among competing tasks.

The discussion here addresses three levels of support:

In this article, I introduce three application-level abstractions for shared data:

My goals here, in order of importance, are:

Basics of Lock-free Programming

One way to allow multi-task sharing of a referenced object is to use locks, where only one task at a time is granted access to a shared resource. Other tasks wait until the lock is released. A "resource" here is primarily a data object, but it can encompass other entities such as files, ports, specialized processors, data stores, user interfaces, and the like.

Lock-free approaches, on the other hand, are based on hardware synchronization primitives that are non-blocking, so that rather than waiting until a lock is available, tasks run optimistically on the assumption that there is no contention. When a synchronization point is reached, the code then checks for potential conflicts arising from concurrent updates of the shared data by other tasks. In the face of conflicts, the processing is repeated.

Tradeoffs

Lock-free approaches are similar to optimistic locking policies, where locks are only checked at the end of a transaction. Optimistic approaches to data sharing are most useful in low-contention environments. Such policies tend to scale well with large numbers of processors and short execution tasks, where the ability to introduce more concurrency without locking overhead and delays, outweighs the overhead of transaction restarts due to contention.

When the overhead of reprocessing is not excessive, lock-free approaches have significant other advantages in that they also avoid issues such as:

Complexity

In The Concurrency Revolution, Herb Sutter comments:

But, concurrent lock-free programming is extremely hard for programmers to understand and reason about. Most of the time, only systems and library writers should have to understand lock-free programming, although virtually everybody should be able to take advantage of the lock-free systems and libraries those people produce.

The approaches I outline here can be implemented with basic C++ language support for threads along with libraries using lock-free programming techniques, and thereby provide a considerable degree of simplicity for typical application developers.

Policies

A library implementation would provide the basics for the support described here. However, there are many variations that could be adapted to specific needs of a variety of applications. Hence a "policy-based" implementation lets concept users adapt the implementation to specific goals and strategies.

In particular, adaptability and sample classes are needed for a Shared Resource Manager with polices for a Resource Access Wrapper, a Controller that provides the algorithms used by the Shared Resource Manager, and an Allocator with memory control strategies. Additional application-level policies can support such common access requirements as data location and transformations, security control, and error handling and reporting.

Basic Implementation Support

Basic support for the lock-free programming approaches suggested here involves:

Environment Support for Lock-free Programming

Basic synchronization primitives require:

As multi-tasking becomes more common across systems with a variety of memory caches and physical types, it will be necessary to specify additional characteristics of shared object storage. For instance:

The approaches I outline here provide for application specification of the precise data needed to be released and acquired. Not only does this allow more efficient operation on platforms that can exploit this information, but it also provides the specifications needed for reasoning about the correctness of the code.

Shared Resource Manager

The Shared Resource Manager maintains control and visibility to resources that are in contention. It is where the essential support resides for the algorithms necessary for the abstractions discussed here.

Fundamental algorithms for lock-free approaches typically process a sorted list of atomic operations. If processing of this list is interrupted, then other tasks, on finding a conflict, cause one of the tasks to fail and retry, while the other task continues processing.

The typical shared resource is state data shared by separate tasks, but it can include other system data such as files, ports, data stores, devices, and the like.

Allocator

Shared memory may have special allocation requirements based on performance considerations. These can be configured and reported by Synchronization Elements. For specialized memory systems, the Allocator can set and use appropriate memory options and operations. For cached memories, the Allocator can be concerned with such issues as locality of reference, cache line usage and aliasing, and cache look-aside translation buffers.

All instances of data versions that are current, shared or used exclusively need to be tracked. Only when they can no longer be referenced can they be freed and reallocated. This is critical, for instance, in data list structures to avoid an ABA problem.

A requirement for an Allocator, in some situations, is to insure that memory is freed by the same task that allocated it.

ABA Problems

One of the more difficult lock-free programming issues is the so-called "ABA problem." This occurs typically for list structures, where applications check a pointer, on the assumption that if the pointer has not changed, then the referenced data is still the correct reference. For the ABA problem to occur suppose location A is referenced as a next element in the list. If the object at A is deleted from the list, the next pointer may be set to another object at location B and the object at A freed. Later still, the memory at A can be reallocated to a new object that could be inserted at A's previous list position. A typical test that the next element in the list is still the same one, might check that the next list reference is still location A. This is valid only if location A cannot be freed and then reallocated, while the reference exists.

Systems with garbage collection can sometimes avoid ABA difficulties since referenced data cannot be freed and reallocated. However, garbage collection by itself does not handle circular references, which occur in complex data structures that may need to be shared.

One approach is to maintain a change counter that is updated whenever the reference is updated. If both the pointer and the count have not changed, then there has been no intervening update. In some environments, however, it is not possible to atomically update both the counter and the reference directly. For one approach to solving this see Maged M. Michael's paper Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects which presents an approach there keeps track of all tasks referencing an object and prevents reuse of the location if any references still exist. Simpler, as for other shared pointers, would be to just maintain a count of references. A list of references, though, has the advantage that there is the possibility for clean up, if a task fails.

Synchronization Elements

All this suggests that there be mechanisms to configure objects with storage characteristics that can be determined at least at run time. Here I suggest the concept of "Synchronization Elements". Shared data objects can inherit from them and obtain the required interface properties. Synchronization Elements may be as simple as pointers or array indices. Typically they are the subject of CAS operations, but they can be used with other atomic operations.

Synchronization Elements provide an interface that returns a list of data areas, including the Synchronization Element itself and all data areas that can be referenced directly or indirectly through values in the Synchronization Element. This list can be configured at run time. The elements in the list provide information about their referenced storage area, including its address and length, and storage properties. These properties can be environment specific and may need to be configured with environment specific adaptation code.

This interface is used by internal routines to determine what areas of memory need to be released and acquired, or sent and received, from an updating task to receiving tasks. As cache sizes become larger, the ability for memory barriers to release and acquire only required storage can provide significant performance gains. Such support is already available on specialized processors. Also, in many cases, it should be possible for the task acquiring the data to determine its source and then exercise the appropriate protocols to access it.

Language Support

The compiler also needs to know which data elements are not stable across synchronization points. In a C++ compiler, this could be effected with some extension of the concepts inherent in the volatile keyword. (An appropriate replacement might be called "synch".) Routines that establish a synchronization point can have some sort of volatile keyword in their declaration also. Such an explicit declaration of synchronization points requires fewer of them than in other proposals. Also, with care, these declarations would typically be isolated in a few low-level application functions that basically implement the application-specific shared data model.

Object Accessor

At the application level, it is useful to tailor an Object Accessor that encapsulates a proxy for access to shared objects. This can simplify application interfaces, and it also ensures that appropriate access protocols are uniformly used. In some -- but not all -- cases, application code at higher levels can be largely data-sharing agnostic. In addition, an Accessor can be extended to meet particular application interface requirements, such as data transformations.

Shared and Exclusive Use Data

The first application-level abstraction here is based on avoiding sharing conflicts by limiting application access to only data that is guaranteed not to have update conflicts (sometimes known as "thread local" data). In particular this avoids complexities in implementation of "thread safe" objects.

This approach is based on the use of a type of "smart" or encapsulated pointers of two kinds:

A synchronization point lets the updated instance be installed as a new current version, and made available for shared use, if the referenced base has not been changed during the update processing. Otherwise, the application must retry its update of the referenced data.

Shared and Exclusive Use Pointers are proxy objects based on the usual extended pointer semantics.

The use of these pointers is an extension of an approach presented in Lock-Free Data Structures, by Andrei Alexandrescu, and Lock-Free Data Structures with Hazard Pointers, by Andrei Alexandrescu and Maged Michael.

The development of this concept here is in three stages:

Single Access to a Single Object

Single access support is useful when consistent state data for an application is all maintained in a single object. An example might be an object that reflects the status of an external entity, which is subject to updates, but which always provides a consistent point-in-time view of the object.

A Shared Resource Manager and an application Object Accessor provide the basic support here.

The implementation here is supplied by a Shared Resource Manager, which owns all the shared data, which is responsible for maintaining a Current Version pointer for each object, and which tracks multiple instances of objects returned to clients.

In that the Manager is responsible for the constructing, copying and freeing of an object, it essentially requires an object lifetime controller similar to the STL notion of an Allocator which allocates and de-allocates space, and which also constructs and destroys the allocated objects. This "Allocator" can be supplied as a policy to the Manager.

The Manager tracks and identifies referenced data instances as:

All tasks using the same shared data need to reference the data through the same Manager. However, sharing of different data can be controlled by different Managers, with possibly different policies. At the application level, code and data declarations for use by a particular Manager need to be identified in design documents and typically packaged together in the code.

Typically an application would provide an Object Accessor which would be responsible for:

Typical use of the Object Accessor by an updater might look somewhat like:

// Create an Accessor for a Shared Resource Manager   
Library::XUsePtr dataPtr = accessor->getXUsePtr( ) ; // initialize pointer 
while ( ! dataPtr.isInstalled( )  )
{
   dataPtr->updateProcessing (  )  ; // Process data    
   dataPtr.install ( )             ; // Attempt to install updates        
}                                                                                      ;

Single Access to a Collection of Objects

The previous section described shared access to a single object, which could be an entire Collection. Providing Shared and Exclusive Use Pointers to just members of a Collection requires further refinement.

Shared Collections are based on a Map or Index with unique primary keys or index values that reference a Data Object through a Control Element. For each primary key, the Control Element tracks the Current Version, any shared instances, and any update candidates.

Here the Shared Resource Manager is extended to manage either all the primary keys or just those that are being shared.

To manage all the entire collection, the Manager can internally maintain a Map for the Data Objects. However, in many applications with large Collections, only a few members are ever shared at any one time. Here the applications can maintain local copies of the actual Collection, with the Shared Resource Manager maintaining only instances currently being accessed. This is similar to a cache concept for current or recently accessed objects.

The local application collections can take many forms with the principle requirement being that the shared elements are identified by a unique primary key.

The Object Accessor becomes a Collection Adapter. This Adapter can access local collections, and it can create and initialize an internal Map. When accessed, the objects are tracked by the Shared Resource Manager.

Shared Use Pointers essentially serve as iterators for the Collection Adapters. These iterators can be forward, bidirectional or random, as determined by their underlying Collection capabilities.

A Shared Use Pointer can be used to iterate through the local Collection to find interesting Objects. It can then be converted by the application to an Exclusive Use Pointer in order to gain write access to the object.

To add objects to the Collection, an Exclusive Use Pointer is first obtained for the new primary key.

Federated Collections are useful here to provide more than one access path to the primary key of the object to be referenced. Federated Collections report additions, deletions and modifications to other Collections. Here the other collections could also be updated through lock-free techniques as described in the following sections.

Multiple Accesses

So far our discussion has related only to a task updating a single object at a time. Updating multiple objects at the same time requires the notion of a transaction manager that has the property that, when transactions are committed, either all the updates take effect, or none do.

Here, the client starts a transaction, synchronizes a set of Shared and Exclusive Use Pointers for the transaction, and uses the Exclusive Use Pointers to create and modify shared objects. At the end of the transaction, commit processing attempts to install the set of updates.

Synchronization processing occurs at two points:

Synchronization processing to obtain a set of consistent data objects occurs in two phases:

Synchronization processing for the commit processing to install a set of consistent data objects occurs in four phases.

Choosing a transaction to fail can be based on criteria such as priority. Alternatively it may be useful to always advance the oldest transaction. This latter approach improves worst case responsiveness. More importantly though, it insures that the system always makes progress and does not start thrashing with a cycle of transactions that alternately contend for common resources and, thereby, keep failing each other.

Direct Data Synchronization

The above discussed techniques where an application is given sole access to data through a reference, so that most of the code has little need to worry about data update contention. In many cases, however, it is useful to have a more general method to deal with shared data directly, rather than through references. A simple example of this is an addition of a new element to a doubly linked list, where both forward and backward pointers to the new element need to be set simultaneously.

The second abstraction then involves direct update of shared data, rather than through intermediate pointers.

Approaches

Reasoning about resource contention issues is difficult. The usual approaches for reasoning about contention require analysis of all possible multiple orderings of state data access and change relations between contending tasks. Here we discuss a higher level abstraction that can be based on "test and set" notions. My analysis here is just of static states and requires only an examination of static data consistency, rather than of the dynamics of multiple access and store orderings that achieve the consistent state.

The idea here is that reasoning about static consistency is far simpler than the complexity of reasoning about multiple possible sequences of data addition, deletion, change and access.

When it comes to Dynamic Synchronization of State Data Changes, a technical discussion of the requirements for correct programs based on an underlying memory model for C++ can be found in papers by Hans Boehm and Clark Nelson (see C++ Standards Committee notes JTC1/SC22/WG21 - N2171, N2239, N2300). These considerations are necessary for language designers and library authors. The abstraction there is based on reasoning about actions that change state data in terms of "happens before" and "happens after" relationships.

An approach to simplify this somewhat is given by Herb Sutter in ISO/IEC JTC 1/SC 22/WG 21 N2197 = ANSI/INCITS J16 07-0057, where several principles and rules are developed. The approach here is closer to what Sutter suggests.

However, discussions such as these are meant for language experts, compiler writers, and systems designers. A higher level abstraction, that can be built upon these concepts, is needed for application use.

In terms of Static Synchronization of State Data Changes, the model for consistency relies on a view of program transitions from one state to another through synchronization points. Program correctness here relies on verifying assertions that all externally visible state date is consistent and valid, before and after each synchronization point. The underlying library then implements the correct dynamic ordering operations to achieve these consistent states.

The support suggested here comes in two parts:

Extended CAS for Single Updates

Often it is useful to perform an atomic compare-and-swap on a contiguous set of data that is wider than the underlying system support for direct CAS instructions. Especially for some lock-free algorithms, it is necessary to atomically change a control element which contains pointers, counters, and flags. For example, to avoid the ABA problems cited above for a recycled reference, a change counter can be stored and incremented with the reference. Any update to the reference must also increment the counter.

Hence, the extended CAS suggested here is designed to compare-and-swap a Synchronization Element which can be an object of any width. To accomplish this, the compiler can either generate the CAS operation directly, or it can use an algorithm that maintains a pointer or other data reference to indirectly manage atomic update of the Synchronization Element. This also provides programs with cross-platform support that is not dependent on a particular processor instruction set.

A second extension is a Compare-and-Swap-Conditional. Here the data swap occurs only if both a simple Boolean value is True and the referenced data items compare equally. This operation is common in several lock-free algorithms.

Also, it is necessary that appropriate memory barrier instructions be issued based on values derived from the Synchronization Element.

Operations List for Multiple Updates

As for Shared and Exclusive Use Pointers, it is often necessary to perform a set of simultaneous updates transactionally; that is atomically in the sense that all or none of the operations are performed. This provides an important property of lock-free approaches; i.e., they are composable in the sense that an atomic operation can consist of a set of lower level atomic operations. In particular, the Install of an Exclusive Use Pointer can be included in the synchronized operations list.

This operation can be provided through a Synchronize function operating on a list of Extended CAS and similar atomic operations. Atomic operations here are a small library of basic expressions that can often be supported directly on platforms with appropriate processor instructions. The Synchronize performs an atomic transaction where:

In addition to consistency of multiple data changes, it is often necessary that the synchronization is provided to ensure a consistent set of inputs before processing starts. Consistency here requires that:

This is often sufficient, since, in a changing world, the only guarantee that can be made about the results is that they were valid at some point in time.

Consistency checks of both inputs and outputs at the synchronization point would intuitively seem appropriate. However, this can be less efficient. Also it provides little additional benefit, since it only raises the question of how more useful the result would be, if the input assumptions change anyway immediately following the commit operation.

Input synchronization can be provided through a GetConsistent function which takes a list of data objects and returns a proxy or proxies for further access to the objects.

As it did for the Use Pointer transactions above, implementation requires multi-pass operations for both input and result consistency checks. Here a Shared Resource Manager is needed to track all Synchronization Elements, counters, and flags that are in contention.

Input synchronization occurs in two phases:

Commit operation is as follows:

Access to a shared object which is marked as "claimed" involves the following:

Again various policies can be used here to decide upon which task to fail, but with due regard to preventing system thrashing.

Work Queuing

The third abstraction has to do with moving the processing to another task. Typically this is through a queue. Work queuing is used for many reasons. Two possibilities that apply to simplifying requirements for data sharing are:

These considerations apply to sharing of other types of resources, in addition to sharing of data.

This abstraction differs from the previous two in that it involves not only supporting library services, but also the frameworks in which applications run.

Support for Tasking Frameworks

There are several scenarios in which work might be transferred from one task to another.

This can be further developed by having tasks that serve as both clients to some tasks and servers to others.

Queue Manager

Note that these services can imply a queue manager with such auxiliary functions as:

Supporting Services

The work queuing abstraction depends upon two other more basic abstractions; i.e., send/receive services. These services in turn require some wait/event dispatching mechanism.

Work queuing can occur with a variety of short and long processes, of process resource use, and of local and remote processors. A particular challenge is to find hardware and software solutions that efficiently allow queuing of very many very short processes found in typical applications. A simple but powerful example would be to split processing in typical loops, where the iterations are independent of each other, among multiple cores on a single chip, while bypassing stores into main memory.

Work Elements

All cooperating tasks (and a queue manager) need to agree on some standardization of Work Elements.

This can be as sophisticated as SOAP based protocols. A simple approach would provide the following minimal areas for a Work Element:

Exception Handling

Standardizing approaches to multi-tasking generally involves significant discussion about how uncaught exceptions are handled in support tasks and propagated to initiating tasks.

Here the following principles seem applicable:

Of course, for some environments, this distinction between application and system levels may not be completely intuitive and needs to be considered carefully.

Based on this:

Summary

The abstractions above are conceptually independent but in practice they can be used together:

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.