Dr. Dobb's | Application-Level Abstractions for Lock-Free Data Sharing

Application-Level Abstractions for Lock-Free Data Sharing

No one ever said that programming for concurrency would be easy

December 21, 2007
URL:http://drdobbs.com/cpp/application-level-abstractions-for-lock-/205200452

Bill is a former Senior Technical Staff Member with IBM who has worked on assignments ranging from software development for a 7090 Emulator for early 360 mainframes to software architectures for mass store systems for NASA. He can be contacted at [email protected].

Concurrent programming is hard. The key concern for application developers is data sharing. Design, analysis, code, verification, test, debug, performance, and portability are all major issues, but most significant of all are concerns for program correctness.

Basic approaches to sharing data across concurrently running tasks have traditionally been based either on locking of data elements, to provide single-task access to them, or else on transactions that incorporate such locks. (Here I use the term "task" to encompass different forms of asynchronous processing, such as threads, processes, or even executions on remote processors.)

An alternative to the use of locks is lock-free data sharing, otherwise known as "wait-free data sharing." Here, applications provide a set of potential updates to change state data from one externally consistent view to another consistent view. In other words, there is no inconsistent data that is visible to allow race conditions. When the updates are completed, an attempt is made to install either all or none of them. If this fails because other parallel task have changed input assumptions, the application processing must be retried. Of course, nothing is for free here, and there is a trade-off between never having to wait, and having to later repeat the update processing.

Lock-free processing has many advantages but is fundamentally more difficult to implement. In this article, I outline these advantages and propose application-level approaches that avoid these difficulties. In fact, these higher level approaches appear to be even easier to use than lock-based ones, both because they provide explicit specification of requirements for reasoning about correctness and because they avoid complex interactions among competing tasks.

The discussion here addresses three levels of support:

A description of the basic support environment necessary for lock-free algorithms.
A description of a set of application-level abstractions for data sharing. These abstractions can also be supported with lock-based algorithms, where the environment support is insufficient, but they are far more powerful when implemented with lock-free algorithms.
An outline of the library support needed to provide a bridge that implements the application-level abstractions using the basic support level.

In this article, I introduce three application-level abstractions for shared data:

Avoidance is provided by a type of "smart" pointer abstraction that assures only single task access to shared data,
Synchronization is provided primarily by a extension of the compare-and-swap (CAS) idiom.
Moving the problem is provided for by a work queuing abstraction, which in turn is based on simple send/receive abstractions.

My goals here, in order of importance, are:

Correctness. The key to correctness is the ability to analyze the points of contention and provide assurance that the functioning code is correct. In addition to the usual requirements, "correct" here implies that only valid or consistent state data is operated on, that there are no race conditions or deadlocks, and that all possible time orderings of data updates provide valid results. While it is impossible to insure correctness of code, it is possible to provide abstractions that greatly facilitate reasoning about its validity.
Usability. The usability intent here is to provide a set of abstractions that can be implemented in libraries by experts, then readily used by experienced programmers with some confidence of correctness.
Performance. The primary performance intent is to provide abstractions that can be implemented with "lock-free" programming techniques. A secondary performance concern, only partially dealt with here, is the effective use of shared memory and cache resources.
Completeness. The completeness intent I present here can simply and adequately satisfy basically all common scenarios for application use.

Basics of Lock-free Programming

One way to allow multi-task sharing of a referenced object is to use locks, where only one task at a time is granted access to a shared resource. Other tasks wait until the lock is released. A "resource" here is primarily a data object, but it can encompass other entities such as files, ports, specialized processors, data stores, user interfaces, and the like.

Lock-free approaches, on the other hand, are based on hardware synchronization primitives that are non-blocking, so that rather than waiting until a lock is available, tasks run optimistically on the assumption that there is no contention. When a synchronization point is reached, the code then checks for potential conflicts arising from concurrent updates of the shared data by other tasks. In the face of conflicts, the processing is repeated.

Tradeoffs

Lock-free approaches are similar to optimistic locking policies, where locks are only checked at the end of a transaction. Optimistic approaches to data sharing are most useful in low-contention environments. Such policies tend to scale well with large numbers of processors and short execution tasks, where the ability to introduce more concurrency without locking overhead and delays, outweighs the overhead of transaction restarts due to contention.

When the overhead of reprocessing is not excessive, lock-free approaches have significant other advantages in that they also avoid issues such as:

Deadlock which occurs when one task holding a resource requests a resource held by a second task which is in turn waiting on a resource held by the first task to be freed.
Priority Inversion which occurs when high-priority lock requesters wait for lower priority requesters.
Stalls which occur when tasks wait on tasks that are not running or may even have failed.
Convoying which occurs when multiple tasks wait in turn for a lock and then wind up running literally in lock step, and, therefore, serially.

Complexity

In The Concurrency Revolution, Herb Sutter comments:

But, concurrent lock-free programming is extremely hard for programmers to understand and reason about. Most of the time, only systems and library writers should have to understand lock-free programming, although virtually everybody should be able to take advantage of the lock-free systems and libraries those people produce.

The approaches I outline here can be implemented with basic C++ language support for threads along with libraries using lock-free programming techniques, and thereby provide a considerable degree of simplicity for typical application developers.

Policies

A library implementation would provide the basics for the support described here. However, there are many variations that could be adapted to specific needs of a variety of applications. Hence a "policy-based" implementation lets concept users adapt the implementation to specific goals and strategies.

In particular, adaptability and sample classes are needed for a Shared Resource Manager with polices for a Resource Access Wrapper, a Controller that provides the algorithms used by the Shared Resource Manager, and an Allocator with memory control strategies. Additional application-level policies can support such common access requirements as data location and transformations, security control, and error handling and reporting.

Basic Implementation Support

Basic support for the lock-free programming approaches suggested here involves:

Environment support
A Shared Resource Manager to maintain versions of current, in use and potentially updated instances of shared data.
An Allocator to manage instances of application data as well as internal control elements.
Synchronization Elements to report storage characteristics for data sharing.
An application-level Object Accessor.

Environment Support for Lock-free Programming

Basic synchronization primitives require:

Atomic operations involving some concept of an atomic test-and-set, where data is set only if the test is true, and where there can be no intervening operations to change the data between the test and the set. Typical machine operations are compare-and-swap (CAS) as well as simpler operations such as counter increment or decrement.
Memory barriers that ensure all relevant data changes are visible to other machine processors that can observe results of a successful test and set. In systems with cache memory, this requires a "release" by the modifying processor to flush changed data to main memory, and an "acquire" by an observing processor, to typically invalidate cache pages so that they will be refetched on the next access.

As multi-tasking becomes more common across systems with a variety of memory caches and physical types, it will be necessary to specify additional characteristics of shared object storage. For instance:

Data, that is accessible to multiple tasks running on processors on the same chip, does not need to be flushed through main memory.
Memory release and acquire operations do not necessarily have to flush or invalidate all memory currently cached by a task.
Memory may be conceptually shared with remote processors through communication links.

The approaches I outline here provide for application specification of the precise data needed to be released and acquired. Not only does this allow more efficient operation on platforms that can exploit this information, but it also provides the specifications needed for reasoning about the correctness of the code.

Shared Resource Manager

The Shared Resource Manager maintains control and visibility to resources that are in contention. It is where the essential support resides for the algorithms necessary for the abstractions discussed here.

Fundamental algorithms for lock-free approaches typically process a sorted list of atomic operations. If processing of this list is interrupted, then other tasks, on finding a conflict, cause one of the tasks to fail and retry, while the other task continues processing.

The typical shared resource is state data shared by separate tasks, but it can include other system data such as files, ports, data stores, devices, and the like.

Allocator

Shared memory may have special allocation requirements based on performance considerations. These can be configured and reported by Synchronization Elements. For specialized memory systems, the Allocator can set and use appropriate memory options and operations. For cached memories, the Allocator can be concerned with such issues as locality of reference, cache line usage and aliasing, and cache look-aside translation buffers.

All instances of data versions that are current, shared or used exclusively need to be tracked. Only when they can no longer be referenced can they be freed and reallocated. This is critical, for instance, in data list structures to avoid an ABA problem.

A requirement for an Allocator, in some situations, is to insure that memory is freed by the same task that allocated it.

ABA Problems

One of the more difficult lock-free programming issues is the so-called "ABA problem." This occurs typically for list structures, where applications check a pointer, on the assumption that if the pointer has not changed, then the referenced data is still the correct reference. For the ABA problem to occur suppose location A is referenced as a next element in the list. If the object at A is deleted from the list, the next pointer may be set to another object at location B and the object at A freed. Later still, the memory at A can be reallocated to a new object that could be inserted at A's previous list position. A typical test that the next element in the list is still the same one, might check that the next list reference is still location A. This is valid only if location A cannot be freed and then reallocated, while the reference exists.

Systems with garbage collection can sometimes avoid ABA difficulties since referenced data cannot be freed and reallocated. However, garbage collection by itself does not handle circular references, which occur in complex data structures that may need to be shared.

One approach is to maintain a change counter that is updated whenever the reference is updated. If both the pointer and the count have not changed, then there has been no intervening update. In some environments, however, it is not possible to atomically update both the counter and the reference directly. For one approach to solving this see Maged M. Michael's paper Hazard Pointers: Safe Memory Reclamation for Lock-Free Objects which presents an approach there keeps track of all tasks referencing an object and prevents reuse of the location if any references still exist. Simpler, as for other shared pointers, would be to just maintain a count of references. A list of references, though, has the advantage that there is the possibility for clean up, if a task fails.

Synchronization Elements

All this suggests that there be mechanisms to configure objects with storage characteristics that can be determined at least at run time. Here I suggest the concept of "Synchronization Elements". Shared data objects can inherit from them and obtain the required interface properties. Synchronization Elements may be as simple as pointers or array indices. Typically they are the subject of CAS operations, but they can be used with other atomic operations.

Synchronization Elements provide an interface that returns a list of data areas, including the Synchronization Element itself and all data areas that can be referenced directly or indirectly through values in the Synchronization Element. This list can be configured at run time. The elements in the list provide information about their referenced storage area, including its address and length, and storage properties. These properties can be environment specific and may need to be configured with environment specific adaptation code.

This interface is used by internal routines to determine what areas of memory need to be released and acquired, or sent and received, from an updating task to receiving tasks. As cache sizes become larger, the ability for memory barriers to release and acquire only required storage can provide significant performance gains. Such support is already available on specialized processors. Also, in many cases, it should be possible for the task acquiring the data to determine its source and then exercise the appropriate protocols to access it.

Language Support

The compiler also needs to know which data elements are not stable across synchronization points. In a C++ compiler, this could be effected with some extension of the concepts inherent in the volatile keyword. (An appropriate replacement might be called "synch".) Routines that establish a synchronization point can have some sort of volatile keyword in their declaration also. Such an explicit declaration of synchronization points requires fewer of them than in other proposals. Also, with care, these declarations would typically be isolated in a few low-level application functions that basically implement the application-specific shared data model.

Object Accessor

At the application level, it is useful to tailor an Object Accessor that encapsulates a proxy for access to shared objects. This can simplify application interfaces, and it also ensures that appropriate access protocols are uniformly used. In some -- but not all -- cases, application code at higher levels can be largely data-sharing agnostic. In addition, an Accessor can be extended to meet particular application interface requirements, such as data transformations.

Shared and Exclusive Use Data

The first application-level abstraction here is based on avoiding sharing conflicts by limiting application access to only data that is guaranteed not to have update conflicts (sometimes known as "thread local" data). In particular this avoids complexities in implementation of "thread safe" objects.

This approach is based on the use of a type of "smart" or encapsulated pointers of two kinds:

Shared Use Pointers have a constant reference to read-only data instances, and hence there can be no update conflicts. An updater might install a new instance of the object, but the readers can continue with their previous but consistent version of the data.
Exclusive Use Pointers reference data instances that are guaranteed to be accessible by only a single task at a time.

A synchronization point lets the updated instance be installed as a new current version, and made available for shared use, if the referenced base has not been changed during the update processing. Otherwise, the application must retry its update of the referenced data.

Shared and Exclusive Use Pointers are proxy objects based on the usual extended pointer semantics.

The use of these pointers is an extension of an approach presented in Lock-Free Data Structures, by Andrei Alexandrescu, and Lock-Free Data Structures with Hazard Pointers, by Andrei Alexandrescu and Maged Michael.

The development of this concept here is in three stages:

A single access to a single object.
Single access to an object from a collection.
Multiple accesses to different objects.

Single Access to a Single Object

Single access support is useful when consistent state data for an application is all maintained in a single object. An example might be an object that reflects the status of an external entity, which is subject to updates, but which always provides a consistent point-in-time view of the object.

A Shared Resource Manager and an application Object Accessor provide the basic support here.

The implementation here is supplied by a Shared Resource Manager, which owns all the shared data, which is responsible for maintaining a Current Version pointer for each object, and which tracks multiple instances of objects returned to clients.

For Shared Use Pointers, the Manager returns a reference to an instance of the then Current Version. An initial or default constructed instance is provided when no Current Version exists. The Manager tracks either the number of, or a list of references to, each shared instance and only allows release of the storage when there are no longer any references to it. Maintaining a lock-free list is more complex than just using a counter, but it can allow recovery, when the owner of reference fails and does not release it.
For Exclusive Use Pointers, the Manager creates a copy of the Current Version which can then be updated. To make the new version available to other tasks, the application attempts to Install the updated instance. If the Current Version has not changed, this succeeds; otherwise, the operation is flagged and the application is given a new instance of the then Current Version and must repeat the update processing.

In that the Manager is responsible for the constructing, copying and freeing of an object, it essentially requires an object lifetime controller similar to the STL notion of an Allocator which allocates and de-allocates space, and which also constructs and destroys the allocated objects. This "Allocator" can be supplied as a policy to the Manager.

The Manager tracks and identifies referenced data instances as:

Update candidates, until installed,
Current, when successfully installed,
In use, when there are current Shared Use references,
Retired, when no longer referenced.

All tasks using the same shared data need to reference the data through the same Manager. However, sharing of different data can be controlled by different Managers, with possibly different policies. At the application level, code and data declarations for use by a particular Manager need to be identified in design documents and typically packaged together in the code.

Typically an application would provide an Object Accessor which would be responsible for:

Initially defining and accessing an appropriate Shared Resource Manager.
Returning Shared or Exclusive Use Pointers supplied by the Manager.
Possibly initializing the referenced object on first use.
Supplying a destructor for the Pointers to release use of the referenced objects.
Providing an Install interface for clients of an Exclusive Use Pointer to register an updated object.

Typical use of the Object Accessor by an updater might look somewhat like:

// Create an Accessor for a Shared Resource Manager   
Library::XUsePtr dataPtr = accessor->getXUsePtr( ) ; // initialize pointer 
while ( ! dataPtr.isInstalled( )  )
{
   dataPtr->updateProcessing (  )  ; // Process data    
   dataPtr.install ( )             ; // Attempt to install updates        
}                                                                                      ;

Single Access to a Collection of Objects

The previous section described shared access to a single object, which could be an entire Collection. Providing Shared and Exclusive Use Pointers to just members of a Collection requires further refinement.

Shared Collections are based on a Map or Index with unique primary keys or index values that reference a Data Object through a Control Element. For each primary key, the Control Element tracks the Current Version, any shared instances, and any update candidates.

Here the Shared Resource Manager is extended to manage either all the primary keys or just those that are being shared.

To manage all the entire collection, the Manager can internally maintain a Map for the Data Objects. However, in many applications with large Collections, only a few members are ever shared at any one time. Here the applications can maintain local copies of the actual Collection, with the Shared Resource Manager maintaining only instances currently being accessed. This is similar to a cache concept for current or recently accessed objects.

The local application collections can take many forms with the principle requirement being that the shared elements are identified by a unique primary key.

The Object Accessor becomes a Collection Adapter. This Adapter can access local collections, and it can create and initialize an internal Map. When accessed, the objects are tracked by the Shared Resource Manager.

Shared Use Pointers essentially serve as iterators for the Collection Adapters. These iterators can be forward, bidirectional or random, as determined by their underlying Collection capabilities.

A Shared Use Pointer can be used to iterate through the local Collection to find interesting Objects. It can then be converted by the application to an Exclusive Use Pointer in order to gain write access to the object.

To add objects to the Collection, an Exclusive Use Pointer is first obtained for the new primary key.

Federated Collections are useful here to provide more than one access path to the primary key of the object to be referenced. Federated Collections report additions, deletions and modifications to other Collections. Here the other collections could also be updated through lock-free techniques as described in the following sections.

Multiple Accesses

So far our discussion has related only to a task updating a single object at a time. Updating multiple objects at the same time requires the notion of a transaction manager that has the property that, when transactions are committed, either all the updates take effect, or none do.

Here, the client starts a transaction, synchronizes a set of Shared and Exclusive Use Pointers for the transaction, and uses the Exclusive Use Pointers to create and modify shared objects. At the end of the transaction, commit processing attempts to install the set of updates.

Synchronization processing occurs at two points:

At the start of a transaction to obtain pointers to a set of consistent data objects.
At the end of a transaction to install a new consistent set of updated objects.

Synchronization processing to obtain a set of consistent data objects occurs in two phases:

A list is made of Current Version pointers for all the data objects requested.
The list is then scanned to see that the all the entries still reference their Current Version. If not, the process is repeated until a consistent set is found.

Synchronization processing for the commit processing to install a set of consistent data objects occurs in four phases.

A list is made with previous and proposed pointers for all data objects that have changes pending for the transaction. The list is ordered so that all transactions process common elements in the same order.
An attempt is made to claim ownership of each element in the list. If there is an element conflict with another transaction, one of the transactions is flagged as failed. If a transaction can claim all elements in the list and is not flagged, then it succeeds.
When the transaction has succeeded, proposed pointers are installed, either by the transaction that is successful, or by any other task that attempts to use the referenced objects.
When the transaction completes, it returns a flag (or throws an exception) to the synchronization request.

Choosing a transaction to fail can be based on criteria such as priority. Alternatively it may be useful to always advance the oldest transaction. This latter approach improves worst case responsiveness. More importantly though, it insures that the system always makes progress and does not start thrashing with a cycle of transactions that alternately contend for common resources and, thereby, keep failing each other.

Direct Data Synchronization

The above discussed techniques where an application is given sole access to data through a reference, so that most of the code has little need to worry about data update contention. In many cases, however, it is useful to have a more general method to deal with shared data directly, rather than through references. A simple example of this is an addition of a new element to a doubly linked list, where both forward and backward pointers to the new element need to be set simultaneously.

The second abstraction then involves direct update of shared data, rather than through intermediate pointers.

Approaches

Reasoning about resource contention issues is difficult. The usual approaches for reasoning about contention require analysis of all possible multiple orderings of state data access and change relations between contending tasks. Here we discuss a higher level abstraction that can be based on "test and set" notions. My analysis here is just of static states and requires only an examination of static data consistency, rather than of the dynamics of multiple access and store orderings that achieve the consistent state.

The idea here is that reasoning about static consistency is far simpler than the complexity of reasoning about multiple possible sequences of data addition, deletion, change and access.

When it comes to Dynamic Synchronization of State Data Changes, a technical discussion of the requirements for correct programs based on an underlying memory model for C++ can be found in papers by Hans Boehm and Clark Nelson (see C++ Standards Committee notes JTC1/SC22/WG21 - N2171, N2239, N2300). These considerations are necessary for language designers and library authors. The abstraction there is based on reasoning about actions that change state data in terms of "happens before" and "happens after" relationships.

An approach to simplify this somewhat is given by Herb Sutter in ISO/IEC JTC 1/SC 22/WG 21 N2197 = ANSI/INCITS J16 07-0057, where several principles and rules are developed. The approach here is closer to what Sutter suggests.

However, discussions such as these are meant for language experts, compiler writers, and systems designers. A higher level abstraction, that can be built upon these concepts, is needed for application use.

In terms of Static Synchronization of State Data Changes, the model for consistency relies on a view of program transitions from one state to another through synchronization points. Program correctness here relies on verifying assertions that all externally visible state date is consistent and valid, before and after each synchronization point. The underlying library then implements the correct dynamic ordering operations to achieve these consistent states.

The support suggested here comes in two parts:

An extended CAS operation for a Synchronization Element for single updates.
A list of CAS and other atomic operations for modification of multiple non-contiguous items.

Extended CAS for Single Updates

Often it is useful to perform an atomic compare-and-swap on a contiguous set of data that is wider than the underlying system support for direct CAS instructions. Especially for some lock-free algorithms, it is necessary to atomically change a control element which contains pointers, counters, and flags. For example, to avoid the ABA problems cited above for a recycled reference, a change counter can be stored and incremented with the reference. Any update to the reference must also increment the counter.

Hence, the extended CAS suggested here is designed to compare-and-swap a Synchronization Element which can be an object of any width. To accomplish this, the compiler can either generate the CAS operation directly, or it can use an algorithm that maintains a pointer or other data reference to indirectly manage atomic update of the Synchronization Element. This also provides programs with cross-platform support that is not dependent on a particular processor instruction set.

A second extension is a Compare-and-Swap-Conditional. Here the data swap occurs only if both a simple Boolean value is True and the referenced data items compare equally. This operation is common in several lock-free algorithms.

Also, it is necessary that appropriate memory barrier instructions be issued based on values derived from the Synchronization Element.

Operations List for Multiple Updates

As for Shared and Exclusive Use Pointers, it is often necessary to perform a set of simultaneous updates transactionally; that is atomically in the sense that all or none of the operations are performed. This provides an important property of lock-free approaches; i.e., they are composable in the sense that an atomic operation can consist of a set of lower level atomic operations. In particular, the Install of an Exclusive Use Pointer can be included in the synchronized operations list.

This operation can be provided through a Synchronize function operating on a list of Extended CAS and similar atomic operations. Atomic operations here are a small library of basic expressions that can often be supported directly on platforms with appropriate processor instructions. The Synchronize performs an atomic transaction where:

Either all the operations in the list are performed, or else none are. In the latter case, the Synchronize is flagged as failed.
None of the intermediate states are observable by other tasks.

In addition to consistency of multiple data changes, it is often necessary that the synchronization is provided to ensure a consistent set of inputs before processing starts. Consistency here requires that:

All input values are known to be consistent with each other at the time of the transaction's beginning.
Results of the computation are consistent with each other and with these inputs.

This is often sufficient, since, in a changing world, the only guarantee that can be made about the results is that they were valid at some point in time.

Consistency checks of both inputs and outputs at the synchronization point would intuitively seem appropriate. However, this can be less efficient. Also it provides little additional benefit, since it only raises the question of how more useful the result would be, if the input assumptions change anyway immediately following the commit operation.

Input synchronization can be provided through a GetConsistent function which takes a list of data objects and returns a proxy or proxies for further access to the objects.

As it did for the Use Pointer transactions above, implementation requires multi-pass operations for both input and result consistency checks. Here a Shared Resource Manager is needed to track all Synchronization Elements, counters, and flags that are in contention.

Input synchronization occurs in two phases:

A list is made of current values for the data objects requested.
The list is then scanned to see that no values have changed, otherwise the process is repeated until a consistent set is found.

Commit operation is as follows:

A first pass constructs an internal list and adds objects potentially in contention to the Shared Resource Manager.
The list is then sorted by some criteria, so that all transactions process common elements in the same order.
Next an attempt is made by the transaction to conditionally claim each item in the list. A claim here is just a flag and doesn't change the value of the object yet. If another transaction has already made a claim, this or the other transaction is marked as failed. However, if the other transaction is already marked as committed, then this transaction needs to fail.
If all the items can be claimed and the transaction has not been marked as failed, the transaction is marked as committed and all the atomic actions in the list can then be executed. These actions can be executed by the committing task, or, if the task is pre-empted, then by other tasks that require access to the shared objects.

Access to a shared object which is marked as "claimed" involves the following:

If the claiming transaction is marked as failed, use the old value.
If the claiming transaction is marked as committed, attempt the atomic action and, if successful, use the new value.
If the claiming transaction is still in the process of committing, continue the list processing of the claiming task until it is either committed or failed. Then proceed as above.
Finally, insure cleanup of the transaction, if this operation is for the last claim.

Again various policies can be used here to decide upon which task to fail, but with due regard to preventing system thrashing.

Work Queuing

The third abstraction has to do with moving the processing to another task. Typically this is through a queue. Work queuing is used for many reasons. Two possibilities that apply to simplifying requirements for data sharing are:

The data for the queued work can be isolated during the computation.
The processing can be moved to an environment where data is only accessed locally and by only one task at a time.

These considerations apply to sharing of other types of resources, in addition to sharing of data.

This abstraction differs from the previous two in that it involves not only supporting library services, but also the frameworks in which applications run.

Support for Tasking Frameworks

There are several scenarios in which work might be transferred from one task to another.

Server tasks are spawned as needed to run in parallel with a main task, which then asks for status when results are needed. For fine grained work, this can introduce considerable overhead.
A server task is event driven, and runs continuously or waits on a queue for work.
The client itself can be event driven and wait for responses from one or more work requests that have been farmed out.

This can be further developed by having tasks that serve as both clients to some tasks and servers to others.

Queue Manager

Note that these services can imply a queue manager with such auxiliary functions as:

Normalizing formats for work requests and responses. This can include marshalling and translation of request and response data elements to and from differing specific application formats and a common standard format.
Supporting queues that can be bounded or extensible; that are first-in-first-out or by priority
Managing a pool of server tasks and possibly the resources used by the pool members.
Supporting multiple queues to aid in distributing work to server tasks. Allocation of work can be based on a combination of:
- Balancing resources to distribute work to different classes of server tasks
- Maximizing response time for short work elements
Maintaining and reporting status and statistics.
Finally, the queuing support may be part of a larger workflow framework and the queue manager can support all of, or a portion of, a workflow specification

Supporting Services

The work queuing abstraction depends upon two other more basic abstractions; i.e., send/receive services. These services in turn require some wait/event dispatching mechanism.

Send/receive services can be policies that are primarily application specific.
Wait/event services can be policies that are primarily platform and environment specific.

Work queuing can occur with a variety of short and long processes, of process resource use, and of local and remote processors. A particular challenge is to find hardware and software solutions that efficiently allow queuing of very many very short processes found in typical applications. A simple but powerful example would be to split processing in typical loops, where the iterations are independent of each other, among multiple cores on a single chip, while bypassing stores into main memory.

Work Elements

All cooperating tasks (and a queue manager) need to agree on some standardization of Work Elements.

This can be as sophisticated as SOAP based protocols. A simple approach would provide the following minimal areas for a Work Element:

Control to maintain priority, status, and possible timeout parameters.
Request request type and reference to inputs.
Response completion status and reference to outputs.

Exception Handling

Standardizing approaches to multi-tasking generally involves significant discussion about how uncaught exceptions are handled in support tasks and propagated to initiating tasks.

Here the following principles seem applicable:

At the application level, it is useful to separate the raising of exception conditions from the handling of them without regard to layers of intermediate function calls.
At the system level, fundamental services need, at least as an option, to return error codes and status to callers rather than raising exceptions. This provides efficiency and allows critical application-level services to make appropriate choices for error handling, including recovery, propagation, and reporting.

Of course, for some environments, this distinction between application and system levels may not be completely intuitive and needs to be considered carefully.

Based on this:

Servers should be relatively simple and protect themselves from uncaught exceptions. Uncaught exceptions would then be an indication of overall system failure.
Servers should catch any exceptions from the work elements they dispatch and return status to the clients.
Work element status should include reporting of exception conditions
The client receiver of work responses in the requesting task can then decide to propagate an exception or simply return a status code. It can also provide immediate reporting of the condition.

Summary

The abstractions above are conceptually independent but in practice they can be used together:

Work queuing sets up the environment and determines requirements for data sharing.
Shared Use Pointers provide constrained access to synchronized inputs needed for Direct Synchronization.
Exclusive Use Pointer implementation provides support that can be used for Direct Synchronization with an extended CAS.
Direct Synchronization can include Use Pointers for initialization and installing of Exclusive Use Pointers in the atomic operations list.