One obvious benefit of the shared cache is to reduce cache underutilization since, when one core is idle, the other core can have access to the whole shared resource. Shared cache also offers faster data transfer between cores than does system memory. The new architecture simplifies the cache coherence logic and reduces the severe penalty caused by false sharing. It is well suited for facilitating multi-core application partitioning and pipelining (Ref [2]). Overall, the shared cache architecture brings more benefits and provides a better performance/cost ratio than does dedicated cache.
This article addresses specific software programming techniques which, by enabling effective use of shared cache, help users design and fine-tune applications for better performance.
Benefits of the shared cache architecture
This article will reference a dual-core/dual-processor system with a shared L2 cache, as shown in Figure 1. However, since the software techniques discussed are generic, the benefits apply to other shared cache systems as well. We will not discuss hardware specifics of Intel Core Duo processors, such as the power-saving and pre-fetching logic; please refer to Ref [1] for those details.
Figure 1. A dual-core, dual-processor system with shared L2 cache.
The benefits of a shared cache system (Figure 1) are many:
* Reduce cache under-utilization
* Reduce cache coherency complexity
* Reduce the false sharing penalty
* Reduce data storage redundancy at the L2 cache level: the same data only needs to be stored once in the L2 cache
* Reduce front-side bus traffic: effective data sharing between cores allows data requests to be resolved at the shared cache level instead of going all the way to system memory
* Provide new opportunities and flexibility to designers:
  o Faster data sharing between the cores than using system memory
  o One core can pre/post-process data for the other core (application partitioning, pipelining)
  o Alternative communication mechanisms between cores using the shared cache
The usage models for migrating single-core applications to multi-core can be grouped into two categories. One usage model (A) is to replace multiple single-core PCs with a single multi-core system, in which case users will likely deploy each core just like an individual PC.
The other model (B) is to combine the power of multiple cores in order to get a performance boost for a single application, in which case each core does part of the task in order to achieve the overall performance gain.
To compare the private cache and shared cache systems, we further divide the usage models into four scenarios, as shown in Table 1, below. We conclude that the overall edge belongs to the shared cache architecture.
Table 1. Comparison of private cache and shared cache.
As shown in Table 1, above, the shared cache performance is the same as or better than that of dedicated cache in these scenarios.
Five performance-tuning tips for using shared cache effectively
The caches are largely transparent to users, who have no explicit control over cache operations. However, better programming techniques will make a difference and help users achieve an improved cache hit ratio and increased performance.
Tip #1. Use processor affinity to place applications/threads properly. The capability to place threads and applications on a specific core is essential in a multi-core environment. Without this control, the system may suffer performance degradation caused by unnecessary data contention. Processor affinity functions are provided by various operating systems to allow fine control over thread and application placement.
Windows* example:
/* Set processor affinity */
BOOL WINAPI SetProcessAffinityMask(HANDLE hProcess,
    DWORD_PTR dwProcessAffinityMask);

/* Set thread affinity */
DWORD_PTR WINAPI SetThreadAffinityMask(HANDLE hThread,
    DWORD_PTR dwThreadAffinityMask);
Linux* example:
/* Get the CPU affinity for a task */
extern int sched_getaffinity(pid_t pid, size_t cpusetsize,
    cpu_set_t *cpuset);

/* Set the CPU affinity for a task */
extern int sched_setaffinity(pid_t pid, size_t cpusetsize,
    const cpu_set_t *cpuset);
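The following is a minimal usage sketch, not from the original article, showing how the Linux API above might be used to pin the calling process to core 0 (pid 0 refers to the calling process; the choice of core 0 is illustrative):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);                    /* clear the CPU mask */
    CPU_SET(0, &set);                  /* allow core 0 only  */

    /* pid 0 means "the calling process" */
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    printf("Pinned to core 0\n");
    return 0;
}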
Tip #2. Use cache blocking to improve cache hit ratio. The cache blocking technique allows data to stay in the cache while being processed by data loops. By reducing the unnecessary cache traffic (fetching and evicting the same data throughout the loops), a better cache hit rate is achieved.
For example, a large data set needs to be processed in a loop many times. If the entire data set does not fit into the cache, the first elements in the cache will be evicted to fit the last elements in. Then, on subsequent iterations through the loop, loading new data will cause the older data to be evicted, possibly creating a domino effect where the entire data set needs to be loaded for each pass through the loop. By sub-dividing the large data set into smaller blocks and running all of the operations on each block before moving on to the next block, there is a likelihood that the entire block will remain in cache through all the operations.
Cache blocking sometimes requires software designers to "think outside the box" in order to choose the flow or logic that, while not the most obvious or natural implementation, offers the optimum cache utilization (Ref[3]).
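To make the idea concrete, here is a minimal cache blocking sketch in C; the data size, block size, pass count and per-element operation are all illustrative assumptions, with the block size chosen to fit comfortably in cache:

#include <stdlib.h>

#define DATA_LEN    (4 * 1024 * 1024)  /* much larger than the cache (assumed) */
#define BLOCK_LEN   (32 * 1024)        /* assumed to fit comfortably in cache  */
#define NUM_PASSES  8

/* Unblocked: each pass streams the whole array through the cache,
   evicting the data that the next pass is about to need. */
void process_unblocked(float *data)
{
    for (int pass = 0; pass < NUM_PASSES; pass++)
        for (size_t i = 0; i < DATA_LEN; i++)
            data[i] = data[i] * 1.01f + 0.5f;
}

/* Blocked: all passes run over one cache-sized block before moving on,
   so the block stays cache-resident across the passes. */
void process_blocked(float *data)
{
    for (size_t start = 0; start < DATA_LEN; start += BLOCK_LEN)
        for (int pass = 0; pass < NUM_PASSES; pass++)
            for (size_t i = start; i < start + BLOCK_LEN; i++)
                data[i] = data[i] * 1.01f + 0.5f;
}

int main(void)
{
    float *data = malloc(DATA_LEN * sizeof(float));
    if (data == NULL)
        return 1;
    for (size_t i = 0; i < DATA_LEN; i++)
        data[i] = 1.0f;
    process_blocked(data);             /* swap in process_unblocked() to compare */
    free(data);
    return 0;
}

Because each element's update here is independent of the others, both versions compute the same result; only the cache behavior differs.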
Tip #3. Use cache-friendly multi-core application partitioning and pipelining. During the migration process from single core to multi-core, users may want to re-partition the single-core application into multiple sub-modules and place them on different cores, in order to achieve a higher degree of parallelism, hide latency and get better performance.
A recent study (Ref [2]) of SNORT (an open-source intrusion-detection application) reveals that a supra-linear performance increase can be achieved by proper partitioning and pipelining of the processing. The shared cache is well suited to facilitate such a migration process, as it provides a faster mechanism for modules running on different cores to share data and enables the cores to work together efficiently.
Understanding the shared cache architecture and application-specific cache usage scenarios can help designers to partition and pipeline the application in a cache-friendly way. For example, it may be worth separating the contention-only threads (threads that are not sharing any data but need to use cache) from each other, and assigning them to cores that are not sharing the cache (for example, core 0 and core 2 in Figure 1).
On the other hand, it is also important to consider assigning data-sharing threads to cores that are sharing the cache, or even assigning them to the same core if they need to interact with each other very frequently, to take advantage of faster data sharing in the caches.
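The following is a minimal sketch, not taken from the article, of this kind of cache-aware placement: two hypothetical pipeline stages are pinned to the cache-sharing cores 0 and 1 of Figure 1 using the GNU pthread affinity extension (the stage functions are empty placeholders):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_thread_to_core(pthread_t thread, int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Placeholder pipeline stages: the first would produce data that the
   second consumes (e.g. capture and inspection in the SNORT study). */
static void *first_stage(void *arg)  { (void)arg; return NULL; }
static void *second_stage(void *arg) { (void)arg; return NULL; }

int main(void)
{
    pthread_t producer, consumer;

    pthread_create(&producer, NULL, first_stage, NULL);
    pthread_create(&consumer, NULL, second_stage, NULL);

    /* The two stages share data constantly, so keep them on cores
       that share the L2 cache (cores 0 and 1 in Figure 1). */
    pin_thread_to_core(producer, 0);
    pin_thread_to_core(consumer, 1);

    pthread_join(producer, NULL);
    pthread_join(consumer, NULL);
    return 0;
}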
Tip #4. Minimize updating shared variables. During performance tuning, it is advisable to consider the frequency of updating the shared variables, as a disciplined approach to updating shared variables may improve the performance. One way to do that is by letting each thread maintain its own private copy of the shared data and updating the shared copy only when it is absolutely necessary.
Let us look at an OpenMP* example first. OpenMP is a collection of compiler directives and runtime libraries that help the compiler handle threading (Ref [4]). In the following example, a variable shared_var needs to be updated during the loop by all threads. To ensure there is no data race, a "critical" directive is used so that at any given time only one thread is updating the content of shared_var.
A thread will wait before entering the critical section until there is no other thread currently working on the section. Note this scheme requires each thread to update the shared variable every single time. In addition, synchronization has to be used to protect the shared data.
/* The following loop is assigned to N threads, with each thread doing
   some of the iterations. Each thread needs to update shared_var during
   its iterations. Synchronization is maintained by using a "critical"
   section. */
#pragma omp parallel for private(i, x)
for (i = 0; i < 100; i++) {
    /* Do other tasks */
    x = do_some_work();
    /* Use critical section to protect shared_var */
    #pragma omp critical
    shared_var = shared_var + x;
}
In OpenMP, there is a reduction clause that lets each thread work on its own copy and then consolidates the results after all the threads have completed their tasks.
The following is the updated code using the reduction clause:
/* With the reduction clause, each thread now works on its private
   copy of shared_var, and the results are consolidated only at
   the end */
#pragma omp parallel for private(i, x) reduction(+: shared_var)
for (i = 0; i < 100; i++) {
    x = do_some_work();
    shared_var = shared_var + x;
}
As we see from the above example, not only do we reduce the frequency of updating shared_var, but we also get a performance increase by avoiding synchronization.
Tip #5. Take advantage of the L1 cache eviction process by adding delay between write and read. The delayed approach takes advantage of L1 cache capacity eviction by inserting a delay between the write and the read. Assume two things: 1) we have a writer thread on core 0 and a reader thread on core 1, as shown in Figure 1; and 2) we insert a delay period after the writer thread modifies the data and before the reader thread requests the data.
Because of the L1 cache capacity limit, the data may be evicted to the shared L2 cache as a result of this delay. Since the data is already there, the reader thread will get an L2 cache read hit. As a comparison, in the case without the delay, the reader thread will get an L2 cache read miss because the data is still in the private L1 cache in core 0.
The delay in read access does not impact the system throughput, as it only introduces an offset to the read access. The ideal amount of delay is dependent on the L1 cache size and usage situation, as well as the amount of data being processed.
In reality, since there may be multiple contenders for the caches, it is often difficult to determine with accuracy the amount of delay needed for the shared data to be evicted into the shared L2 cache. This technique is only recommended for fine-tuning the applications and squeezing out some possible cycles.
The following is an example showing an implementation of the delay. When N=1, it is the original flow without any delay. The writer thread generates a buffer and the reader thread consumes it immediately. When N>1, a delay is applied to the first read access.
By introducing this offset between write and read access, cache hit ratio may be improved as some of the data may already be evicted into shared L2 cache. As a result, the reader thread will get more L2 cache read hits than would be the case without the delay.
Writer_thread_on_core_0()
{
    /* N = 1: normal flow; N > 1: flow with delay */
    first produce N buffers;
    while (workAmount - N)
    {
        signal reader thread to process;
        write one buffer;
        wait for signal from reader thread;
    }
}

Reader_thread_on_core_1()
{
    while (workAmount - N)
    {
        wait for signal from writer thread;
        read one buffer;
        signal writer thread;
    }
    read N buffers;
}
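For readers who prefer compilable code, the following is a minimal POSIX threads sketch of the same delayed flow, not from the original article. The buffer size, ring size, work amount and value of N are illustrative assumptions, and the affinity calls from Tip #1 (omitted for brevity) would pin the writer to core 0 and the reader to core 1.

#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE     4096      /* assumed buffer size   */
#define NUM_BUFS     8         /* assumed ring size     */
#define WORK_AMOUNT  1000      /* assumed total buffers */
#define N            2         /* N = 1 reproduces the original flow */

static char buffers[NUM_BUFS][BUF_SIZE];
static sem_t data_ready;       /* writer -> reader: a buffer is available */
static sem_t space_free;       /* reader -> writer: a buffer was consumed */

static void *writer_thread(void *arg)          /* would be pinned to core 0 */
{
    (void)arg;
    /* First produce N buffers so the reader always runs N buffers behind. */
    for (int i = 0; i < N; i++) {
        memset(buffers[i % NUM_BUFS], i, BUF_SIZE);
        sem_post(&data_ready);
    }
    for (int i = N; i < WORK_AMOUNT; i++) {
        sem_wait(&space_free);                 /* wait for signal from reader */
        memset(buffers[i % NUM_BUFS], i, BUF_SIZE);
        sem_post(&data_ready);                 /* signal reader to process    */
    }
    return NULL;
}

static void *reader_thread(void *arg)          /* would be pinned to core 1 */
{
    (void)arg;
    long sum = 0;
    for (int i = 0; i < WORK_AMOUNT; i++) {
        sem_wait(&data_ready);                 /* wait for signal from writer */
        for (int j = 0; j < BUF_SIZE; j++)
            sum += buffers[i % NUM_BUFS][j];
        if (i < WORK_AMOUNT - N)
            sem_post(&space_free);             /* keep the writer N buffers ahead */
    }
    printf("checksum %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t writer, reader;

    sem_init(&data_ready, 0, 0);
    sem_init(&space_free, 0, 0);
    pthread_create(&writer, NULL, writer_thread, NULL);
    pthread_create(&reader, NULL, reader_thread, NULL);
    pthread_join(writer, NULL);
    pthread_join(reader, NULL);
    return 0;
}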
Four rules on what to avoid
Rule #1: False sharing. This is a well-known problem in multi-processor environments. It happens when threads on different processors write to the same cache line, but at different locations. There is no real coherency problem, but a performance penalty is paid as a cache line status change is detected by the coherency logic.
The shared cache reduces false sharing penalties, both because there is no false sharing at the L2 level and because coherency is maintained over fewer caches. However, false sharing is inherited from the private L1 caches. Moreover, in multi-processor systems, there may be false sharing between cores that are not sharing a cache. For example, in Figure 1, threads running on core 0 may run into false sharing issues with threads running on core 2.
One way to fix false sharing issues is to allocate the data to different cache lines. Another way is to move the troubled threads to the same core. More information on false sharing can be found in several articles (Ref [5] and [6]).
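As an illustration of the first fix, the following minimal sketch (not from the article; a 64-byte cache line is assumed and the workload is illustrative) pads each thread's counter so it occupies its own cache line:

#include <pthread.h>
#include <stdio.h>

#define CACHE_LINE_SIZE 64                 /* assumed line size */
#define ITERATIONS      10000000L

/* Padding keeps the two counters at least one cache line apart, so an
   update on one core no longer invalidates the line used by the other. */
struct padded_counter {
    volatile long value;
    char pad[CACHE_LINE_SIZE - sizeof(long)];
};

static struct padded_counter counters[2];  /* one counter per thread */

static void *worker(void *arg)
{
    struct padded_counter *my_counter = arg;
    for (long i = 0; i < ITERATIONS; i++)
        my_counter->value++;               /* touches its own line only */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;

    pthread_create(&t0, NULL, worker, &counters[0]);
    pthread_create(&t1, NULL, worker, &counters[1]);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%ld %ld\n", counters[0].value, counters[1].value);
    return 0;
}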
Rule #2. Improper thread placement. Improper thread placement may cause unnecessary cache contention. For example, if two applications are not sharing data but both use a significant amount of cache, one should consider assigning them to cores that are not sharing cache, such as core 0 and core 2 in Figure 1.
Utilities of the Intel VTune analyzer, such as the sampling wizard, are helpful in collecting information on cache usage. A high concentration of L2 events (e.g. L2 Request Miss and L2 Read Miss) usually indicates areas for further investigation.
It is important to assign the applications in contention to cores that are not sharing the cache, while assigning the data-sharing applications to cache-sharing cores, or even to the same core, to take advantage of the L2/L1 data caches. In some cases, performance/cache trade-off decisions need to be made to ensure that the critical modules get proper treatment in terms of cache usage.
Rule #3. Avoid excessive use of synchronization mechanisms. As we partition and pipeline the single-core application to run better on multi-core systems, synchronization between threads on different cores is often needed to avoid data races.
The semaphore or mutex APIs provided by each operating system are convenient to use. However, excessive usage of these APIs may be costly in terms of performance.
One way to improve performance is to consider less expensive options such as sleep, mwait or monitor. Another way is to keep synchronization operations to a minimum: be careful not to become so cautious about threading issues that you end up protecting every publicly shared variable with synchronization. Use synchronization only where necessary, as excessive use does have performance implications.
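Echoing the reduction idea from Tip #4, but with explicit threads, the following minimal sketch (names and iteration counts are illustrative, not from the article) has each thread accumulate into a local variable and take the mutex only once, instead of locking on every iteration:

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000
#define NUM_THREADS 2

static long shared_total = 0;
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    long local_total = 0;
    (void)arg;

    /* No locking inside the hot loop. */
    for (int i = 0; i < ITERATIONS; i++)
        local_total += i & 0xFF;       /* stand-in for real work */

    /* One short critical section per thread. */
    pthread_mutex_lock(&total_lock);
    shared_total += local_total;
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    printf("total = %ld\n", shared_total);
    return 0;
}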
Rule #4. Know when to turn off hardware pre-fetch. Hardware pre-fetch is useful for applications that have a certain degree of spatial data locality. With pre-fetching, the data is in the cache even before the request happens. When used properly, pre-fetching can hide memory access latency and improve performance.
However, for applications in which data patterns lack spatial locality, the hardware pre-fetch may have a negative impact on performance. Therefore, it is important for users to understand the data pattern of their specific applications and to consider turning off the pre-fetch option in certain situations.
Conclusion
Because there is no explicit control over the usage, caches are largely
transparent to users. Nevertheless, an understanding of techniques
available for optimizing shared cache (e.g. placing threads and
applications properly on different cores based on data sharing or
contention relationships) can help software and system designers take
fuller advantage of the new multi-core systems.
Shared-cache architecture multi-core processors, such as the Intel Core Duo processor, take a huge step toward bringing the benefits of power-saving, dynamic cache utilization and flexibility to system designers and end-users. The shared cache architecture presents exciting design opportunities and more flexibility for efficient work-partitioning and fast data-sharing among multiple cores.
Tian Tian is a Software Technical Marketing Engineer at Intel Corp. supporting multi-core software development as well as the IXP2XXX product line. His previous experience includes firmware development for Intel IXP4XX product line NPE (Network Processor Engine). He was also the main designer for Intel IXP4XX product line Operating System Abstraction Layer (OSAL) in software release v1.5. Currently he is engaged with developing methods to optimize applications for multi-core processors.
References
1. "Intel Smart Cache," http://www.intel.com/products/processor/coreduo/smartcache.htm
2. "Supra-linear Packet Processing Performance with Intel Multi-core Processors," White Paper, Intel Corporation, 2006.
3. Phil Kerly, "Cache Blocking Techniques on Hyper-Threading Technology Enabled Processors."
4. OpenMP* website, http://www.openmp.org
5. Andrew Binstock, "Data Placement in Threaded Programs."
6. Dean Chandler, "Reduce False Sharing in .NET*."
7. "IA-32 Intel Architecture Optimization Reference Manual," Intel Corporation, 2005.
8. "Developing Multithreaded Applications: A Platform Consistent Approach," Intel Corporation, March 2003.
9. Don Anderson and Tom Shanley, "Intel Pentium Processor System Architecture, 2nd Edition," Addison Wesley, 1995.
10. Shameem Akhter and Jason Roberts, "Multi-Core Programming: Increasing Performance through Software Multi-threading."