High-Performance Computing: RAM versus CPU

HPC users are still demanding more performance as they try to solve more complex problems and desire faster turnaround.


April 30, 2007
URL:http://drdobbs.com/parallel/high-performance-computing-ram-versus-cp/199201209

Michael Schulman is the Marketing Manager for High Performance Computing at Sun. He has a BS degree in Civil Engineering from Cornell University, and an MS in Computer Graphics from Cornell University.


High-performance computing (HPC) typically involves running mathematical simulations on computer systems. A few examples of commercial HPC are the simulation of car crashes for structural design, molecular interaction for new drug design, and the airflow over automobiles or airplanes. In government and research institutions, scientists are simulating galaxy creation, fusion energy, and global warming, as well as working to create more accurate short- and long-term weather forecasts.

For most HPC simulations, the CPU instruction mix contains significant amounts of floating-point calculations and relatively few integer calculations. A principle often associated with CPUs is Moore's Law, which states that the density of transistors that can be put onto a chip doubles about every two years. Recently, the increase in CPU speeds has slowed in terms of single-processor performance. Most commercial applications running today do not take advantage of the new multi-core designs, and thus are being over-served by the increases in density associated with Moore's Law. These types of applications simply don't need the increased amounts of computing power that are provided today.

Despite this, HPC users are still demanding more performance as they try to solve more complex problems and desire faster turnaround. Other markets demanding more performance include those involved in delivering compelling new content over the Web, and enterprise customers who are expanding their offerings to employees or customers. Figure 1 illustrates this trend. In this article, I delve further into the concepts and issues surrounding HPC, specifically taking a look at CPU and RAM in relation to HPC.

Figure 1: The Redshift

Basic Data Flow

Most modern computers work in a similar way when executing large simulation problems. The basic HPC server today consists of one or more CPUs, which perform the arithmetic, and RAM. RAM holds an application's instructions, as well as the data needed to run the application. When the application needs to write out results, the system accesses the I/O subsystem, and in certain cases it uses the network connection to communicate with other machines.

There are many ways to design a computer system, but an important factor is that the different sub-systems within a computer are balanced. If the CPU is very fast but can't be fed data from the memory system, the CPU has to wait. If the CPU is slow compared to the memory system, overall performance drops because the CPU can't process the data fast enough.

As users start up applications, the computer instructions that define the application are loaded from disk into RAM. As the application starts to execute on the CPU, it needs to read data from disk. At this point, I/O capability and speed are very important; throughput is typically measured in bytes of data per second, and modern systems can read or write in the gigabyte-per-second range. Data sets read from storage at application startup can range into the hundreds of gigabytes (10^9 bytes). Since the application typically can't start until a certain amount of data is in memory and quickly accessible by the CPU, it is important to have an efficient storage sub-system and architecture.
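As a rough illustration, the short Python sketch below times a large sequential read and reports the observed throughput. It assumes a hypothetical input file named input.dat; on a real system the OS page cache can inflate the number, so reading a large, not-yet-cached file gives a more honest figure.

import time

# Hypothetical input file; substitute a real (large, uncached) file.
path = "input.dat"
chunk = 8 * 1024 * 1024          # read in 8 MB pieces

total = 0
t0 = time.perf_counter()
with open(path, "rb") as f:
    while True:
        buf = f.read(chunk)
        if not buf:
            break
        total += len(buf)
elapsed = time.perf_counter() - t0

print(f"read {total / 1e9:.2f} GB in {elapsed:.1f} s "
      f"({total / elapsed / 1e9:.2f} GB/s)")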

The Breakdown: Nodes, RAM vs. CPU, and Memory Configuration

Compute nodes for HPC today are typically referred to as "fat" or "thin" nodes. Although there is no hard cutoff between fat and thin, many in the industry define a fat node as a computer (a single enclosure, or "sheet metal") with more than four sockets, while a thin node is generally defined as having four or fewer sockets. A socket can be thought of as the place where the chip physically plugs into the motherboard.

Why refer to sockets and not CPUs here? As chip design has moved from constantly increasing the clock rate (around 4 GHz at the top today) to adding more computational elements (cores) on the chip, it can be confusing to describe the compute power of a given chip. Sun, AMD, and Intel are shipping CPUs with two or more individual cores. Intel also has a four-core version, and AMD will be shipping a four-core processor this year. Sun has already shipped sockets with eight cores, although currently those systems are not aimed at HPC environments.
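To see the socket/core distinction on a running machine, the Linux-only sketch below counts sockets and logical CPUs by parsing /proc/cpuinfo. The field names ("physical id", "processor") are standard on Linux; treating them as present is an assumption that does not hold on other operating systems.

import os

sockets = set()
logical = 0
with open("/proc/cpuinfo") as f:        # Linux-specific
    for line in f:
        if line.startswith("processor"):
            logical += 1                 # one entry per logical CPU
        elif line.startswith("physical id"):
            sockets.add(line.split(":")[1].strip())   # one id per socket

print(f"{len(sockets)} socket(s), {logical} logical CPU(s); "
      f"os.cpu_count() reports {os.cpu_count()}")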

An important component of application performance is how fast data can move from RAM to the CPU. Over time, this transfer rate has not kept up with increases in CPU speed, and with the mainstreaming of multi-core CPUs the situation is even worse. Since two or more cores run different threads of an application, each demanding data from RAM, overall performance can suffer in HPC-type applications.
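A very rough way to see the memory side of this balance is to time a large array copy, as in the Python/NumPy sketch below. The array size and the resulting figure are illustrative only; serious bandwidth measurements use dedicated benchmarks such as STREAM.

import time
import numpy as np

# Use an array far larger than any CPU cache so the copy streams through RAM.
n = 200_000_000                    # ~1.6 GB of float64; reduce if RAM is tight
a = np.ones(n)
b = np.empty_like(a)

t0 = time.perf_counter()
np.copyto(b, a)                    # reads a and writes b
t1 = time.perf_counter()

bytes_moved = 2 * n * 8            # 8 bytes per float64, read + write
print(f"approximate memory bandwidth: {bytes_moved / (t1 - t0) / 1e9:.1f} GB/s")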

Most modern HPC applications have been rewritten over the past 15 years to take advantage of multiple CPUs working together. Since demand in the HPC community has outstripped the ability of a single CPU to deliver the desired performance, software developers have rewritten their applications to use multiple CPUs within a single machine. In addition, to get more performance scaling beyond a single enclosure, applications have been further enhanced to run across a number of nodes. This is typically referred to as "horizontal" scaling, since the computing environment is a collection of thin nodes. Applications that have been rewritten to take advantage of a number of cores have different requirements for CPUs or RAM to reach maximum performance.
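Within a single machine, the simplest form of this parallelism is to split the work across cores. The Python sketch below uses the standard multiprocessing module with a stand-in "solver"; the round-robin split and the work function are purely illustrative, not any particular HPC code's approach.

from multiprocessing import Pool
import os

def solve_piece(piece):
    # Stand-in for a real per-piece solver: just a sum of squares.
    return sum(x * x for x in piece)

if __name__ == "__main__":
    data = list(range(1_000_000))
    n_workers = os.cpu_count() or 4

    # Give each core its own slice of the problem (round-robin split).
    pieces = [data[i::n_workers] for i in range(n_workers)]

    with Pool(n_workers) as pool:
        partials = pool.map(solve_piece, pieces)

    print("total:", sum(partials))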

If the amount of memory (RAM) needed to hold an application's data is not sufficient, the operating system moves some of that data to temporary files on disk. When the data is needed again, it has to be brought back into main memory; this is known as "swapping." Since reading from and writing to the hard disk is orders of magnitude slower than accessing main memory, swapping is to be avoided at all costs.
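A simple pre-flight check can warn whether a job's working set will force swapping. The sketch below uses the third-party psutil package to query available RAM; the 12 GB data-set size is a made-up figure standing in for a real model.

import psutil                      # third-party: pip install psutil

# Assumed size of the data set the job is about to load (hypothetical figure).
needed_bytes = 12 * 1024**3        # 12 GB

available = psutil.virtual_memory().available
if needed_bytes > available:
    print(f"warning: job needs {needed_bytes / 1e9:.1f} GB but only "
          f"{available / 1e9:.1f} GB of RAM is free -- expect swapping")
else:
    print("data set fits in RAM; no swapping expected")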

The cost of a thin node is based on a number of factors. When trying to gain the most performance per dollar (the "price/performance ratio"), it is important to investigate whether paying more for RAM is more beneficial than paying more for a faster CPU (assuming all choices are dual core or beyond). After the actual cost of the machine to the vendor is determined by adding up the individual parts, a margin is applied to come up with the final price. If we look at Sun's Sun Fire X2200 M2, a two-socket, dual-core HPC server based on the AMD Opteron processor, RAM costs contribute from 16 to 60 percent of the final cost to build the machine, while CPU costs range from 20 to 35 percent of the final cost to Sun.

For example, a low-end AMD Opteron dual-core processor (model 2210), if purchased separately, would cost (normalized) 1.0. The model 2210 runs at 1.8 GHz. Moving to a faster processor, the model 2214, running at 2.2 GHz, would cost 2.36 times as much, for a raw speed gain of 1.22X. Moving to the fastest CPU choice, the model 2218, running at 2.6 GHz, would cost 4.28 times as much, with a raw performance gain of 1.44X. Note that this is for the processor only, and not for the whole system; the overall increase in system price is less than these values.

Looking at memory configurations, if a base configuration of 2 GB costs the customer 1.0, a 4 GB kit currently costs 2.22X, which is close to linear. The 4 GB kit uses higher-density memory, which allows larger data sets to be run. This matters because, as stated earlier, swapping due to insufficient memory must be avoided. Thus, customers should spend their capital dollars on making sure the system has enough memory before looking at faster CPU speeds.
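Putting the normalized numbers above side by side makes the price/performance argument concrete. The short Python calculation below uses only the component-cost ratios quoted in the two preceding paragraphs; actual system-level ratios will be smaller, as noted.

# Normalized component cost and raw speed, relative to the Opteron 2210
# (component prices quoted above, not whole-system prices).
cpus = {
    "Opteron 2210 (1.8 GHz)": (1.00, 1.00),
    "Opteron 2214 (2.2 GHz)": (2.36, 1.22),
    "Opteron 2218 (2.6 GHz)": (4.28, 1.44),
}
for name, (cost, speed) in cpus.items():
    print(f"{name}: speed per unit cost = {speed / cost:.2f}")

# Doubling RAM from 2 GB (cost 1.0) to 4 GB costs 2.22x -- roughly linear --
# and, unlike a faster clock, can eliminate swapping entirely.
print(f"4 GB kit: capacity per unit cost = {2.0 / 2.22:.2f}")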

Real-Life Applications

Looking at an example in the Mechanical Computer Aided Design (MCAE) area, techniques have been developed to use memory efficiently and to keep the OS from starting to swap. For example, simulating a car crash on a single thin node takes a significant amount of memory. To get maximum performance, the entire model (the solid car geometry broken up into small elements, or "finite elements," plus the material properties, the external forces, and the like) has to be loaded into memory. Depending on the resolution of the model (basically, how many elements are created and how many time steps are solved for), a tremendous amount of memory may be required. Applications like this can be solved either on one system, or broken into smaller parts and solved on a number of horizontal systems.

Application developers have been able to write software that runs across a number of systems. The problem is divided into smaller pieces, with each piece placed on a different computer. For example, the hood of the car can be simulated on one system, the engine compartment on another, the roof on a third, and so on. The different systems need to communicate their boundary conditions to one another, which requires fast communication. The benefit of this approach is that the simulation runs in much less time, since a number of CPUs are working together in parallel. Also, since each node or core gets only a part of the data, the memory requirement on each node is lower. In total, the memory requirements will be similar (or slightly higher) compared to running on one system, but the time to solution will be considerably less. In an era of high competitiveness in many markets, the value of getting results sooner outweighs the additional cost of purchasing more computer systems.
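The sketch below illustrates this decomposition pattern with mpi4py (Python bindings to MPI): each rank owns one slice of a 1-D array, exchanges its boundary values with its neighbors, applies a stand-in local computation, and then combines a global quantity. The smoothing step and array sizes are illustrative only, not how LS-DYNA or any particular MCAE code actually works.

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank holds only its own slice, so per-node memory shrinks as nodes are added.
n_global = 1_000_000
u = np.random.rand(n_global // size)

# Exchange boundary values ("boundary conditions") with neighboring ranks.
left, right = rank - 1, rank + 1
reqs = []
if left >= 0:
    reqs.append(comm.isend(u[0], dest=left, tag=0))
if right < size:
    reqs.append(comm.isend(u[-1], dest=right, tag=1))
halo_left = comm.recv(source=left, tag=1) if left >= 0 else u[0]
halo_right = comm.recv(source=right, tag=0) if right < size else u[-1]
for r in reqs:
    r.wait()

# Stand-in for the local solve: smooth the slice edges using the halo values.
u[0] = 0.5 * (u[0] + halo_left)
u[-1] = 0.5 * (u[-1] + halo_right)

# Combine a global quantity (for example, a residual) across all nodes.
total = comm.reduce(u.sum(), op=MPI.SUM, root=0)
if rank == 0:
    print("global sum:", total)

Run with, for example, "mpiexec -n 4 python decompose.py" (the script name is hypothetical); each of the four processes then holds only a quarter of the data in its own memory.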

Figure 2: Results from Scaling of an MCAE application.

An example of a crash-analysis CAE code is LS-DYNA from Livermore Software Technology Corporation (LSTC). Using one of its standard test cases, called "Neon Refined," customers or systems vendors can run the benchmark across various machine types and look at different parameters. Figure 2 shows how an MCAE application scales across multiple machines. The system used for this example was the Sun Fire X2100, which has one socket per system. By dividing the problem and using the memory associated with each node, the LS-DYNA run scales very well.

Another area where running an application in a horizontally scaled environment pays off is Computational Fluid Dynamics (CFD). Since the simulation can be broken up and solved in a parallel manner, the scaling can be very good and even slightly "super-linear" (see Figure 3). This typically happens because the multiple CPUs or cores together contain more cache than a single core, so more data can be held close to the computational units and overall processing is faster, with fewer accesses to main memory.

Figure 3: Scaling for a CFD example.
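
Speedup and parallel efficiency are the usual way to quantify this kind of scaling. The small calculation below uses made-up run times, not the data behind Figure 3, just to show how an efficiency above 1.0 signals super-linear scaling.

# Hypothetical wall-clock times (seconds) for the same job on 1, 2, 4, and 8 nodes.
times = {1: 1000.0, 2: 480.0, 4: 230.0, 8: 118.0}

t1 = times[1]
for n, tn in sorted(times.items()):
    speedup = t1 / tn                # S(n) = T(1) / T(n)
    efficiency = speedup / n         # E(n) = S(n) / n; > 1.0 is super-linear
    print(f"{n:2d} node(s): speedup {speedup:5.2f}, efficiency {efficiency:4.2f}")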

Conclusion

The amount of memory is critical to the overall performance of the system. The general rule is not to skimp on memory just to buy the fastest CPU available; it is important to investigate whether paying more for RAM is more beneficial than paying more for a faster CPU. In addition, by scaling horizontally, more memory can be addressed in aggregate, which may result in higher overall performance.
