- Get off the hard drive
- DMP until you run out of RAM
- Parallelize across whatever cores you have left
An FEA solver must first create a system of linear equations that describes the FEA problem to be solved. This is sometimes called assembling the stiffness matrix, although other matrices, such as mass or damping, may be assembled as well. These equations are stored in matrix form and must then be operated on by a mathematical solver to arrive at the solution to the FEA problem. In approximate order of increasing memory requirements, the typical solvers seen in structural FEA are listed below (a small sketch of the two most common families follows the list):
- Forward integration time stepping: explicit transient dynamics, typically short-duration transients, such as vehicle impact
- Iterative: statics, linear and nonlinear, typically dense and blocky models, such as an engine block with mild contact nonlinearity
- Gaussian direct elimination, real (non-complex): statics, linear and nonlinear, steady state or implicit transient. One of the most common solver types
- Gaussian direct elimination, complex: frequency response, the steady-state response of a structure when all of the applied loads share the same forcing frequency, such as a speaker generating a single tone or a vehicle NVH analysis swept across a frequency range
- Eigenvalue solvers, modern methods (automated multilevel substructuring): natural frequencies and buckling loads, also used as the basis for modal frequency response and modal transient solutions
- Eigenvalue solvers, older methods (Lanczos, etc.)
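To make the difference between the first few families concrete, here is a toy Python comparison of a direct (Gaussian-elimination style) solve and an iterative solve of the same sparse, symmetric system. This is only a sketch using SciPy on a stand-in tridiagonal matrix, not a peek inside any commercial solver; the point is that the direct factorization creates fill-in, and therefore memory demand, that the iterative solve avoids.

```python
# Toy comparison of a direct solve and an iterative solve of the same sparse,
# symmetric system K u = f. A well-conditioned 1D tridiagonal matrix stands in
# for an assembled stiffness matrix so the iterative solve converges quickly.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000
K = sp.diags([-1.0, 3.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
f = np.ones(n)

# Direct elimination: factorize once, then solve. The factor fills in far more
# than K itself, which is where the large memory demand comes from.
lu = spla.splu(K)
u_direct = lu.solve(f)

# Iterative (conjugate gradient): only needs matrix-vector products, so its
# memory footprint stays close to that of K itself.
u_iter, info = spla.cg(K, f)

print("CG exit flag (0 means converged):", info)
print("max difference between the two answers:", np.abs(u_direct - u_iter).max())
```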
So what sort of memory should we use to store the system of equations? Given the layout of a typical computer's memory hierarchy, the most practical high-speed place to store the system of equations so that it can be quickly solved is the random access memory (RAM) of the computer.
At this point, those with a background in computers may be wondering: what if the system of equations is too large to fit in the free memory of the computer? A typical computer program would have the operating system store whatever didn't fit in virtual memory, that is, hard drive space that behaves as memory. For most programs this is the best thing to do when there is insufficient memory. FEA solvers, however, are different. They predate operating systems with virtual memory, and have developed their own specialized methods of making do with less than the ideal amount of memory. If the memory allocated to an FEA solver is less than what the system of equations needs, it will revert from what is known as an in-core solver to an out-of-core solver, which loads a portion of the system of equations, works on it, puts it away in a scratch file, and then loads a new chunk to work on.
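Here is a minimal sketch of what that out-of-core idea looks like, in Python with NumPy. The scratch file, block size, and the simple matrix-vector product are stand-ins of my own; real solvers block their factorizations in far more sophisticated ways, but the shape of the idea is the same: only one block of the matrix occupies RAM at a time, with the rest waiting on disk.

```python
# Sketch of out-of-core processing: the full matrix lives in a scratch file on
# disk, and only one block of rows is pulled into RAM at a time.
import os
import tempfile
import numpy as np

def matvec_out_of_core(scratch_path, n, x, rows_per_block):
    """Compute y = A @ x where A lives in a scratch file, one row-block at a time."""
    y = np.empty(n)
    A = np.memmap(scratch_path, dtype=np.float64, mode="r", shape=(n, n))
    for start in range(0, n, rows_per_block):
        stop = min(start + rows_per_block, n)
        block = np.array(A[start:stop, :])   # only this block occupies RAM
        y[start:stop] = block @ x
    return y

# Build a small scratch file standing in for a matrix too big for RAM.
n = 512
scratch = os.path.join(tempfile.mkdtemp(), "stiffness.scratch")
A_disk = np.memmap(scratch, dtype=np.float64, mode="w+", shape=(n, n))
A_disk[:] = np.random.rand(n, n)
A_disk.flush()
del A_disk                                   # release the write handle

x = np.ones(n)
y = matvec_out_of_core(scratch, n, x, rows_per_block=64)
print(y[:5])
```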
So now we have two interesting things to remember:
- The same level of detail may take different amounts of memory to solve depending on solution type
- The solver will need to know how much memory is actually available so it can intelligently decide how to store the system of equations, either in-core or out-of-core
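As a rough illustration of the second point, here is the sort of up-front check a solver has to make, sketched in Python. The psutil package, the 80% headroom, and the crude sizing formula are all assumptions of mine for illustration, not any solver's actual memory estimate.

```python
# Rough sketch of the up-front in-core vs. out-of-core decision. Real solvers
# size the factorized matrix with a symbolic factorization; the formula below
# is just a placeholder.
import psutil

def choose_solver_mode(n_dof, avg_front_width=1000, bytes_per_entry=8, fill_factor=10.0):
    # Placeholder estimate of how big the factorized system will be.
    est_bytes = fill_factor * n_dof * avg_front_width * bytes_per_entry
    available = psutil.virtual_memory().available
    mode = "in-core" if est_bytes < 0.8 * available else "out-of-core"
    print(f"estimated {est_bytes / 1e9:.1f} GB needed, "
          f"{available / 1e9:.1f} GB free -> {mode}")
    return mode

choose_solver_mode(n_dof=2_000_000)
```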
So now we've given a general answer to the first question, how to get off the hard drive: allocate enough memory. Now what?
Now we DMP (distributed memory parallel).
Solving a large system of equations is analogous to sitting at a desk and ordering a shuffled deck of cards. If we have only one deck of cards to sort, only one desk, and only one person sorting them, then there are no decisions to be made about how to allocate resources. Now let's say you have a second person who can help you, and a completely separate deck of cards to sort. If there is enough space on the desk, they can spread out the second deck and begin sorting without slowing you down. This is equivalent to a DMP run on a single computer: enough memory (desk space), enough people (CPUs), and a problem that can be divided up (separate decks of cards). This is confusing for many at first, myself included, because a computer with more than one CPU sharing the same pool of memory is technically an SMP computer system. So from a computer science perspective, a desk with a piece of tape down the middle is an SMP system doing DMP-style work, whereas two separate desks in two separate rooms is true DMP.
So what sort of FEA solutions lend themselves well to DMP-style splitting of the workload? Many, in fact, although the speedup can be very solver- and solution-type dependent. DMP can be used by Lanczos eigensolvers, and very easily by direct frequency analyses, where a solution needs to be found at each forcing frequency and each frequency's solution can be found independently. It can also be used by Abaqus for its Gaussian elimination, iterative, and explicit solvers, although the overhead of 'passing cards' from room to room can begin to dominate the solution times.
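Here is a toy version of that frequency-by-frequency independence, sketched with Python's multiprocessing so that each worker process owns its own memory, in the spirit of DMP on one box. The matrices, frequencies, and four-process pool are made up for illustration; a real direct frequency response would also include damping and complex arithmetic.

```python
# Each forcing frequency gives an independent system (K - omega^2 M) u = f,
# so each worker process (its own "desk") can solve its own subset of
# frequencies with no communication in between.
import numpy as np
from multiprocessing import Pool

n = 300
K = (np.diag(np.full(n, 2.0))
     + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))   # toy stiffness matrix
M = np.eye(n)                               # toy mass matrix
f = np.ones(n)                              # unit load at every DOF

def solve_one_frequency(freq_hz):
    omega = 2.0 * np.pi * freq_hz
    # Complex once damping is added; kept real here for brevity.
    return freq_hz, np.linalg.solve(K - (omega ** 2) * M, f)

if __name__ == "__main__":
    freqs = np.linspace(10.0, 200.0, 20)
    with Pool(processes=4) as pool:          # four separate "rooms"
        results = pool.map(solve_one_frequency, freqs)
    print("solved", len(results), "frequencies independently")
```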
So what if you have only one computer, and you don't have enough memory to split the solution up any further? This is where SMP can be of use; it is in fact one of the oldest ways that more than one CPU has been used to accelerate FEA solution times, dating back to Nastran running on Cray supercomputers. SMP is equivalent to having only enough room on the desk to work on the one deck of cards, but more than one person reaching in and working on ordering the cards. As you can imagine, returns quickly diminish for this type of parallelization, but if the CPUs are available there's no reason not to take advantage of it. Nearly any solver and nearly any solution type can use this style of parallelization, with the notable exception of Abaqus, which has instead focused on parallelizing in a manner that divvies up the cards intelligently ahead of time.
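A small demonstration of the SMP idea: one system of equations held in one pool of memory, solved with a varying number of threads. The threads here live inside the BLAS library that NumPy calls, and threadpoolctl (a third-party package, assumed installed) is just a convenient dial; exact timings will vary by machine, but the flattening curve as threads are added is the diminishing return described above.

```python
# SMP in miniature: the same in-memory system, more threads reaching into the
# one "deck of cards". Timings are machine dependent.
import time
import numpy as np
from threadpoolctl import threadpool_limits

n = 3000
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)      # symmetric positive definite stand-in
b = rng.standard_normal(n)

for threads in (1, 2, 4):
    with threadpool_limits(limits=threads):
        t0 = time.perf_counter()
        x = np.linalg.solve(A, b)
        elapsed = time.perf_counter() - t0
    print(f"{threads} thread(s): {elapsed:.2f} s")
```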
There is one final, more exotic, method of SMP that is sometimes available. With the growing general-purpose abilities of the vector processors inside the graphics cards of computers and video game systems, programmers have begun to use them as a new type of person working to sort the deck of cards. Only in this instance, it's more like a trained bag of mice: mice that still need a person to bring the deck of cards to the table and spread them out, but that can handle much of the hard work from there. This type of acceleration, often referred to as GPU acceleration, will only become more prevalent as the limits of using more and more transistors to make a single CPU faster become apparent. As of now, though, support is very limited: only the most basic static solution in Nastran and the real (non-complex), symmetric Gaussian solver in Abaqus are supported, and only on operating systems newer than Windows XP. The latest Intel chips, Sandy Bridge and later, have vector-processing functionality similar to this that is much better supported in Nastran than GPU acceleration.
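For completeness, here is what "handing the heavy lifting to the card" can look like in a sketch, using CuPy as a stand-in (my assumption; it requires an NVIDIA GPU and is not how Nastran or Abaqus actually implement their GPU support). The CPU still assembles the system and carries it over, the GPU does the factorization and solve, and the answer comes back for checking.

```python
# GPU sketch: the CPU "brings the deck to the table" (assembly and copy), the
# GPU "mice" do the solve.
import numpy as np
import cupy as cp

n = 4000
rng = np.random.default_rng(2)
A_cpu = rng.standard_normal((n, n))
A_cpu = A_cpu @ A_cpu.T + n * np.eye(n)   # SPD stand-in for a stiffness matrix
b_cpu = rng.standard_normal(n)

# Copy the assembled system to the GPU.
A_gpu = cp.asarray(A_cpu)
b_gpu = cp.asarray(b_cpu)

# The factorization and solve run on the GPU.
x_gpu = cp.linalg.solve(A_gpu, b_gpu)

# Bring the answer back and check the residual on the CPU.
x = cp.asnumpy(x_gpu)
print("max |Ax - b|:", np.abs(A_cpu @ x - b_cpu).max())
```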
We've now covered, in very general terms, how to get off the hard drive and make the most of our RAM, and having done that, how to make the most of our CPUs and even our graphics cards. Next we will cover solver-specific tuning parameters that take advantage of these concepts.
If you're out of memory on a single-processor machine, why not just use virtual memory? Further, there are two ways of doing that: 1) use an out-of-core solver; 2) switch on the OS's virtual memory.
Which of those two alternatives is the better? My sense (based on the experiences of others) is that it's usually better to write your own out-of-core solver.
A further complication: what about writing your own out-of-core solver on a virtual-memory machine? Based, again, on others' experiences, my sense is that writing your own out-of-core solver is more efficient than simply relying on virtual memory, but the view of several computer scientists (as opposed to engineers working on computing problems) is that it would be better to simply rely on the OS's virtual memory.
Any other thoughts from out there?
Claridon
I completely agree that an out-of-core solver will deliver better results than more general OS-level virtual memory when finding an FEA solution that can't fit into memory.
The bigger question that's harder to answer is: what is the typical speed impact of insufficient memory? If you look at part 3 of 7 of this series, you can see that virtual memory does a good job of minimizing the damage when either of two bad things happens:
- minimizing the damage when too little memory was asked for on a system with unused memory, by using memory to accelerate disk access
- minimizing the damage when too much memory was asked for on a system with less than the requested memory, by using disk to simulate memory
This means that on a typical modern OS with virtual memory enabled, it can be very tricky to find the impact of having insufficient real memory. Why would this be a great thing to know? Because it's not uncommon to have almost, but not quite, enough memory to DMP one more way, say on a dual-CPU system running a job that wants 55% of the free memory on the system. If the impact of having only 90% of the optimal memory is less than a 50% reduction in speed, then it would be worth it, and in that case it probably is. But what about a job that wants 65%? 75%? It would be great to know where the cutoff is, and how much hard drive speed affects it.
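For what it's worth, the break-even arithmetic can be sketched in a few lines. The speedup and slowdown numbers below are invented placeholders; the real values are exactly the measurements being asked for above.

```python
# Back-of-the-envelope trade-off with made-up numbers: is DMP-ing one more way
# worth it if each process then runs short of memory and slows down?
def worth_dmp_one_more_way(dmp_speedup, slowdown_when_short):
    # dmp_speedup: e.g. 1.8x from splitting the job two ways
    # slowdown_when_short: e.g. 0.7 means each process runs at 70% of full speed
    effective = dmp_speedup * slowdown_when_short
    return effective > 1.0, round(effective, 2)

print(worth_dmp_one_more_way(dmp_speedup=1.8, slowdown_when_short=0.7))  # (True, 1.26)
print(worth_dmp_one_more_way(dmp_speedup=1.8, slowdown_when_short=0.5))  # (False, 0.9)
```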