- Get off the hard drive
- DMP until you run out of RAM
- Parallel whatever cores you have left
An FEA solver must first create a system of linear equations which describe the FEA problem to be solved. This is sometimes called assembling the stiffness matrix, although other matrices may be assembled, such as mass or damping. These equations are stored in matrix form, and must then be operated on by a mathematical solver in order to arrive at the solution to the FEA problem. In approximate order of increasing memory requirements, the typical solvers seen in structural FEA are:
- Forward integration time step: explicit transient dynamics, typically short-duration transients such as vehicle impact
- Iterative: statics, linear and nonlinear, typically dense and blocky models, such as an engine block with mild contact nonlinearity
- Gaussian Direct Elimination, non-complex (real): statics, linear and nonlinear, steady state or implicit transient. One of the most common solver types
- Gaussian Direct Elimination, complex: frequency response, the steady-state response of a structure to loads that all share the same forcing frequency, such as a speaker generating a single tone, or vehicle NVH analyses across a frequency range
- Eigenvalue solvers, modern methods (automated multilevel substructuring): natural frequencies and buckling loads, also used for modal-type solutions to frequency and transient responses
- Eigenvalue solvers, older methods (Lanczos, etc.)
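The assemble-then-solve split described above can be sketched in a few lines. The following is a minimal illustration, not taken from any real solver: it assembles the global stiffness matrix for a hypothetical chain of 1D spring elements, then solves K u = f by plain Gaussian direct elimination. All stiffness values and boundary conditions are made up for the example.

```python
# Illustrative sketch: assemble a global stiffness matrix from element
# contributions, then solve K u = f by Gaussian elimination.

def assemble_stiffness(n_elems, k=1.0):
    """Sum each spring element's 2x2 stiffness into the global matrix."""
    n = n_elems + 1
    K = [[0.0] * n for _ in range(n)]
    for e in range(n_elems):
        K[e][e] += k
        K[e][e + 1] -= k
        K[e + 1][e] -= k
        K[e + 1][e + 1] += k
    return K

def gauss_solve(K, f):
    """Gaussian elimination with partial pivoting and back-substitution."""
    n = len(f)
    A = [row[:] + [fi] for row, fi in zip(K, f)]  # augmented matrix [K | f]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            m = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= m * A[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (A[r][n] - sum(A[r][c] * x[c] for c in range(r + 1, n))) / A[r][r]
    return x

# Fix the first node (drop its row/column) and pull the free end with unit force.
K = assemble_stiffness(3, k=2.0)
K_free = [row[1:] for row in K[1:]]
u = gauss_solve(K_free, [0.0, 0.0, 1.0])
```

Real structural models assemble sparse matrices with millions of equations, but the two-phase structure, building K and then handing it to a mathematical solver, is the same.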
So what sort of memory should we use to store the system of equations? Given the layout of a typical computer's memory hierarchy, the most practical high-speed place to store the system of equations so it can be quickly solved is the computer's random access memory (RAM).
At this point, those with a background in computers may be wondering: what if the system of equations is too large to fit in the computer's free memory? A typical computer program would let the operating system push whatever didn't fit into virtual memory, hard drive space that behaves as memory. For most programs this is the best thing to do when memory runs short. FEA solvers, however, are different. They predate operating systems with virtual memory and have developed their own specialized methods of making do with less than the ideal amount of memory. If the memory allocated to an FEA solver is less than what the system of equations needs, it reverts from what is known as an in-core solver to an out-of-core solver, which loads a portion of the system of equations, works on it, puts it away in a scratch file, and loads a new chunk to work on.
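The in-core versus out-of-core decision can be sketched as follows. This is a toy illustration, not any real solver's logic: the "memory budget", block size, and stand-in `solve_block` work are all invented, and the scratch file is just a temporary file holding finished blocks.

```python
# Hypothetical sketch of the in-core / out-of-core choice an FEA solver
# makes: if the system fits the memory budget, work on it in RAM; if not,
# stream it through in blocks, parking results in a scratch file.
import pickle
import tempfile

def solve_block(block):
    # Stand-in for real factorization work on one block of equations.
    return [2.0 * v for v in block]

def run(system, memory_budget, block_size):
    if len(system) <= memory_budget:
        # In-core: everything stays in RAM.
        return [solve_block(system)], "in-core"
    # Out-of-core: load a chunk, solve it, put it away in a scratch file.
    scratch = tempfile.TemporaryFile()
    for start in range(0, len(system), block_size):
        pickle.dump(solve_block(system[start:start + block_size]), scratch)
    scratch.seek(0)
    results = []
    while True:
        try:
            results.append(pickle.load(scratch))
        except EOFError:
            break
    scratch.close()
    return results, "out-of-core"

blocks, mode = run(list(range(10)), memory_budget=4, block_size=4)
```

The out-of-core path produces the same answer as the in-core path; it just trades extra disk traffic for a smaller memory footprint, which is exactly why allocating enough RAM to stay in-core matters.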
So now we have two interesting things to remember:
- The same level of detail may take different amounts of memory to solve depending on solution type
- The solver will need to know how much memory is actually available so it can intelligently decide how to store the system of equations, either in-core or out-of-core
So now we've given a general answer to the first question, how to get off the hard drive: allocate enough memory. Now what?
Now we DMP (distributed memory parallel).
Solving a large system of equations is analogous to sitting at a desk and ordering a shuffled deck of cards. If there is only one deck of cards to sort, one desk, and one person sorting, there are no decisions to be made about how to allocate resources. Now suppose a second person can help, with a completely separate deck of cards to sort. If there is enough space on the desk, they can spread out the second deck and begin sorting without slowing you down. This is equivalent to a DMP run on a single computer: enough memory (desk space), enough people (CPUs), and a problem that can be divided up (separate decks of cards). This confuses many people at first, myself included, because a computer with more than one CPU sharing the same pool of memory is technically an SMP computer system. From a computer science perspective, a desk with a piece of tape down the middle is an SMP system doing DMP-style work, whereas two separate desks in two separate rooms is true DMP.
So what sort of FEA solutions lend themselves well to DMP-style splitting of the workload? Many, in fact, although the speedup can be very solver- and solution-type dependent. DMP can be used by Lanczos eigensolvers, and very easily by direct frequency analyses, where a solution must be found at each frequency and each frequency's solution can be found independently. Abaqus can also use it for Gaussian elimination, iterative, and explicit solvers, although the overhead of 'passing cards' from room to room can begin to dominate the solution time.
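To see why direct frequency response splits so cleanly, consider a hedged sketch: for a damped single-DOF system, the steady-state response at forcing frequency w is x(w) = F / (k + iwc - w^2 m), a small complex solve that depends only on w. The mass, damping, stiffness, and frequency values below are all made up; the loop stands in for work a DMP solver would hand to separate processes.

```python
# Each frequency's complex solve is independent of every other frequency's,
# which is what makes direct frequency response embarrassingly parallel.
# A DMP solver distributes these solves across processes; here we simply
# loop, since the independence (not the timing) is the point.

m, c, k = 1.0, 0.1, 100.0   # mass, damping, stiffness (illustrative values)
F = 1.0                      # force amplitude

def response(w):
    """Complex steady-state displacement at forcing frequency w (rad/s)."""
    return F / complex(k - w * w * m, w * c)

freqs = [2.0, 6.0, 10.0, 14.0]            # each entry is an independent job
amps = [abs(response(w)) for w in freqs]  # one candidate per process in DMP
```

Because no frequency's answer feeds into another's, the only 'cards passed between rooms' are the model itself going out and the answers coming back, which is why the speedup here is nearly ideal.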
So what if you have only one computer, and you don't have enough memory to split the solution up any further? This is where SMP can be of use, and it is in fact one of the oldest ways that more than one CPU has been used to accelerate FEA solution times, dating back to Nastran running on Cray supercomputers. SMP is equivalent to having only enough room on the desk for the one deck of cards, but more than one person reaching in and working on ordering them. As you can imagine, returns quickly diminish for this type of parallelization, but if the CPUs are available there's no reason not to take advantage of it. Nearly any solver and nearly any solution type can use this style of parallelization, with the notable exception of Abaqus, which has instead focused on parallelizing in a manner that divvies up the cards intelligently ahead of time.
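The SMP picture can be sketched with threads: one shared deck in one pool of memory, several workers on it at once. This is a rough illustration, not solver code; two threads each order half of a shared list in place, and a single serial merge then combines their halves, which is exactly the kind of step that makes SMP returns diminish. (Python's GIL means real SMP speedup needs native threads; the structure, not the timing, is the point here.)

```python
# One shared deck (one list in one pool of memory), multiple workers:
# the SMP card-sorting analogy with two threads and a serial merge.
import heapq
import random
import threading

deck = list(range(52))
random.shuffle(deck)          # one shuffled deck in shared memory

def sort_half(lo, hi):
    # Each worker reaches into the same shared deck, on its own slice.
    deck[lo:hi] = sorted(deck[lo:hi])

mid = len(deck) // 2
workers = [threading.Thread(target=sort_half, args=(0, mid)),
           threading.Thread(target=sort_half, args=(mid, len(deck)))]
for t in workers:
    t.start()
for t in workers:
    t.join()

# The final merge is inherently serial: everyone else waits while it runs.
deck = list(heapq.merge(deck[:mid], deck[mid:]))
```

Adding more workers shrinks the parallel sorting phase but does nothing for the serial merge, so each extra CPU buys less than the one before, the diminishing returns described above.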
There is one final, more exotic, method of SMP that is sometimes available. With the growing general-purpose abilities of the vector processors inside the graphics cards of computers and video game systems, programmers have begun to use them as a new type of worker sorting the deck of cards. In this instance, it's more like a trained bag of mice: mice that still need a person to bring the deck of cards to the table and spread them out, but that can handle much of the hard work from there. This type of acceleration, often referred to as GPU acceleration, will only become more prevalent as the limits of using more and more transistors to make a single CPU faster become apparent. As of now, though, support is very limited: only the most basic static solution of Nastran and the non-complex, symmetric Gaussian solver of Abaqus are supported, and only on post-Windows XP operating systems. The latest Intel chips, Sandy Bridge and later, have similar functionality that is much better supported in Nastran than GPU acceleration is.
We've now covered, in very general terms, how to get off the hard drive and make the most of our RAM, and, having done that, how to make the most of our CPUs and even our graphics cards. Next we will cover solver-specific tuning parameters that take advantage of these concepts.