- Get off the hard drive
- DMP until you run out of RAM
- Parallelize across whatever cores you have left
An FEA solver must first create a system of linear equations that describes the FEA problem to be solved. This is sometimes called assembling the stiffness matrix, although other matrices, such as mass or damping, may be assembled as well. These equations are stored in matrix form and must then be operated on by a mathematical solver to arrive at the solution to the FEA problem. In approximate order of increasing memory requirements, the typical solvers seen in structural FEA are listed below (a small sketch of the two most common families follows the list):
- Forward integration time stepping: explicit transient dynamics, typically short-duration transients, such as vehicle impact
- Iterative: statics, linear and nonlinear, typically dense and blocky models, such as an engine block with mild contact nonlinearity
- Gaussian direct elimination, real (non-complex): statics, linear and nonlinear, steady state or implicit transient. One of the most common solver types
- Gaussian direct elimination, complex: frequency response, the steady-state response of a structure when all of the applied loads share the same forcing frequency, such as a speaker generating a single tone or a vehicle NVH analysis swept across a frequency range
- Eigenvalue solvers, modern methods (automated multilevel substructuring): natural frequencies and buckling loads, also used as the basis for modal frequency response and modal transient solutions
- Eigenvalue solvers, older methods (Lanczos, etc.)
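To make the difference between the first few families concrete, here is a toy Python comparison of a direct (Gaussian-elimination style) solve and an iterative solve of the same sparse, symmetric system. This is only a sketch using SciPy on a stand-in tridiagonal matrix, not a peek inside any commercial solver; the point is that the direct factorization creates fill-in, and therefore memory demand, that the iterative solve avoids.

```python
# Toy comparison of a direct solve and an iterative solve of the same sparse,
# symmetric system K u = f. A well-conditioned 1D tridiagonal matrix stands in
# for an assembled stiffness matrix so the iterative solve converges quickly.
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 2000
K = sp.diags([-1.0, 3.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
f = np.ones(n)

# Direct elimination: factorize once, then solve. The factor fills in far more
# than K itself, which is where the large memory demand comes from.
lu = spla.splu(K)
u_direct = lu.solve(f)

# Iterative (conjugate gradient): only needs matrix-vector products, so its
# memory footprint stays close to that of K itself.
u_iter, info = spla.cg(K, f)

print("CG exit flag (0 means converged):", info)
print("max difference between the two answers:", np.abs(u_direct - u_iter).max())
```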
So what sort of memory should we use to store the system of equations? Given the layout of a typical computer's memory hierarchy, the most practical high-speed place to store the system of equations so that it can be quickly solved is the random access memory (RAM) of the computer.
At this point, those with a background in computers may be wondering: what if the system of equations is too large to fit in the free memory of the computer? A typical computer program would have the operating system store whatever didn't fit in virtual memory, that is, hard drive space that behaves as memory. For most programs this is the best thing to do when there is insufficient memory. FEA solvers, however, are different. They predate operating systems with virtual memory, and have developed their own specialized methods of making do with less than the ideal amount of memory. If the memory allocated to an FEA solver is less than what the system of equations needs, it will revert from what is known as an in-core solver to an out-of-core solver, which loads a portion of the system of equations, works on it, puts it away in a scratch file, and then loads a new chunk to work on.
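Here is a minimal sketch of what that out-of-core idea looks like, in Python with NumPy. The scratch file, block size, and the simple matrix-vector product are stand-ins of my own; real solvers block their factorizations in far more sophisticated ways, but the shape of the idea is the same: only one block of the matrix occupies RAM at a time, with the rest waiting on disk.

```python
# Sketch of out-of-core processing: the full matrix lives in a scratch file on
# disk, and only one block of rows is pulled into RAM at a time.
import os
import tempfile
import numpy as np

def matvec_out_of_core(scratch_path, n, x, rows_per_block):
    """Compute y = A @ x where A lives in a scratch file, one row-block at a time."""
    y = np.empty(n)
    A = np.memmap(scratch_path, dtype=np.float64, mode="r", shape=(n, n))
    for start in range(0, n, rows_per_block):
        stop = min(start + rows_per_block, n)
        block = np.array(A[start:stop, :])   # only this block occupies RAM
        y[start:stop] = block @ x
    return y

# Build a small scratch file standing in for a matrix too big for RAM.
n = 512
scratch = os.path.join(tempfile.mkdtemp(), "stiffness.scratch")
A_disk = np.memmap(scratch, dtype=np.float64, mode="w+", shape=(n, n))
A_disk[:] = np.random.rand(n, n)
A_disk.flush()
del A_disk                                   # release the write handle

x = np.ones(n)
y = matvec_out_of_core(scratch, n, x, rows_per_block=64)
print(y[:5])
```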
So now we have two interesting things to remember:
- The same level of detail may take different amounts of memory to solve depending on solution type
- The solver will need to know how much memory is actually available so it can intelligently decide how to store the system of equations, either in-core or out-of-core
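As a rough illustration of the second point, here is the sort of up-front check a solver has to make, sketched in Python. The psutil package, the 80% headroom, and the crude sizing formula are all assumptions of mine for illustration, not any solver's actual memory estimate.

```python
# Rough sketch of the up-front in-core vs. out-of-core decision. Real solvers
# size the factorized matrix with a symbolic factorization; the formula below
# is just a placeholder.
import psutil

def choose_solver_mode(n_dof, avg_front_width=1000, bytes_per_entry=8, fill_factor=10.0):
    # Placeholder estimate of how big the factorized system will be.
    est_bytes = fill_factor * n_dof * avg_front_width * bytes_per_entry
    available = psutil.virtual_memory().available
    mode = "in-core" if est_bytes < 0.8 * available else "out-of-core"
    print(f"estimated {est_bytes / 1e9:.1f} GB needed, "
          f"{available / 1e9:.1f} GB free -> {mode}")
    return mode

choose_solver_mode(n_dof=2_000_000)
```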
So now we've given a general answer to the first question, how to get off the hard drive: allocate enough memory. Now what?
Now we DMP (distributed memory parallel).
Solving a large system of equations is analogous to sitting at a desk and ordering a shuffled deck of cards. If we have only one deck of cards to sort, only one desk, and only one person sorting them, then there are no decisions to be made about how to allocate resources. Now let's say you have a second person who can help you, and a completely separate deck of cards to sort. If there is enough space on the desk, they can spread out the second deck and begin sorting without slowing you down. This is equivalent to a DMP run on a single computer: enough memory (desk space), enough people (CPUs), and a problem that can be divided up (separate decks of cards). This is confusing for many at first, myself included, because a computer with more than one CPU sharing the same pool of memory is technically an SMP computer system. So from a computer science perspective, a desk with a piece of tape down the middle is an SMP system doing DMP-style work, whereas two separate desks in two separate rooms is true DMP.
So what sort of FEA solutions lend themselves well to DMP-style splitting of the workload? Many, in fact, although the speedup can be very solver- and solution-type dependent. DMP can be used by Lanczos eigensolvers, and very easily by direct frequency analyses, where a solution needs to be found at each forcing frequency and each frequency's solution can be found independently. It can also be used by Abaqus for its Gaussian elimination, iterative, and explicit solvers, although the overhead of 'passing cards' from room to room can begin to dominate the solution times.
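Here is a toy version of that frequency-by-frequency independence, sketched with Python's multiprocessing so that each worker process owns its own memory, in the spirit of DMP on one box. The matrices, frequencies, and four-process pool are made up for illustration; a real direct frequency response would also include damping and complex arithmetic.

```python
# Each forcing frequency gives an independent system (K - omega^2 M) u = f,
# so each worker process (its own "desk") can solve its own subset of
# frequencies with no communication in between.
import numpy as np
from multiprocessing import Pool

n = 300
K = (np.diag(np.full(n, 2.0))
     + np.diag(np.full(n - 1, -1.0), 1)
     + np.diag(np.full(n - 1, -1.0), -1))   # toy stiffness matrix
M = np.eye(n)                               # toy mass matrix
f = np.ones(n)                              # unit load at every DOF

def solve_one_frequency(freq_hz):
    omega = 2.0 * np.pi * freq_hz
    # Complex once damping is added; kept real here for brevity.
    return freq_hz, np.linalg.solve(K - (omega ** 2) * M, f)

if __name__ == "__main__":
    freqs = np.linspace(10.0, 200.0, 20)
    with Pool(processes=4) as pool:          # four separate "rooms"
        results = pool.map(solve_one_frequency, freqs)
    print("solved", len(results), "frequencies independently")
```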
So what if you have only one computer, and you don't have enough memory to split the solution up any further? This is where SMP can be of use; it is in fact one of the oldest ways that more than one CPU has been used to accelerate FEA solution times, dating back to Nastran running on Cray supercomputers. SMP is equivalent to having only enough room on the desk to work on the one deck of cards, but more than one person reaching in and working on ordering the cards. As you can imagine, returns quickly diminish for this type of parallelization, but if the CPUs are available there's no reason not to take advantage of it. Nearly any solver and nearly any solution type can use this style of parallelization, with the notable exception of Abaqus, which has instead focused on parallelizing in a manner that divvies up the cards intelligently ahead of time.
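A small demonstration of the SMP idea: one system of equations held in one pool of memory, solved with a varying number of threads. The threads here live inside the BLAS library that NumPy calls, and threadpoolctl (a third-party package, assumed installed) is just a convenient dial; exact timings will vary by machine, but the flattening curve as threads are added is the diminishing return described above.

```python
# SMP in miniature: the same in-memory system, more threads reaching into the
# one "deck of cards". Timings are machine dependent.
import time
import numpy as np
from threadpoolctl import threadpool_limits

n = 3000
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
A = A @ A.T + n * np.eye(n)      # symmetric positive definite stand-in
b = rng.standard_normal(n)

for threads in (1, 2, 4):
    with threadpool_limits(limits=threads):
        t0 = time.perf_counter()
        x = np.linalg.solve(A, b)
        elapsed = time.perf_counter() - t0
    print(f"{threads} thread(s): {elapsed:.2f} s")
```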
There is one final, more exotic, method of SMP that is sometimes available. With the growing general-purpose abilities of the vector processors inside the graphics cards of computers and video game systems, programmers have begun to use them as a new type of person working to sort the deck of cards. Only in this instance, it's more like a trained bag of mice: mice that still need a person to bring the deck of cards to the table and spread them out, but that can handle much of the hard work from there. This type of acceleration, often referred to as GPU acceleration, will only become more prevalent as the limits of using more and more transistors to make a single CPU faster become apparent. As of now, though, support is very limited: only the most basic static solution in Nastran and the real (non-complex), symmetric Gaussian solver in Abaqus are supported, and only on operating systems newer than Windows XP. The latest Intel chips, Sandy Bridge and later, have vector-processing functionality similar to this that is much better supported in Nastran than GPU acceleration.
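For completeness, here is what "handing the heavy lifting to the card" can look like in a sketch, using CuPy as a stand-in (my assumption; it requires an NVIDIA GPU and is not how Nastran or Abaqus actually implement their GPU support). The CPU still assembles the system and carries it over, the GPU does the factorization and solve, and the answer comes back for checking.

```python
# GPU sketch: the CPU "brings the deck to the table" (assembly and copy), the
# GPU "mice" do the solve.
import numpy as np
import cupy as cp

n = 4000
rng = np.random.default_rng(2)
A_cpu = rng.standard_normal((n, n))
A_cpu = A_cpu @ A_cpu.T + n * np.eye(n)   # SPD stand-in for a stiffness matrix
b_cpu = rng.standard_normal(n)

# Copy the assembled system to the GPU.
A_gpu = cp.asarray(A_cpu)
b_gpu = cp.asarray(b_cpu)

# The factorization and solve run on the GPU.
x_gpu = cp.linalg.solve(A_gpu, b_gpu)

# Bring the answer back and check the residual on the CPU.
x = cp.asnumpy(x_gpu)
print("max |Ax - b|:", np.abs(A_cpu @ x - b_cpu).max())
```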
We've now covered, in very general terms, how to get off the hard drive and make the most of our RAM, and having done that, how to make the most of our CPUs and even our graphics cards. Next we will cover solver-specific tuning parameters that take advantage of these concepts.
If you're out of memory on a single-processor machine, why not just use virtual memory? Further, there are two ways of doing that: 1) use an out-of-core solver; 2) switch on the OS's virtual memory.
Which of those two alternatives is the better? My sense (based on the experiences of others) is that it's usually better to write your own out-of-core solver.
A further complication: what about writing your own out-of-core solver on a virtual-memory machine? Based, again, on others' experiences, my sense is that writing your own out-of-core solver is more efficient than simply relying on virtual memory, but the view of several computer scientists (as opposed to engineers working on computing problems) is that it would be better to simply rely on the OS's virtual memory.
Any other thoughts from out there?
Claridon
I completely agree that an out-of-core solver will deliver better results than more general OS-level virtual memory when finding an FEA solution that can't fit into memory.
The bigger question that's harder to answer is: what is the typical speed impact of insufficient memory? If you look at part 3 of 7 of this series, you can see that virtual memory does a good job of minimizing the damage when either of two bad things happens:
- minimizing the damage when too little memory was asked for on a system with unused memory, by using memory to accelerate disk access
- minimizing the damage when too much memory was asked for on a system with less than the requested memory, by using disk to simulate memory
This means that on a typical modern OS with virtual memory enabled, it can be very tricky to find the impact of having insufficient real memory. Why would this be a great thing to know? Because it's not uncommon to have almost, but not quite, enough memory to DMP one more way, say on a dual-CPU system running a job that wants 55% of the free memory on the system. If the impact of having only 90% of the optimal memory is less than a 50% reduction in speed, then it would be worth it, and in that case it probably is. But what about a job that wants 65%? 75%? It would be great to know where the cutoff is, and how much hard drive speed affects it.
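For what it's worth, the break-even arithmetic can be sketched in a few lines. The speedup and slowdown numbers below are invented placeholders; the real values are exactly the measurements being asked for above.

```python
# Back-of-the-envelope trade-off with made-up numbers: is DMP-ing one more way
# worth it if each process then runs short of memory and slows down?
def worth_dmp_one_more_way(dmp_speedup, slowdown_when_short):
    # dmp_speedup: e.g. 1.8x from splitting the job two ways
    # slowdown_when_short: e.g. 0.7 means each process runs at 70% of full speed
    effective = dmp_speedup * slowdown_when_short
    return effective > 1.0, round(effective, 2)

print(worth_dmp_one_more_way(dmp_speedup=1.8, slowdown_when_short=0.7))  # (True, 1.26)
print(worth_dmp_one_more_way(dmp_speedup=1.8, slowdown_when_short=0.5))  # (False, 0.9)
```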