Wednesday, October 31, 2012

The need for speed - Part 7 of 7: Optimizing model detail and solution type

This entry will be less specific than the others, as the techniques for reducing the amount of work required of an FEA solver are extraordinarily numerous.

There are roughly three ways to change a model so that it runs faster:
  • reduce the detail level of the model
  • make use of modeling abstractions
  • use aggressive techniques to arrive more quickly at nearly the correct solution

Reduce Detail Level

The typical guidance for how much detail a model needs is to run a convergence study, increasing detail until the result the analyst cares about stops changing as more detail is introduced.  I generally agree, although a smart analyst should still build up a feel for how much detail is necessary for the features commonly modeled in their work.  A few other tips:
  • If an analyst is comfortable working with second-order elements, they will typically achieve more accurate results with less run time than by simply adding more first-order elements
  • If only a portion of the model is of interest, Abaqus surface-to-surface ties can allow for a faster transition from the detailed region to the less detailed global model (see the sketch below)
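As a concrete example of that second tip, here is a minimal sketch of a surface-to-surface tie in an Abaqus input deck (the surface names are hypothetical):

*TIE, NAME=DETAIL_TO_GLOBAL
FINE_REGION_SURF, COARSE_REGION_SURF

The first surface on the data line is the slave (the finely meshed region of interest), the second the master (the coarse global mesh), letting the two meshes transition without matching nodes.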

Many situations never converge to a solution, and will continue to change their result no matter how fine the level of detail, because the elasticity solution is singular at the point of interest.  These situations are:
  • A perfectly sharp corner
  • A point in space that can see three material models, with empty space counting as one material model.  These are commonly encountered in soldered joints, composite layups, and other situations where bonding unites two dissimilar materials.  For these situations, correlation studies with physical test specimens are the usual approach, and they will indicate the level of detail to use in a corresponding FEA model
  • A point constraint or point load
Frequency-based solutions should aim for at least 4 elements per half wavelength of standing waves on the structure or in the air.  The Abaqus documentation provides good guidance on how to do this for acoustic FEA.
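As a quick worked example of that guidance, assuming sound in air at roughly c = 343 m/s and an upper frequency of interest of 1 kHz:

\lambda = \frac{c}{f} = \frac{343\ \text{m/s}}{1000\ \text{Hz}} \approx 0.34\ \text{m}, \qquad h \le \frac{\lambda/2}{4} \approx 43\ \text{mm}

so the acoustic mesh would need elements no coarser than about 43 mm to resolve standing waves at that frequency.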

Modeling Abstractions

Superelements
If a portion of the model is fixed and will not vary throughout the design process, you can create a superelement (or DMIG, as they are sometimes called in Nastran) that captures the structural response of that portion at its interface to the rest of the model.
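For Abaqus, a rough sketch of the generation step looks like the following (the retained node set name and the substructure identifier are placeholders):

*STEP
*SUBSTRUCTURE GENERATE, TYPE=Z1, OVERWRITE
*RETAINED NODAL DOFS
INTERFACE_NODES, 1, 6
*END STEP

This condenses the fixed portion of the model down to its response at the retained interface degrees of freedom, which can then be reused run after run.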
   
Submodeling
If the portion of interest in an Abaqus model is much smaller than the global model that determines the loads going into it, and design changes in the submodel will not affect the global model, a submodel can provide insight into the behavior of that small portion without recomputing the global solution.
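A minimal sketch of the submodel side in Abaqus, assuming a node set of driven boundary nodes (the set and job names here are hypothetical):

*SUBMODEL, EXTERIOR TOLERANCE=0.05
DRIVEN_NODES
...
*BOUNDARY, SUBMODEL, STEP=1
DRIVEN_NODES, 1, 3

with the global results pointed to at execution by something like 'abaqus job=local_detail globalmodel=global_run'.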

Modal Techniques
If the modal density of a model is not too high, modal transient or dynamic solutions can save significant run time.  This is particularly true with the advanced eigensolvers becoming common in solvers such as Abaqus, Nastran, and Optistruct.  Note that modal density refers to how many modes lie within the frequency range of interest, relative to how many degrees of freedom there are in the model.

Aggressive Techniques

Some of these are a little wild and could lead to inaccurate results.  Be careful.  A keyword sketch for the Abaqus-specific items follows the list.
  • Use the unsymmetric solver in Abaqus when dealing with models that have strongly nonlinear behavior.  Certain nonlinearities absolutely require it; others simply converge in fewer iterations with the unsymmetric solver, even though each iteration requires slightly more work
  • Use an iterative solver for large, compact models, such as engine blocks
  • Reduce the convergence criteria for nonlinear analyses.  If a model is nearly converged but not quite, these tricks let you call it good enough and move on
  • For models where a body is positioned only by contact with friction, try lightly constraining the body with a spring or constraint while converging the contact solution, then remove the constraint in the last step.  This will change the final result, as contact is path dependent
  • Use variable mass scaling in Abaqus/Explicit to set a minimum stable time increment when a few small elements control it.  The size of the scaling and the number of small elements may affect the accuracy of this technique
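For the two Abaqus-specific items above, a hedged sketch of the relevant keywords (the DT and FREQUENCY values are placeholders that must be tuned per model):

** Abaqus/Standard: request the unsymmetric solver for the step
*STEP, UNSYMM=YES
** Abaqus/Explicit: scale the mass of any element whose stable increment
** falls below DT, re-evaluated every 10 increments
*VARIABLE MASS SCALING, DT=1.0E-7, TYPE=BELOW MIN, FREQUENCY=10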

Wednesday, October 24, 2012

The need for speed: Part 6 of 7 - Reducing output requests

In order to find a solution to a structural FEA model, an FEA solver must find the displacements that result from the applied loads and displacements.  What an analyst does with this solution, which is now held in memory, will affect how much extra effort the FEA solver must put into finding other solution variables and into saving all the requested output to disk.  The typical guidance is that excess output requests hit hard drive space harder than solution times, but certain solution sequences generate output so frequently that it can severely slow the solution.  Explicit is the usual example: its time steps are so small (maybe a millionth of a second) that even for short duration events an analyst may only care about the solution every thousandth step.

There are two types of output from an FEA solution: primary field variables and calculated variables.  Primary field variables are the actual solution, such as displacements for a structural simulation or temperatures in a heat transfer simulation.  As such, when they are requested they only require writing data to the disk and have a small impact on simulation speed.  The other type of variable, calculated variables, are found by using the solution variables and the properties of elements to back calculate other variables, such as stress and strain in a structural simulation or heat flux in a thermal simulation.  These require actual computation, and as such will have a much bigger impact on simulation speed.

Some practical tips for particular solvers:

Abaqus
Abaqus will only output the variables that you request.  The most important thing in Abaqus is typically to reduce the frequency of output in nonlinear and explicit dynamic analyses.  Both of these solutions find a converged result over and over again, and unless a model is new and being diagnosed, these intermediate results do not need to be written very often.  Make use of the '*OUTPUT, TIME INTERVAL=' or '*OUTPUT, FREQUENCY=' keywords to limit how often output is written.
  
There are two typical types of output in Abaqus: field and history.  Field output is usually requested less frequently and holds the complete solution; history output is written more often and is typically used for global measures of model condition, such as total kinetic or elastic energy.

While using '*OUTPUT, VARIABLE=PRESELECT' will provide the most typically used variables in most cases, being more specific about which variables, and even which elements or nodes you want solutions at, will provide even more savings.
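Putting the above together, a sketch of a trimmed-down request block (the element set name is a placeholder):

** full field results only once per millisecond, and only in the region of interest
*OUTPUT, FIELD, TIME INTERVAL=1.0E-3
*ELEMENT OUTPUT, ELSET=CRITICAL_REGION
S, E
** cheap global energy measures every 100 increments
*OUTPUT, HISTORY, FREQUENCY=100
*ENERGY OUTPUT
ALLKE, ALLIE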

Do not use *RESTART unless you are fairly certain you need it, as it will generate an ENORMOUS restart file.

Optistruct
Optistruct has an interesting behavior: if no output is requested, it will generate the default outputs for you, similar to Abaqus' preselect.  Beyond not requesting things you don't care about, the most important thing to do in Optistruct is to be careful how frequently the full result of an optimization is output.  This is set on the card 'OUTPUT,(OUTPUT TYPE),(FREQUENCY)'.  The best choice is usually FL for the frequency, which generates full results only for the initial and final designs.
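For instance, assuming H3D results are what is being written, the card might look like:

OUTPUT,H3D,FL

which writes the full result only for the initial and final design iterations.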


Nastran
Similar to Optistruct, take care to request only the results you care about, using both selective output requests in the case control and sets that restrict results to the regions you care about.
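A small case control sketch (the set contents are hypothetical):

SET 100 = 1 THRU 5000
DISPLACEMENT(PLOT) = 100
STRESS(PLOT) = 100

Here displacements and stresses are only recovered for the entities in set 100, rather than for the whole model.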

Thursday, October 18, 2012

The Need for Speed - Part 5 of 7: A faster computer: performance metrics

So we've covered how to make the most of the particular hardware we might have.  But what if we were to start fresh and try to decide on what hardware would make the most of our software licenses?  I'll try and cover three typical scenarios that seem to come up and how to make the best of them.

I work at a company with Millions and Billions of Dollars in Revenue and purchasing managers who are either very reasonable and accommodating, or maybe even drunk.  What sort of hardware should I purchase for my department?
So there's a few pieces of good news here.

  1. Your purchasing managers are in fact very reasonable.  FEA software licenses are very expensive, running into the high 5 figures per year for enough to service a small department.  Compute servers capable of fully exploiting these licenses, on the other hand, are typically in the low 5 figures every few years.   Spending the same amount on software and half as much on hardware could easily halve your analysis throughput for a very small dollar savings.
  2. There are fewer options at the high end, which makes decisions about underlying architectures easier.
So on to specifics.  The basic decisions are:

  • Hardware Vendor
  • Host Operating System
  • Processor architecture
  • GPUs
  • Memory type and quantity
  • Hard drive types and RAID type
  • Network Interconnects

Hardware Vendor: If you're the sort of company that doesn't build its own multi-acre server farms, just purchase from a company you've heard of before, one that offers long term support contracts.

Host Operating System: Some people will fight the urge to use Red Hat Enterprise Linux 64-bit.  Don't.  Use Red Hat like everyone else.  Spend your time configuring the load monitoring and dispatch software.

Processor Architecture: The most important things are single core floating-point performance, number of cores, and memory throughput.  AMD's Bulldozer architecture and Itanium are out, as both have lost sight of floating point performance as their main performance metric in recent versions.  Nehalem/Westmere are also out, as their memory bandwidth is somewhat lacking at the higher end of CPU count.  This leaves Sandy/Ivy Bridge.  Buy these.  By all means buy the newest and fastest, as many as will fit in one server board.  They'll even come with Intel's Advanced Vector Extensions, which is better supported by Nastran than GPUs are.

GPUs: If you use Abaqus, maybe.  If you primarily use Explicit or the unsymmetric solver, a GPU will be of no benefit.  Otherwise, go ahead.

Memory Type and Quantity: RDIMMs, until no more can be installed; 300 to 400 GB per server.  The market seems to be losing interest in special high-speed memory modules.

Hard drive types and RAID type:  The host operating system doesn't even really need to be RAIDed for speed.  Long term storage should be offloaded to whatever your organization typically uses.  The speed and capacity of the high speed RAID array will matter more if the output is large, less if it is not.  Order something that seems large enough and that is standard for your vendor.  In contrast to years past, it just won't matter that much.  SSD or 15K RPM server drive, it's not as big a deal as it once was.

Network Interconnects: Consider Infiniband if you run a large amount of Abaqus/Explicit or very large (or iteration-heavy) nonlinear Abaqus/Standard models.  Otherwise Gbit Ethernet should suffice.  Some Nastran DMP solutions won't see any difference at all, due to how the domains can be separated.

So spend away!  And laugh in two years at what an outdated piece of junk you can't believe is still around!

I work in a lab at a University and we just received a grant for THOUSANDS OF DOLLARS of server hardware. 

The advice is fairly similar to the above.  The most important thing to do is to fill up one server with memory before buying the next, and if you can only partially fill one, stop there.  Whereas tens of GB is oftentimes sufficient for CFD cluster nodes, hundreds of GB are more desirable for FEA work.  You may very well find yourself with a single server holding only half its maximum memory, when you could have had half a dozen boxes on a shelf with less memory each.  Don't worry about it: when your models are big there is absolutely no substitute for a big enough chunk of RAM, and when they're small, several processes on one machine will have much less trouble talking to each other than they would over a network.

So in short, follow the below list in order until you run out of money:
  • Buy a dual socket Sandy Bridge server motherboard, case, power supply, and at least one reasonably large hard drive.  Consider whether you could use a GPU.
  • Buy a fast 8-core processor for one of the sockets
  • Buy RDIMMS until you've filled half the memory
  • Maybe buy a GPU?
  • Buy a second 8-core processor
  • Buy more RDIMMS until the memory is full
  • Maybe buy another GPU? 
  • Maybe RAID the hard drive?
  • Buy an identical Server and a Gigabit switch.
  • Buy Two more servers
  • With four servers, start to consider Infiniband, depending on FEA software
That should fairly easily cover the range from a few thousand to 100K or so.

I work for a boss whose bonus is determined by his department's IT spending.  Our computers are so old we're getting calls from a computer museum in California.


You have two options:

Option 1: Coffee and Crashed Drives
Anything more than two years on any magnetic hard drive is borrowed time.  This time can be significantly reduced if one were to, say, run jobs with far too little memory allotted to them, whilst kicking the case in an attempt to incite a head crash.  When the inevitable happens, tell your boss tales of how, if you'd only had enough memory, the hard drive wouldn't have worn out, plus your computer could run everyone else's jobs and not have to be replaced for a long time.  Your coworkers will now be beholden to you to run their jobs for them, and while they're running you can tell lies about how much it slows down your computer, which is why you're going to get more coffee.

Option 2: Take up a collection (of RAM)
There's a fair chance that some computers in the office are faster than others, and that the computer with the fastest processor has room for more RAM to be installed.  Take as much as can be spared from all the other computers, and load up The Chosen One.  If you're lucky your computer cases will be unlocked so you can actually get at the hardware, and the Chosen One will have an x86-64 processor (available since 2003) and a motherboard that can hold more than 4GB of RAM.  If you're very lucky you'll have a BIOS that will let you boot from a CD-ROM or USB drive, so you can install a free 64-bit Linux operating system alongside Windows XP 32-bit.  If you're very clever you'll then run the old Windows XP inside a virtual machine so a boss passing by will be none the wiser.

And if you get caught, it's always better to ask forgiveness than permission.

Friday, October 12, 2012

Friday Night Podcast - Sounds of the Artificial World

I occasionally listen to the 99% Invisible podcast.  It covers a number of interesting topics, mostly focused on civil engineering, but one of the earliest episodes focused on something different: the sounds that create the feel of computer programs.


Designing the interface of a computer program so that it creates an intuitive, physical connection with users is very difficult.  Unlike old stereo equipment, a computer program cannot give real tactile feedback when a dial has been turned all the way or a switch has been flipped on or off.  A display can do a very slick job of trying to appear real, or skeuomorphic, but it is still a picture under glass.  Most cell phones and video games can provide some shaking as crude feedback, which can tell you that something happened, but not much about what that something specifically was.  The only remaining way for a computer to give intuitive feedback is the sounds a program makes.  Think about the Windows Critical Stop sound or the early Mac 'uh-oh' sound; if you have a strong emotional reaction to something your computer does, it's probably to these sorts of sounds.  So it's not surprising that smart interface designers will spend a significant amount of time honing the sound effects in their programs.

Something I thought was especially interesting, when listening to the podcast, was that the sort of sounds users gravitated to most were recorded sounds of something mechanical happening, such as a vise grip being released, rather than something synthesized.  Even when using a device that would only make a big mechanical sound if it were being destroyed, people still enjoyed the unsynthesized sounds of real things happening.

Of course, things do have a way of coming full circle.

Wednesday, October 10, 2012

The Need for Speed - Part 4 of 7: Performance tuning parameters: Nastran

Previously I've covered how to find the optimal memory and cpu settings for Abaqus and Optistruct.  While the procedure for those FEA solvers may have seemed complicated, MSC Nastran (and for the most part its spinoff Nx Nastran) take this complexity to a whole new level.  The fundamental ideas are the same:
  • Get off the hard drive
  • DMP until you run out of RAM
  • SMP until you run out of processors
While DMP and SMP are no more difficult than in Abaqus or Optistruct, getting off the hard drive is much harder because, in my experience, Nastran does not do a good job of telling you what it needs.  Whereas Abaqus and Optistruct will tell you exactly what they need, so the memory settings are easy to set, Nastran provides only partial guidance, particularly when memory settings exceed 32-bit limits.  Further, Nastran does not dynamically allocate memory, so unlike Abaqus, if one overcommits memory, Nastran will make it unavailable to the rest of the computer even while not using it.  Not only that, Nastran typically reports memory values in terms of 'words.'  A word is a chunk of memory large enough to contain a single number, and can be either 4 or 8 bytes.  This makes reading diagnostic files more tedious, as one has to multiply a word count by the bytes/word to find the value in bytes.

Limits of Nastran Estimate Procedures


If you follow the official recommendations, there are two ways to estimate the memory needed to stay off the hard drive.  The first method is to use the 'estimate' program, which generates output like this:
Reading input file "./.bdf"
...
 Submit with command line arguments:
   memory=532.9mb


 Estimated Resource Requirements on a 32-bit system:
  Memory:                   532.9 MB
  Disk:                    9125.4 MB
  DBALL:                   6099.1 MB
  SCRATCH:                 3049.6 MB
  SCR300:                   845.3 MB
  SMEM:                      25.0 MB


Well, this looks very helpful, except it's very, very wrong.  This is the output for a model I know needed memory=33GB to run at full speed.  Estimate didn't even try to recommend an aggressive smemory setting to reduce the amount of data written to the SCRATCH and SCR300 files.

The other recommended way to find the optimal memory settings is to run the model until the .f04 file displays UIM (User Information Message) 4157.

 *** USER INFORMATION MESSAGE 4157 (DFMSYM)
     PARAMETERS FOR PARALLEL SPARSE DECOMPOSITION OF DATA BLOCK SCRATCH ( TYPE=CSP ) FOLLOW
                      MATRIX SIZE =   1163947 ROWS             NUMBER OF NONZEROES =  29408809 TERMS
           NUMBER OF ZERO COLUMNS =         0        NUMBER OF ZERO DIAGONAL TERMS =         0
                     SYSTEM (107) =     32770                      REQUESTED PROC. =         2 CPUS
           ELIMINATION TREE DEPTH =      8372
                CPU TIME ESTIMATE =       915 SEC                I/O TIME ESTIMATE =         3 SEC
       MINIMUM MEMORY REQUIREMENT =     36169 K WORDS             MEMORY AVAILABLE =   2807326 K WORDS
     MEMORY REQR'D TO AVOID SPILL =     36432 K WORDS         MEMORY USED BY BEND  =     36170 K WORDS
     EST. INTEGER WORDS IN FACTOR =    143016 K WORDS           EST. NONZERO TERMS =    299087 K TERMS
     ESTIMATED MAXIMUM FRONT SIZE =      2429 TERMS                 RANK OF UPDATE =        64


If we examine the .log file for this .f04, we see that mode=i8 was chosen, meaning each word is 8 bytes.  Multiplying out the 'memory reqr'd to avoid spill' (the estimate for in-core memory), UIM 4157 estimates that 36.432 megawords x 8 bytes/word = 291 MB will be required, out of the 2.807 gigawords x 8 bytes/word = 22.46 GB made available to hicore by the difference between the mem and smem settings.  So what did this model actually use?  If we read the Total Memory and Disk Usage Statistics section of the .f04 file, we see the following useful output:

*** TOTAL MEMORY AND DISK USAGE STATISTICS ***

 +---------- SPARSE SOLUTION MODULES -----------+
    HIWATER               SUB_DMAP        DMAP  
    (WORDS)   DAY_TIME      NAME         MODULE 
 2185800160   16:44:56    FREQRS    655  FRRD1  

...


+------------------------------ DBSET FILES -----------------------------+
 FILE      ALLOCATED  ALLOCATED    HIWATER       HIWATER  I/O TRANSFERRED
            (BLOCKS)       (GB)   (BLOCKS)          (GB)             (GB)

 MASTER         5000       2.44        155         0.076            0.093
 DBALL       2000000     976.56          3         0.001            0.004
 OBJSCR         5000       0.31        387         0.024            0.032
(MEMFILE       32768      16.00      31291        15.279            0.000)
 SCRATCH     2000000     976.56          1         0.000            0.000
 SCR300      2000000     976.56          1         0.000            0.000
                                                           ==============
                                                    TOTAL:          0.130

This model actually reached a hiwater of 17.48 GB, which means that UIM 4157 was off by a factor of roughly 60.  Therefore I usually find little value in UIM 4157.

A practical procedure for finding optimal memory settings in Nastran

I therefore propose a two-step process for finding what Nastran actually needs:
  • Give Nastran as much memory as your system has
  • Find out what Nastran actually used

How much memory you can set on the memory command depends on the machine and operating system.  If you have a 32-bit machine or operating system, such as standard Windows XP, the maximum will be 2 GB.  If you are on a 64-bit server you can set the mode=i8 flag and use as much memory as is installed in the machine.  The following assumes a user knows how much memory is actually available in a system: the installed memory, minus operating system overhead and whatever other users are consuming.

The memory settings in Nastran are:
  • memory=XGB, where X is the memory in gigabytes.  This is the total memory allocated
  • smemory=XGB, where X is the scratch memory ('smemory') in gigabytes.  Remember that the memory available to hicore, the main solver, is mem-smem.  Therefore smem must always be smaller than mem, and increasing smem without increasing mem will reduce the memory available to the main solver
  • mode=i4 (default) or i8.  This sets 4 bytes/word or 8 bytes/word.  Set it to i8 on any 64-bit machine with more than 16GB of memory so that Nastran can take full advantage of the installed memory.  Note that if a model turns out to need less than 16GB with mode=i8, you can take that memory requirement, halve it, and rerun with mode=i4
So to begin, set memory as high as the machine allows.  Then set smemory to about half of memory, as a large amount of scratch data is typically needed by a Nastran analysis.  Try the following settings, depending on machine; a sample command line follows the list.  The smallest machine I'm considering has at least 2GB of physical memory available.
  •  32-bit machines :  memory=2GB smemory=1GB
  • 64-bit machine with 8GB of memory: memory=6GB smemory=3GB mode=i4   (Try not to use more than ~80% of system memory)
  • 64-bit machine with 8GB to 16GB of memory:  memory=8GB smemory=4GB mode=i4
  • 64-bit machine with >16GB of memory: memory=(.8xtotal installed memory) smemory=(memory/2) mode=i8
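As a concrete sketch of that last case, a first run on a server with 40GB of physical memory might be submitted as (the job file name is a placeholder):

nastran jobfile.bdf memory=32gb smemory=16gb mode=i8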
Run your job until it completes.  Check the total memory and disk usage statistics.  The MEMFILE line will tell you how much smemory you actually need.  If the hiwater on SCRATCH and SCR300 is zero then congratulations, your machine had enough memory to run this job without the hard drive slowing it down.

 +------------------------------ DBSET FILES -----------------------------+
 FILE      ALLOCATED  ALLOCATED    HIWATER       HIWATER  I/O TRANSFERRED
            (BLOCKS)       (GB)   (BLOCKS)          (GB)             (GB)

 MASTER         5000       2.44        155         0.076            0.093
 DBALL       2000000     976.56          3         0.001            0.004
 OBJSCR         5000       0.31        387         0.024            0.032
(MEMFILE       32768      16.00      31291        15.279            0.000)
 SCRATCH     2000000     976.56          1         0.000            0.000
 SCR300      2000000     976.56          1         0.000            0.000
                                                           ==============
                                                    TOTAL:          0.130

For instance, this job required 15.28GB of the 16GB allocated to smemory.

Then check the hiwater.  This will tell you how much hicore (memory-smemory) you need.
*** TOTAL MEMORY AND DISK USAGE STATISTICS ***

 +---------- SPARSE SOLUTION MODULES -----------+
    HIWATER               SUB_DMAP        DMAP  
    (WORDS)   DAY_TIME      NAME         MODULE 
 2185800160   16:44:56    FREQRS    655  FRRD1


This particular job was run with mode=i8, and needed 2,185,800,160 words x 8 bytes/word = 17.48GB of hicore.  Therefore this job, rerun, would use memory=17.48GB + 15.28GB = 32.76GB and smemory=15.28GB, with mode=i8.  Note that if the hiwater comes close to the available hicore, then memory-smemory was set too small, and it should be increased if it can be.
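So the rerun of this example, rounded up slightly for margin, would look something like:

nastran jobfile.bdf memory=33gb smemory=15.5gb mode=i8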

Setting SMP and DMP
Nastran has a long history of SMP-style parallelization.  Use either the 'smp' or 'parallel' keyword to set parallel execution; this will work for nearly any modern installation and uses no more RAM.

DMP can also be used in solutions 101, 103, 108, 110, 111, and 112.  Check the DOMAINSOLVER executive control statement for more details on how to divide up the problem.  When executing, use the 'dmp' keyword to set the number of domains.  If there are other machines on the network that have the same version of Nastran and can accept the rsh command, they can also be used to run Nastran in DMP; use the hosts=node1:node2:node3:node4 keyword.  More details on configuration can be found in the MSC Nastran Installation and Operations Guide, 'Running Distributed Memory Parallel (DMP) Jobs'.  Similar to Optistruct, each DMP process requires its own memory allocation, so the total memory required is memory x DMP.
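Hedged sample invocations of the above (the job file, node names, and counts are placeholders):

nastran jobfile.bdf memory=8gb smemory=4gb parallel=4
nastran jobfile.bdf memory=8gb smemory=4gb dmp=4 hosts=node1:node2:node3:node4

Remember that the second line will demand 4 x 8GB = 32GB in total across the listed hosts.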

GPU acceleration is a very new development for Nastran, and unfortunately it is only supported in SOL 101.

Friday, October 5, 2012

Friday Video Mash-Up: BeamNG

As I mentioned earlier, BeamNG does an amazing job of simulating physics in video games.  Here is a video of it in action.



BeamNG offers some caveats in their description of their video: "If things seem too bouncy, please remember we're using real life spring and damper rates, and it probably looks wrong because nobody ever does anything this extreme with a real car".  For comparison, here is a real crash test:



So what do we notice when we compare the two?  I see two things:
  • In the real crash, the metal quickly deforms into a complex shape, and doesn't bounce around much after that, similar to the process of smashing an aluminum can.
  • In the real crash, there is some bouncing around of the thin metal on the outside of the vehicle.  The bouncing of these 'body panels' is visible immediately after impact but quickly stops.  By comparison, BeamNG does indeed seem too bouncy.
Let's look at a higher fidelity 'real' computer model and see if it can provide any insight into what's driving the limitations of BeamNG's physics.  Below is the Abaqus benchmark model e1, representative of a passenger car impacting a wall at 25 mph.  It has tens of thousands of points at which the vehicle crash is simulated.  The simulation covers half a second of impact, and the video below plays at 1/5 speed.


Same model, but with a cross section cut through it.



Looking at this model, it appears that the rapid permanent set of the material happens in a much more realistic fashion.  This is easily explained by the much higher level of detail allowed by a model that does not have to run in real time.  But the body panels are still bouncing around!  Why is this?  For insight, I first looked at the Abaqus input deck to see what was being used to stop the body panels from bouncing around.  Specifically, what I'm looking for is 'damping', the name for what keeps a bell from ringing forever.  Unsurprisingly, I found nothing: this model has no damping in it outside of the suspension.  Why would this model not include realistic damping to create a more realistic simulation?

The answer will be very familiar to FEA analysts who do significant amounts of work with explicit FEA solvers.  Damping of the kind typically found in structures is proportional to the deformation of the metal, and as can be seen in the Abaqus Analysis User's Manual ver 6.10, section 23.1.1, small amounts of this 'stiffness proportional' damping can reduce the stable time increment, i.e. how much time can pass in one simulation step without the FEA model exploding.  This reduces the speed of the simulation; a relatively minor amount of structural damping can slow the simulation by a factor of ten or more.  So for an FEA analyst simulating a car crash or cell phone impact, not only will damping not have a very large effect over the very short time scales of an impact, it will significantly increase the time it takes to arrive at a solution.  This is why damping was left out of this model and why it behaves strangely.  Damping can be so hard to add to explicit FEA solvers that a typical approach is to just filter the results after the fact.
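If I recall the manual's formula correctly, with stiffness proportional damping \beta_R the stable time increment behaves roughly like

\Delta t \le \frac{2}{\omega_{\max}}\left(\sqrt{1+\xi_{\max}^{2}}-\xi_{\max}\right), \qquad \xi_{\max} = \frac{\beta_R\,\omega_{\max}}{2}

Undamped (\xi_{\max}=0) this is the familiar 2/\omega_{\max}, but once \xi_{\max} grows past 1 the increment falls off like 2/(\beta_R\,\omega_{\max}^{2}), so even a small \beta_R applied to a mesh whose smallest, stiffest elements give a large \omega_{\max} can cut the increment by an order of magnitude or more.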

Which brings us back to BeamNG. In a recent interview, the developers spoke at length of the challenges faced by video game developers trying to do structural simulation in real time, stating that "Mass-spring systems have very bad stability, they tend to explode, and are very CPU intensive."  My guess is that BeamNG may have experimented with more realistic damping rates for more of the vehicle, but after noticing the massive changes in stability in their simulation removed most of it, leaving damping only in the suspension.

Of course, all this is nitpicking, and I wish them the best of luck in their ambitious endeavor.  If I may offer some unsolicited advice, I'd suggest that they be the ones to make the first game that uses their technology.  All the successful video game engines and rendering software that I can think of were noticed after the developers made an original product that captured the interest of the public.  Consider Pixar's RenderMan, the Doom engine, the Unreal engine, the Crytek CryEngine, and many others.  These may never have been used as widely had the developers not used their intimate knowledge of what their technology could do to make something great.  Fortunately, it looks like BeamNG is already heading down this path, and I look forward to a new way to use my computer to simulate smashing things.

Wednesday, October 3, 2012

The need for speed: Part 3 of 7 - Performance Tuning Parameters: Abaqus and Optistruct Linear

Previously we've discussed the three main priorities when trying to use the optimal settings for an FEA solver.
  • Get off the hard drive
  • DMP until you run out of RAM
  • SMP until you run out of processors
We also discussed an analogy that we will continue to make use of.  Essentially, solving an FEA model is analogous to sorting a deck of cards on a table.  In this analogy, SMP is equivalent to using the same amount of table space (table space being equivalent to memory) but using more people to help sort, with people being equivalent to processors.  DMP, on the other hand, involves breaking the table space into separate parts, either on one table or on separate tables in different rooms.

So given what we now know, how do we accomplish the optimal use of computer resources using specific commercial FEA software?

A Note on system limitations
The following blog post assumes a user is already knowledgeable about the limits of the system they are running their FEA simulations on.  Techniques for checking available system resources vary considerably between Windows and Unix/Linux based environments, and identical hardware may have different limitations based on 32-bit vs. 64-bit operating system limits, among other things.  I plan to cover this topic in more detail in a later post.

ABAQUS
Of the three codes I typically cover, Abaqus is the easiest to properly configure for speed.  All Abaqus needs to know is how much memory is free on a computer, how many processors are free on a computer, and what machines it can use.  There is almost nothing else to set.

To begin, start by doing a datacheck.  Assuming you know how much memory and how many cpus are in your computer or server, and that you know how to execute using the command line, start with the command:
abq6101.bat memory="1 gb" datacheck job=jobfile.inp
Note that abq6101.bat is what I use on my local machine.  On Linux servers it will typically be /opt/abaqus/Commands/abq61..., depending on version.

Hopefully your job has been carefully prepared and the Abaqus pre-processor will finish without any errors.  On Linux machines, or Windows machines running cygwin, you can monitor the progress of the datacheck run with the tail command, e.g.:
tail -n 100 -f jobfile.dat
The important data will be held in a section that reads:

 PROCESS      FLOATING PT       MINIMUM MEMORY        MEMORY TO
              OPERATIONS           REQUIRED          MINIMIZE I/O
             PER ITERATION         (MBYTES)           (MBYTES)
 
     1         1.86E+012              691               5623


For this run, at least 691 MB must be available to the Abaqus solver, and 5623 MB is enough to minimize I/O.

As an aside, Abaqus handles DMP in a very different way than Nastran or Optistruct.  Whereas Nastran and Optistruct multiply their memory usage by a factor of memory requested x DMP, Abaqus reports the total memory needed by all the DMP processes.  This is because Abaqus has put significant development effort into a style of DMP that's not quite full DMP: not a full box of cards being sorted independently on different tables or in different rooms, but more like several people at one table, with significant effort put into sorting out who gets what chunk of cards ahead of time.  In this way, with some effort by the first person to grab the deck, many people can be kept busy at a smaller table, and a single deck can even be worked across a great number of tables.  When there are several decks of cards to sort this becomes less efficient than the Nastran/Optistruct style, but Abaqus' approach is the best for attacking one deck.

Returning to the memory settings: because the memory total is the true memory total, one need only set it to the maximum actually available on a computer and Abaqus will handle the rest.  Abaqus is even smart enough to use only as much memory as it needs, as long as that is less than what you allocate.

So what happens if we have a desktop computer with 8GB of ram and a 64-bit operating system with virtual memory, and we try running it with memory ranging from the bare minimum to the maximum?

So what happened?  Wasn't adding memory the most important thing?  Well here's the big surprise: a modern operating system helps a great deal when you do one of two suboptimal things:
  • Using less memory than you actually have
  • Using more memory than you actually have
If you ask for less than you have, the operating system will be smart enough to use your computer's virtual memory, where the hard drive can stand in for RAM and vice versa, to keep your FEA program off the hard drive as much as it can.  If you try to use more memory than your computer has, it will still use virtual memory, only this time in the other direction.  As I've mentioned before, you're still better off using the FEA software's own out-of-core scratch memory management, but even if you do something wrong your operating system will usually be there to minimize the damage.

So let's say you now have enough memory; how do you use more CPUs?  Again, the syntax is simple:
abq6101.bat memory="6 gb" cpus=2 job=jobfile.inp
In this case we are requesting 6 GB of memory and 2 CPUs.  What are the performance benefits of additional CPUs?  Typically, the second CPU nearly doubles performance, 4 CPUs give a little more than triple speed, and so on as the returns diminish.  More detail can be found here.

How about GPUs?  Although I don't have a system that can take advantage of it, in the latest Abaqus 6.12 release it is selected with the syntax:
abq6121 memory="6 gb" cpus=2 gpus=1 job=jobfile.inp
The above will use two CPUs, 6GB of memory, and 1 GPU (1 video card).  Note that there are significant limits on the solvers that GPUs can be used with, specifically only the sparse symmetric solver in implicit solutions.  This rules out explicit and extremely nonlinear problems, such as those with coefficients of friction higher than 0.2.

Finally, the use of more than one host is enabled in the abaqus_v6.env file.  HP-MPI or some other MPI layer must be enabled, and Infiniband will yield measurably higher performance than 1Gbit Ethernet.  When using extra hosts, check that remote shell commands (rsh, or ssh with key-based logins) work between hosts without a password prompt.  To enable additional hosts, include in the abaqus_v6.env file the following line:
mp_host_list=[['host1',4],['host2',4]]
The first entry is the name of the host on the network, the second the number of cpus that can be used on that host.  Any run that uses more than one cpu and has extra hosts available in the abaqus_v6.env file will take advantage of them. 

One final thing to remember is that many of the previous memory and CPU settings were originally set only in the abaqus_v6.env file.  It may be worth checking what parameters remain in the file, whether from yourself or your administrator, as many of them have changed over the last few versions.

Optistruct
Optistruct can be optimized in a similar way: first run a datacheck, then use the correct memory parameters.
If one is in a hurry, one can simply run Optistruct with the settings below:
radioss -core in -cpu 2 jobfile.fem
If sufficient memory is available to run in-core with minimum disk use, it will do so, and in this example it will use two CPUs in SMP style.  Note that if insufficient memory is available for in-core, it will error out.

If we want to see what a job will need, then we can instead run
radioss -check jobfile.fem
Assuming your model is built well, the .out file will eventually read like this:

MEMORY ESTIMATION INFORMATION :
-------------------------------
 Solver Type is:  Sparse-Matrix Solver
                  Direct Method

 Current Memory (RAM)                                    :     116 MB
 Estimated Minimum Memory (RAM) for Minimum Core Solution:     154 MB
 Recommended Memory (RAM) for Minimum Core Solution      :     154 MB
 Estimated Minimum Memory (RAM) for Out of Core Solution :     189 MB
 Recommended Memory (RAM) for Out of Core Solution       :     211 MB
 Recommended Memory (RAM) for In-Core Solution           :    1515 MB
 Recommended Number of Nodes for OS SPMD Parallel Run    :       1
 (Note: Minimum Core Solution Process is Activated.)
 (Note: The Minimum Memory Requirement is limited by Assembly Module.)      
 (Note: Use param,HASHASSM,yes to avoid assembly module memory bottleneck)

DISK SPACE ESTIMATION INFORMATION :
-----------------------------------
 Estimated Disk Space for Output Data Files              :      18 MB
 Estimated Scratch Disk Space for In-Core Solution       :     208 MB
 Estimated Scratch Disk Space for Out of Core Solution   :    1895 MB
 Estimated Scratch Disk Space for Minimum Core Solution  :    1952 MB


In the above example, we would need a total of 1515+208 MB of RAM for a maximum reduction in hard drive use.  The syntax for this would be:
radioss -len 1723 -ramdisk 208 -cpu 2 jobfile.fem
As noted before, additional -cpu CPUs add some measure of speed but do not require more RAM.

If a job is of a type that can also use DMP, it can be requested using the syntax:
radioss -len 1723 -ramdisk 208 -cpu 2 -mpi -np 4 jobfile.fem
A few things I've found: HP-MPI sometimes has issues, so take care with the documentation to show it the path to the HP-MPI files if necessary.  Also note that the total memory demand is now 4 x 1723 = 6892 MB, and the total number of CPUs needed is 4 x 2 = 8.  Solution sequences that make good use of DMP tend to be direct frequency and Lanczos eigensolutions, although eigensolutions will nearly always be faster when using a modern multilevel eigensolver.

Closing Remarks
As we've seen, different solvers can be very different in how they understand the tuning parameters given to them by users.  Additionally, there is typically a significant amount to be learned before even solving a model by simply invoking a checkout run.  As we will see in the upcoming Nastran entry, checkout runs are unfortunately not as capable in MSC/MD Nastran.