Friday, September 6, 2013

More Memory.... Speed?

My typical advice for most people using an FEA solver is to, first things first, get off the hard drive.   Most of the time the best way to do this is to find a server which supports a LOT of memory, so almost any job is guaranteed to get off the hard drive.  Unfortunately, I've come to find that with many servers, installing more than a certain amount of memory will cause the memory to run at a slower speed.  A little background:

Multi-core servers typically have a certain number of memory "channels" allocated to each processor socket to allow access to the memory installed in a system.  Higher performance servers reach up to four channels per socket, so a high performance server with two sockets to install two processors into will likely have 2x4=8 channels that access the memory installed in the system.  However, memory with sane prices typically comes in no more than 16GB per DIMM (an individual memory board).  Thus this two socket system with only one DIMM per channel (DPC) would only have 2x4x16GB=128GB of memory available!  Fortunately, it's extremely rare to find a machine that doesn't run the memory at the highest speed currently available (1600 MT/s, or 800 MHz) when there's only one DIMM per channel.  If each of these sockets were occupied by an 8-core processor, such as a Xeon E5-2690, and if the typical analysis jobs of an analyst used less than 128GB of total memory even after splitting up the job 2x8=16 ways at the DMP level, then this machine would be ideal for that analyst's needs.

It's very common, however, to allow more than one DIMM per channel.  Two DIMMs per channel is fairly common (for 2x4x2=16 memory slots), and three is not unheard of.  Unfortunately, in my recent shopping experience, more than two DIMMs per channel will typically cause the memory controllers to restrict the memory speed to no more than 1066 MT/s, which can be a 15-30% performance hit.  So a dual socket system with a maximum of three DIMMs per channel could be configured with 2x4x2x16GB=256GB of full speed 1600 MT/s memory, or 2x4x3x16GB=384GB of lower speed 1066 MT/s memory.
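
To make the arithmetic concrete, here's a minimal sketch in Python, assuming the dual socket, four channel, 16GB DIMM numbers above and the one-or-two-DPC-at-full-speed derating I've seen while shopping; check your vendor's memory population guide for the real rules:

    SOCKETS = 2
    CHANNELS_PER_SOCKET = 4
    DIMM_GB = 16  # largest sanely priced DIMM, per the discussion above

    def memory_config(dimms_per_channel):
        """Capacity and (assumed) speed for a given DIMM-per-channel count."""
        capacity_gb = SOCKETS * CHANNELS_PER_SOCKET * dimms_per_channel * DIMM_GB
        # Assumed derating: 1 or 2 DPC runs at full 1600 MT/s, 3 DPC drops to 1066
        speed_mts = 1600 if dimms_per_channel <= 2 else 1066
        return capacity_gb, speed_mts

    for dpc in (1, 2, 3):
        cap, mts = memory_config(dpc)
        print(f"{dpc} DPC: {cap}GB at {mts} MT/s")
    # 1 DPC: 128GB at 1600 MT/s
    # 2 DPC: 256GB at 1600 MT/s
    # 3 DPC: 384GB at 1066 MT/s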

So what to do?  It depends on whether a typical analysis job fits into the high speed memory size before DMP'ing, and on an analyst's budget.  If the job is bigger than the high speed memory size even before DMP'ing, then it's worth taking the speed hit on the memory to avoid the far larger hit of having to use out-of-core scratch.  If the job DMPs well, and budget is no issue, then each machine should be configured to the maximum size that runs at full speed, and more machines should be purchased.  For a crazy person on an in-between budget, just run the machine with a third of the memory uninstalled and ready to go for when your models get a little too detailed.
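
The whole rule of thumb fits in a hypothetical helper like the one below, where the capacity thresholds are just the numbers from the example server above:

    def pick_memory_config(job_gb, full_speed_gb=256, reduced_speed_gb=384):
        """Rough sketch of the decision logic above; thresholds are examples."""
        if job_gb <= full_speed_gb:
            return "populate 2 DPC and stay at full 1600 MT/s"
        elif job_gb <= reduced_speed_gb:
            return "populate 3 DPC and take the memory speed hit to stay in-core"
        else:
            return "won't fit in-core either way; DMP across more machines"

    print(pick_memory_config(300))  # populate 3 DPC and take the memory speed hit...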

Tuesday, August 13, 2013

Life Interrupts

I'm still here, and have greatly enjoyed the interest all my readers have shown in my blog posts.  In the past few months I've had a number of pulls on my time, resulting in a new house, job, baby, and car, not necessarily in that order.  Things are starting to stabilize a bit, and there have been some modest developments in the field of high performance computing that I would like to use this blog to cover.  Stay off the hard drive, and I'll be seeing you soon.

Monday, December 3, 2012

More on DMP vs SMP

To paraphrase one of my professors at  UC Davis, once you understand a problem well enough to formulate an elegant question about what it is you don't understand, you're halfway to figuring it out.  It was with this in mind that I was contacted by one of my readers with some questions about some of the differences between SMP and DMP.  Paraphrasing and obfuscating a bit, our conversation went something like this:

Reader

Could you explain a bit more about the difference between DMP and SMP?  I understand your card sorting analogy, but what does that mean for hardware?  Multi-cores?  Dual CPUs?  Both?

AppliedFEA

The answer depends a bit on your solver and solution type. The executive summary is that SMP makes most things much faster, but DMP makes some things much, much faster, so long as you have enough memory to spare. What solver (Abaqus/Nastran/Optistruct) and solution type (statics/frequency/transient, etc.) are you targeting? Depending on the answers, DMP may not be available at all, such as if you're running Nastran SOL 106.

Reader

We run a lot of static stress analysis and modes.  We use implicit solvers such as ABAQUS/Standard, Optistruct, or MSC/MARC.  We run some ABAQUS/Explicit.

AppliedFEA


To break outside the card analogy, an FEA solution requires solving a system of equations built on a sparse stiffness matrix. Depending on the solution type, there are several levels at which the solution can be broken up and solved on more than one processor at a time, given that the system solving the problem has more than one CPU core. At the lowest level, SMP/parallel, only the vector operations are parallelized. Essentially every solution type in Nastran and Optistruct can make use of this low level parallelization.

You're in luck, however, as some of your solution types have the ability to be parallelized at a higher level.
For statics:
In Nastran, the DOMAINSOLVER card allows a statics (SOL 101) job to be split with the STAT method, which can assign each process a certain number of grids or degrees of freedom. Abaqus does its DMP split similarly, albeit solely at the element level.
For modes (eigensolver):
In Nastran SOL 103 you can use the MODES DMP method, which can split the job up at the grid, DOF, or frequency level (i.e. each process finds modes in a portion of the total frequency range; a toy illustration follows). Abaqus can, again, split things up at the element level.
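
To illustrate the frequency level split, here's a toy sketch in Python; the even division is mine for illustration, and the real eigensolvers divide the range more intelligently:

    def split_frequency_range(f_min, f_max, n_processes):
        """Hand each DMP process its own slice of the total frequency range."""
        width = (f_max - f_min) / n_processes
        return [(f_min + i * width, f_min + (i + 1) * width)
                for i in range(n_processes)]

    # Four processes each hunt for modes in their own 500 Hz slice:
    print(split_frequency_range(0.0, 2000.0, 4))
    # [(0.0, 500.0), (500.0, 1000.0), (1000.0, 1500.0), (1500.0, 2000.0)]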

If you're finding modes of models with large numbers of degrees of freedom, consider using an automated multi-level substructuring (AMLS) eigensolver instead of the more typical Lanczos. Both Optistruct and MSC Nastran can use the University of Texas-developed AMLS eigensolver, which can be purchased separately. All three solver vendors have also developed their own similar solvers: Optistruct calls it AMSES, Nastran ACMS, and Abaqus AMS. Speed increases of 10x or more are not unknown.

So to tie back to my typical advice: just make sure you are off the hard drive, DMP (split the job up at a high level, with each split requiring more memory) until you run out of memory or CPUs, and if there are any CPUs left after you run out of memory, use parallel/SMP to split the work up at a low level.
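
As a back-of-the-envelope sketch of that advice (all numbers invented for illustration): spend cores on DMP splits until memory runs out, then hand any leftover cores to SMP threads.

    def plan_split(total_cores, total_mem_gb, mem_per_dmp_gb):
        """Return (DMP processes, SMP threads per process)."""
        max_dmp_by_memory = max(1, total_mem_gb // mem_per_dmp_gb)
        dmp = int(min(total_cores, max_dmp_by_memory))
        smp = max(1, total_cores // dmp)  # leftover cores become threads
        return dmp, smp

    # 16 cores and 128GB of RAM, with each DMP domain wanting ~40GB in-core:
    print(plan_split(16, 128, 40))  # (3, 5): 3 DMP processes x 5 SMP threads each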

Friday, November 30, 2012

Friday Night Music: Normal Modes of an acoustic enclosure

At about the 1:10 mark in the below video, a low frequency waveform begins which is very, very close to the first resonant frequency of the passenger compartment of my first generation Mazda 3 with the windows rolled down.



Needless to say, the effect was very obnoxious, and enjoyable.

One of the characteristics of sound reproduction is that any combination of loudspeakers and acoustic cavities, such as a pair of headphones or an amphitheater, will have a number of frequencies with much higher responses than others.  A sound engineer attempting to accurately reproduce recorded sound will try to counteract these effects, but this process is very time consuming and difficult to accomplish for either every seat in an auditorium or every pair of headphones that could be plugged into an mp3 player.  As a result, much of what a typical person hears in their lifetime is colored by the speakers and headphones that they happen to be listening to.  It's not implausible to think that this has a significant effect on the music that we gravitate to, and it could go a long way to explaining why a song you enjoy very much on your headphones your friends might not enjoy at all.  You may have just found a song with particularly nice resonances for the speakers in your car.

One of the unique joys of live music, particularly live vocal performances, is that singers and musicians will naturally tend to gravitate towards the resonant frequencies of the spaces they're performing in, thus adapting their songs to the place they happen to be in.  Even as a fan of EDM, I can see plenty of work ahead for the genre as it tries to use technology to create a more unique performance at each appearance.

Wednesday, October 31, 2012

The need for speed - Part 7 of 7: Optimizing model detail and solution type

This entry will be less specific than other entries, as the techniques for reducing the amount of work required of an FEA solver are extraordinarily numerous. 

There are roughly three ways to change a model such that it runs faster:
  • reduce the detail level of the model
  • make use of modeling abstractions
  • use aggressive techniques to arrive more quickly at nearly the correct solution

Reduce Detail Level

The typical guidance for how much detail a model needs is to run a convergence study, increasing detail until the result that an analyst cares about is no longer changing as more detail is introduced.  I generally agree, although a smart analyst should still build up a feel for how much detail is necessary for the features that are commonly modeled in their work.  A few other tips:
  • If an analyst is comfortable working with second order elements, they will typically achieve more accurate results with less run time than simply adding more elements
  • If only a portion of the model is of interest, then Abaqus Surface-Surface ties can allow for a faster transition to the less detailed global model

Many situations do not converge to a solution, and will continue to change their result no matter how fine the level of detail, as the elasticity solution is singular at the point of interest.  These situations include:
  • A perfectly sharp corner
  • A point in space that can see three material models, with empty space counting as one material model.  These are commonly encountered in soldered joints, composite layups, and other situations where bonding unites two dissimilar materials.  For these situations, correlation studies using physical test specimens are the usual recourse, and they will indicate the level of detail to use in a corresponding FEA model
  • A point constraint or point load
Frequency based solutions should aim to have at least 4 elements per half wavelength for standing waves on the structure or in the air.  The Abaqus documentation provides good guidance for how to do this for acoustic FEA.
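
As a quick worked example of that rule in Python (mine, not from any documentation; 343 m/s is the speed of sound in room temperature air, so swap in the wave speed for your structure or medium):

    def max_element_size(frequency_hz, wave_speed=343.0, elems_per_half_wave=4):
        """Largest element edge that keeps 4 elements per half wavelength."""
        half_wavelength = wave_speed / (2.0 * frequency_hz)
        return half_wavelength / elems_per_half_wave

    # An acoustic mesh resolving standing waves up to 1 kHz in air:
    print(f"{max_element_size(1000.0) * 1000:.1f} mm")  # ~42.9 mm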

Modeling Abstractions

Superelements
If a portion of the model is fixed and will not vary throughout the design process, you can create a superelement (or DMIG, as they are sometimes referred to in Nastran) to capture the structural response at the interface to that portion of the model.
   
Submodeling
If the portion of interest in an Abaqus model is much smaller than the global model which determines the loads that go into it, and design changes in the submodel will not affect the global model, a submodel can provide insight into the behavior of the small portion of the model without having to recompute the global solution.

Modal Techniques
If the modal density of a model is not too high, modal transients or dynamics can save significant run time.  This is particularly true with the advanced eigensolvers that are becoming more common in solvers such as Abaqus, Nastran, and Optistruct.  Note that modal density refers to how many modes lie within the frequency range of interest, relative to how many degrees of freedom there are in the model. 
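
A minimal sketch of why this pays off, with made-up numbers: once a modest set of modes is extracted, the response is computed from a handful of modal coordinates instead of the full set of physical degrees of freedom.

    import numpy as np

    n_dof, n_modes = 10_000, 20            # big model, low modal density
    Phi = np.random.rand(n_dof, n_modes)   # stand-in for mass-normalized mode shapes
    omega = np.linspace(10, 500, n_modes) * 2 * np.pi  # natural frequencies, rad/s

    # Project a unit point load at one DOF into modal space: 20 numbers, not 10,000
    f_modal = Phi[1234, :]

    # Undamped steady-state modal response at a 100 Hz drive (illustrative only):
    w_drive = 2 * np.pi * 100.0
    q = f_modal / (omega**2 - w_drive**2)  # 20 decoupled scalar equations
    u = Phi @ q                            # recover physical displacements
    print(u.shape)                         # (10000,)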

Aggressive Techniques

Some of these are a little wild and could lead to inaccurate results.  Be careful.
  • Use the unsymmetric solver in Abaqus when dealing with models which have strong nonlinear behavior.  Certain nonlinearities absolutely require it; others simply converge in fewer iterations with an unsymmetric solver, even though each iteration requires slightly more work
  • Use an iterative solver for large compact models, such as engine blocks
  • Reduce convergence criteria for nonlinear analyses.  If a model is nearly converged but not quite, these tricks let you call it good enough and move on
  • For models where a body is positioned only by contact with friction, try lightly constraining the body with a spring or constraint while converging the contact solution, then removing the constraint at the last step.  It will change the final result, as contact is path dependent
  • Use variable mass scaling in Abaqus/Explicit to set a minimum stable time increment when there are a few small elements controlling the minimum stable time increment.  The size of the scaling and the number of small elements may affect the accuracy of this technique (a rough sketch of the arithmetic follows this list)
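
On that last point, the arithmetic behind mass scaling is worth seeing once: the stable increment of an element goes as its length over the dilatational wave speed, c = sqrt(E/rho), and since the increment scales with sqrt(rho), hitting a target increment means inflating density by the square of the shortfall.  A rough sketch with steel-ish numbers, all illustrative:

    import math

    E, rho = 210e9, 7850.0       # roughly steel
    c = math.sqrt(E / rho)       # dilatational wave speed, ~5200 m/s

    def mass_scale_factor(elem_length, target_dt):
        """Density multiplier so dt_stable = L / c_scaled reaches target_dt."""
        dt_stable = elem_length / c
        if dt_stable >= target_dt:
            return 1.0                       # already stable at the target
        return (target_dt / dt_stable) ** 2  # dt goes with sqrt(rho)

    # One 0.1 mm sliver element holding back a model targeting dt = 1e-7 s:
    print(f"{mass_scale_factor(1e-4, 1e-7):.0f}x")  # ~27x density on that element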

Wednesday, October 24, 2012

The need for speed: Part 6 of 7 - Reducing output requests

In order to find a solution to a structural FEA model, an FEA solver must find the displacements that result from the applied loads and displacements.  What an analyst asks to be done with this solution, which is now held in memory, will affect how much extra effort the FEA solver must put into finding other solution variables, and into saving all the requested output to disk.  The typical guidance is that excess output requests have a bigger impact on hard drive space than on solution times, but certain solution sequences can generate output so frequently that it severely slows the solution.  Explicit is the classic case: its time steps are typically so small (maybe a millionth of a second) that even for short duration events an analyst may only care about the solution result every thousandth step. 
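
An order-of-magnitude check on that claim, with made-up but representative numbers:

    event_duration = 0.05     # a 50 millisecond impact event, in seconds
    stable_increment = 1e-6   # a typical explicit stable time increment
    increments = int(event_duration / stable_increment)
    print(increments)             # 50000 converged solutions to (maybe) write out
    print(increments // 1000)     # 50 frames if output every thousandth increment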

There are two types of output from an FEA solution: primary field variables and calculated variables.  Primary field variables are the actual solution, such as displacements for a structural simulation or temperatures in a heat transfer simulation.  As such, when they are requested they only require writing data to the disk and have a small impact on simulation speed.  Calculated variables, on the other hand, are found by using the solution variables and the properties of the elements to back calculate other quantities, such as stress and strain in a structural simulation or heat flux in a thermal simulation.  These require actual computation, and as such will have a much bigger impact on simulation speed.
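
To see the difference, consider a toy one dimensional bar element in Python.  The displacements come straight out of the solution vector, but stress has to be back calculated per element from a strain-displacement matrix and the material stiffness; all the numbers below are illustrative:

    import numpy as np

    E, L = 210e9, 0.1                    # Young's modulus (Pa), element length (m)
    B = np.array([[-1.0 / L, 1.0 / L]])  # strain-displacement matrix (1x2)
    u = np.array([0.0, 1e-5])            # nodal displacements from the solve (m)

    strain = B @ u     # the extra work, done per element, per output frame
    stress = E * strain
    print(stress)      # [21000000.] Pa, i.e. 21 MPa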

Some practical tips for particular solvers:

Abaqus
Abaqus will only output the variables that you request.  The most important thing in Abaqus is typically to reduce the frequency of output in nonlinear and explicit dynamic analyses.  Both of these solutions will find a converged result over and over again, and unless a model is new and being diagnosed these intermediate results typically do not need to be output very frequently.  Make use of the '*OUTPUT, TIME INTERVAL=' or '*OUTPUT, FREQUENCY=' keywords to limit the frequency of these outputs.
  
There are two typical types of output in Abaqus: field and history.  Field is usually used less frequently to output the complete solution, and history is more typically used for global measures of model condition, such as total kinetic or elastic energy.

While using "*output, variable=preselect" will provide the most typically used variables in most cases, being more specific about which variables, and even which elements or nodes you wish to find solutions at, will provide even more savings

Do not use *RESTART unless you are fairly certain you need it, as it will generate an ENORMOUS restart file.

Optistruct
Optistruct has an interesting behavior in that if no output is requested, it will generate the default outputs for you, similar to Abaqus' PRESELECT.  Outside of not requesting things you don't care about, the most important thing to do in Optistruct is to be careful about how frequently a full result from an optimization is output.  This can be set on the card 'OUTPUT,(OUTPUT TYPE),(FREQUENCY)'.  The best choice is usually FL for the frequency, as this will generate the full results for the initial and final designs.


Nastran
Similar to Optistruct, take care to request only the results you care about, using both selective choices of output requests in the case control and sets to only request results for the regions you care about.

Thursday, October 18, 2012

The Need for Speed - Part 5 of 7: A faster computer: performance metrics

So we've covered how to make the most of the particular hardware we might have.  But what if we were to start fresh and try to decide on what hardware would make the most of our software licenses?  I'll try and cover three typical scenarios that seem to come up and how to make the best of them.

I work at a company with Millions and Billions of Dollars in Revenue and purchasing managers who are either very reasonable and accommodating, or maybe even drunk.  What sort of hardware should I purchase for my department?
So there's a few pieces of good news here.

  1. Your purchasing managers are in fact very reasonable.  FEA software licenses are very expensive, running into the high 5 figures per year for enough to service a small department.  Compute servers capable of fully exploiting these licenses, on the other hand, are typically in the low 5 figures every few years.   Spending the same amount on software and half as much on hardware could easily halve your analysis throughput for a very small dollar savings.
  2. There are fewer options at the high end, which makes decisions about underlying architectures easier.
So on to specifics.  The basic decisions are:

  • Hardware Vendor
  • Host Operating System
  • Processor architecture
  • GPUs
  • Memory type and quantity
  • Hard drive types and RAID type
  • Network Interconnects

Hardware Vendor: If you're the sort of company that doesn't build its own multi-acre server farms, just purchase from a company you've heard of before, one that offers long term support contracts.

Host Operating System: Some people will fight the urge to use Red Hat Enterprise Linux 64-bit.  Don't.  Use Red Hat like everyone else.  Spend your time configuring the load monitoring and dispatch software.

Processor Architecture: The most important things are single core floating-point performance, number of cores, and memory throughput.  AMD's Bulldozer architecture and Itanium are out, as both have lost sight of floating point performance as their main performance metric in recent versions.  Nehalem/Westmere are also out, as their memory bandwidth is somewhat lacking at the higher end of CPU count.  This leaves Sandy/Ivy Bridge.  Buy these.  By all means buy the newest and fastest, as many as will fit in one server board.  They'll even come with Intel's Advanced Vector Extensions, which are better supported by Nastran than GPUs are.

GPUs: If you use Abaqus, maybe.  If you primarily use Explicit or the unsymmetric solver, then they will be of no benefit.  Otherwise, go ahead.

Memory Type and Quantity: RDIMMs, until no more can be installed; 300GB to 400GB per server.  Special high speed memory modules seem to be losing the market's interest.

Hard drive types and RAID type:  The host operating system doesn't even really need to be RAIDed for speed.  Long term storage should be offloaded to whatever your organization typically uses.  The speed and capacity of the high speed RAID array will matter more if the output is large, less if it is not.  Order something that seems large enough and that is standard for your vendor.  In contrast to years past, it just won't matter that much.  SSD or 15K RPM server drive, it's not as big a deal as it once was.

Network Interconnects: Consider Infiniband if you run a large amount of Abaqus/Explicit or very large nonlinear Abaqus/Standard models with large numbers of iterations.  Otherwise Gbit Ethernet should suffice.  Some Nastran DMP solutions won't see any difference at all, due to how the domains can be separated.

So spend away!  And laugh in two years at what an outdated piece of junk you can't believe is still around!

I work in a lab at a University and we just received a grant for THOUSANDS OF DOLLARS of server hardware. 

The advice is fairly similar to the above.  The most important thing to do is to fill up one server with memory before buying the next, and if you can only partially fill one, stop there.  Whereas tens of GB is oftentimes sufficient for CFD cluster nodes, hundreds of GB is more desirable for FEA work.  You may very well find yourself with a single server with only half as much memory as the maximum installed, when you could have had half a dozen boxes on a shelf with less memory apiece.  Don't worry about it; when your models are big there is absolutely no substitute for a big enough chunk of RAM.  And when they're small, several processes on one machine will have much less trouble talking to each other than over a network.

So in short, follow the below list in order until you run out of money:
  • Buy a dual socket Sandy Bridge server motherboard, case, power supply, and at least one reasonably large hard drive.  Consider whether you could use a GPU.
  • Buy a fast 8-core processor for one of the sockets
  • Buy RDIMMs until you've filled half the memory
  • Maybe buy a GPU?
  • Buy a second 8-core processor
  • Buy more RDIMMs until the memory is full
  • Maybe buy another GPU?
  • Maybe RAID the hard drive?
  • Buy an identical server and a Gigabit switch.
  • Buy two more servers
  • With four servers, start to consider Infiniband, depending on FEA software
That should fairly easily cover the range from a few thousand to 100K or so.

I work for a boss whose bonus is determined by his department's IT spending.  Our computers are so old we're getting calls from a computer museum in California.


You have two options:

Option 1: Coffee and Crashed Drives
Any magnetic hard drive more than two years old is on borrowed time.  This time can be significantly reduced if one were to, say, run jobs with far too little memory allotted to them, whilst kicking the case in an attempt to incite a head crash.  When the inevitable happens, tell your boss tales of how if you'd only had enough memory the hard drive wouldn't have worn out, plus your computer could run everyone else's jobs and not have to be replaced for a long time.  Your coworkers will now be beholden to you to run their jobs for them, and while the jobs are running you can tell lies about how much they slow down your computer, which is why you're going to get more coffee.

Option 2: Take up a collection (of RAM)
There's a fair chance that some computers in the office are faster than others, and that the computer with the fastest processor has room for more RAM to be installed.  Take as much as can be spared from all the other computers, and load up The Chosen One.  If you're lucky, your computer cases will be unlocked so you can actually get at the hardware, and The Chosen One will have an x86-64 processor, which has been available since 2003, and a motherboard that can take more than 4GB of RAM.  If you're very lucky you'll have a BIOS that will let you boot from a CD-ROM or USB drive, so you can install a free 64-bit Linux operating system alongside Windows XP 32-bit.  If you're very clever you'll then run the old Windows XP inside a virtual machine so a boss passing by will be none the wiser.

And if you get caught, it's always better to ask forgiveness than permission.