Previously we've discussed the three main priorities when trying to use the optimal settings for an FEA solver:
- Get off the hard drive
- DMP until you run out of RAM
- SMP until you run out of processors
We also discussed an analogy that we will continue to use.  Essentially, solving an FEA problem is analogous to sorting a deck of cards on a table.  In this analogy, SMP is equivalent to using the same amount of table space (table space being equivalent to memory) but using more people to help sort, with people being equivalent to processors.  DMP, on the other hand, involves breaking up the table space into separate parts, either on one table or on a separate table in a different room.
So given what we now know, how do we accomplish the optimal use of computer resources using specific commercial FEA software?
A Note on System Limitations
This blog post assumes the user already knows the limits of the system they are running their FEA simulations on.  Techniques for checking available system resources vary considerably between Windows and Unix/Linux based environments, and identical hardware may have different limitations based on 32-bit vs. 64-bit operating system limits, among other things.  I plan to cover this topic in more detail in a later post.
Abaqus
Of the three codes I typically cover, Abaqus is the easiest to properly configure for speed.  All Abaqus needs to know is how much memory is free on a computer, how many processors are free on a computer, and what machines it can use.  There is almost nothing else to set.
To begin, start by doing a datacheck.  Assuming you know how much memory and how many cpus are in your computer or server, and that you know how to execute using the command line, start with the command:
abq6101.bat memory="1 gb" datacheck job=jobfile.inp
Note that abq6101.bat is what I use on my local machine.  On Linux servers it will typically be /opt/abaqus/Commands/abq61... depending on version.
Hopefully your job has been carefully prepared and the Abaqus pre-processor will finish without any errors.  On Linux machines, or Windows machines running Cygwin, you can monitor the progress of the datacheck run with the tail command.  Ex:
tail -n 100 -f jobfile.dat
The important data will be held in a section that reads:
 PROCESS      FLOATING PT       MINIMUM MEMORY        MEMORY TO
              OPERATIONS           REQUIRED          MINIMIZE I/O
             PER ITERATION         (MBYTES)           (MBYTES)
  
     1         1.86E+012              691               5623
For this run, at least 691 MB must be available to the Abaqus solver, and no more than 5623 MB needs to be used by it.
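If you would rather not read these numbers by eye, a minimal Python sketch like the one below can pull them out of the .dat text.  It assumes the row layout shown above (process number, floating point operations, minimum MB, MB to minimize I/O); other Abaqus versions may format the table differently, and the function name is my own invention.

```python
import re

# Excerpt copied from the datacheck output shown above.
SAMPLE_DAT = """\
 PROCESS      FLOATING PT       MINIMUM MEMORY        MEMORY TO
              OPERATIONS           REQUIRED          MINIMIZE I/O
             PER ITERATION         (MBYTES)           (MBYTES)

     1         1.86E+012              691               5623
"""

# Assumed row layout: process number, flop count, minimum MB, MB to minimize I/O.
ROW = re.compile(r"^\s*(\d+)\s+(\d+\.\d+E[+-]?\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_memory_estimate(dat_text):
    """Return a list of (minimum_mb, minimize_io_mb), one tuple per data row."""
    rows = []
    for line in dat_text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((int(match.group(3)), int(match.group(4))))
    return rows


print(parse_memory_estimate(SAMPLE_DAT))
```

For a real run you would read jobfile.dat from disk and pass its contents in place of SAMPLE_DAT.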
As an aside, Abaqus handles DMP very differently from Nastran or Optistruct.  Whereas Nastran and Optistruct multiply their memory usage by the number of DMP processes (total memory = memory requested x DMP processes), Abaqus reports the total memory needed by all the DMP processes combined.  This is because Abaqus has put significant development effort into a style of DMP that is not quite full DMP.  It is not akin to a full box of cards being sorted independently on different tables or in different rooms; it is more like several people at one table, with significant effort put into deciding ahead of time who gets which chunk of cards.  In this way, with some effort by the first person to grab the deck, many people can be kept busy on a smaller table, and a single deck can still be worked on across a great many tables.  When there are several decks of cards to sort this becomes less efficient than the Nastran/Optistruct style, but Abaqus' approach is the best for attacking one deck.
Returning to the memory settings, because the memory total is the true memory total, one need only set it to whatever is the maximum actually available on a computer and Abaqus will handle the rest.  Abaqus is even smart enough to only use as much memory as it needs, as long as that is less than what you allocate it.
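As a rough illustration of how a memory budget compares against the two datacheck numbers, here is a small Python sketch.  The function name and the three labels are hypothetical, and the 512 MB budget is made up for the example; the 691 MB and 5623 MB figures are the ones reported by the datacheck above.

```python
def classify_memory(available_mb, minimum_mb, minimize_io_mb):
    """Classify a memory budget against the two datacheck numbers."""
    if available_mb < minimum_mb:
        return "insufficient"   # below MINIMUM MEMORY REQUIRED
    if available_mb < minimize_io_mb:
        return "scratch I/O"    # will run, but leans on scratch files
    return "minimal I/O"        # at or above MEMORY TO MINIMIZE I/O


# 512 MB is a made-up budget; 1 GB and 6 GB are typical memory= requests.
for budget_mb in (512, 1024, 6 * 1024):
    print(budget_mb, classify_memory(budget_mb, 691, 5623))
```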
So what happens if we have a desktop computer with 8GB of ram and a 64-bit operating system with virtual memory, and we try running it with memory ranging from the bare minimum to the maximum?
So what happened?  Wasn't adding memory the most important thing?  Well here's the big surprise: a modern operating system helps a great deal when you do one of two suboptimal things:
- Using less memory than you actually have
- Using more memory than you actually have
If you ask for less than you have, the operating system will be smart enough to use the virtual memory of your computer, which is where hard drive space can substitute for RAM and vice versa, to keep your FEA program off the hard drive as much as it can.  If you try to use more memory than your computer has, it will still use the virtual memory, only this time in the other direction.  As I've mentioned before, you're still better off using the FEA software's own out-of-core scratch memory management, but even if you do something wrong your operating system will usually be there to minimize the damage.
So let's say you now have enough memory.  How do you use more cpus?  Again, the syntax is simple:
abq6101.bat memory="6 gb" cpus=2 job=jobfile.inp
In this case we are requesting 6 GB of memory and 2 cpus.  What are the performance benefits from additional cpus?  Typically, the second cpu nearly doubles performance, 4 cpus give a little more than triple speed, and so on as the returns diminish.  More detail can be found here.
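One way to picture these diminishing returns is Amdahl's law, which is a textbook model and not something Abaqus reports.  The sketch below assumes that 90% of the solve parallelizes; that fraction is my own guess, picked only because it roughly reproduces the rule of thumb above, and real jobs will vary.

```python
def amdahl_speedup(cpus, parallel_fraction):
    """Ideal speedup under Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cpus)


# parallel_fraction=0.9 is an assumed value, not a measured one.
for cpus in (1, 2, 4, 8):
    print(cpus, round(amdahl_speedup(cpus, 0.9), 2))
```

With that assumed fraction the model gives roughly 1.8x on 2 cpus and a little over 3x on 4 cpus.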
How about GPUs?  Although I don't have a system that can take advantage of it, in the latest Abaqus 6.12 release GPU acceleration is selected with the syntax:
abq6121 memory="6 gb" cpus=2 gpus=1 job=jobfile.inp
The above will use two cpus, 6 GB of memory, and 1 GPU (1 video card).  Note that there are significant limits on the solvers that GPUs can be used with: specifically, only the sparse symmetric solver in implicit solutions.  This rules out explicit problems and extremely nonlinear problems that need the unsymmetric solver, such as those with coefficients of friction higher than 0.2.
Finally, the use of more than one host is enabled in the abaqus_v6.env file.  HP-MPI or some other MPI implementation must be enabled, and InfiniBand will yield measurably higher performance than 1 Gbit Ethernet.  When using extra hosts, check that remote shell commands can be run over ssh through the use of key-based logins.  To enable additional hosts, include in the abaqus_v6.env file the following line:
mp_host_list=[['host1',4],['host2',4]]
The first entry is the name of the host on the network, the second the number of cpus that can be used on that host.  Any run that uses more than one cpu and has extra hosts available in the abaqus_v6.env file will take advantage of them. 
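Because the environment file uses Python syntax, the host list can be sanity-checked with a few lines of ordinary Python.  This is only a sketch using the host names from the example above; the helper function is hypothetical and is not part of Abaqus.

```python
# Same structure as the mp_host_list line in abaqus_v6.env.
mp_host_list = [['host1', 4], ['host2', 4]]


def total_cpus(host_list):
    """Sum the cpu counts over every [hostname, cpus] entry."""
    return sum(cpus for _host, cpus in host_list)


print(total_cpus(mp_host_list))
```

This makes it easy to confirm that a cpus= request does not exceed what the listed hosts can supply.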
One final thing to remember is that many of the previous memory and cpu settings were originally only set in the abaqus_v6.env file.  It may be worth checking what parameters remain in the file, either from yourself or your administrator, as many of them have changed over the last few versions.
Optistruct
Optistruct can be optimized in a similar way, by first running a datacheck and then using the correct memory parameters.
If one is in a hurry, one can simply run Optistruct with the settings below:
radioss -core in -cpu 2 jobfile.fem
If sufficient memory is available to run in-core with minimum disk use, it will use it, and in this example it will use two cpus in SMP style.  Note that if insufficient memory is available for in-core it will error out.
If we want to see what a job will need, then we can instead run
radioss -check jobfile.fem
Assuming your model is built well, the .out file will eventually read like this:
MEMORY ESTIMATION INFORMATION :
-------------------------------
 Solver Type is:  Sparse-Matrix Solver
                  Direct Method
 Current Memory (RAM)                                    :     116 MB
 Estimated Minimum Memory (RAM) for Minimum Core Solution:     154 MB
 Recommended Memory (RAM) for Minimum Core Solution      :     154 MB
 Estimated Minimum Memory (RAM) for Out of Core Solution :     189 MB
 Recommended Memory (RAM) for Out of Core Solution       :     211 MB
 Recommended Memory (RAM) for In-Core Solution           :    1515 MB
 Recommended Number of Nodes for OS SPMD Parallel Run    :       1
 (Note: Minimum Core Solution Process is Activated.)
 (Note: The Minimum Memory Requirement is limited by Assembly Module.)       
 (Note: Use param,HASHASSM,yes to avoid assembly module memory bottleneck)
DISK SPACE ESTIMATION INFORMATION :
-----------------------------------
 Estimated Disk Space for Output Data Files              :      18 MB
 Estimated Scratch Disk Space for In-Core Solution       :     208 MB
 Estimated Scratch Disk Space for Out of Core Solution   :    1895 MB
 Estimated Scratch Disk Space for Minimum Core Solution  :    1952 MB
In the above example, we would need a total of 1515 + 208 = 1723 MB of RAM for a maximum reduction in hard drive use.  The syntax for this would be:
radioss -len 1723 -ramdisk 208 -cpu 2 jobfile.fem
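The same arithmetic can be scripted.  The Python sketch below pulls the two relevant figures out of the .out text and rebuilds the command line above.  It assumes the label wording matches the excerpt exactly, and find_mb is a hypothetical helper rather than anything shipped with Optistruct.

```python
import re

# Two lines copied from the .out excerpt above.
SAMPLE_OUT = """\
 Recommended Memory (RAM) for In-Core Solution           :    1515 MB
 Estimated Scratch Disk Space for In-Core Solution       :     208 MB
"""


def find_mb(label, text):
    """Return the integer MB value printed after the given label."""
    match = re.search(re.escape(label) + r"\s*:\s*(\d+)\s*MB", text)
    if match is None:
        raise ValueError(f"label not found: {label}")
    return int(match.group(1))


in_core_mb = find_mb("Recommended Memory (RAM) for In-Core Solution", SAMPLE_OUT)
scratch_mb = find_mb("Estimated Scratch Disk Space for In-Core Solution", SAMPLE_OUT)

# -len covers the in-core solution plus the scratch space held in the ramdisk.
len_mb = in_core_mb + scratch_mb
command = f"radioss -len {len_mb} -ramdisk {scratch_mb} -cpu 2 jobfile.fem"
print(command)
```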
As noted before, additional cpus requested with -cpu add some measure of speed but do not require more RAM.
If a job is of a type that can also use DMP, the DMP processes can be requested using the syntax:
radioss -len 1723 -ramdisk 208 -cpu 2 -mpi -np 4 jobfile.fem
A few things I've found: HP-MPI sometimes has issues, so take care to follow the documentation and show it the path to the HP-MPI files if necessary.  Also note that the total memory demand is now 4 x 1723 = 6892 MB, and the total number of cpus needed is 4 x 2 = 8.  Solution sequences that make good use of DMP tend to be direct frequency and Lanczos eigensolutions, although eigensolutions will nearly always be faster when using a modern multilevel eigensolver.
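Since the per-process settings are multiplied by the number of DMP processes, it is worth checking the totals against the machine before submitting.  A minimal Python sketch of that multiplication, with a function name of my own choosing:

```python
def dmp_totals(len_mb, cpus_per_process, processes):
    """Return (total memory in MB, total cpus) for an Optistruct-style DMP run."""
    return len_mb * processes, cpus_per_process * processes


# Values from the command above: -len 1723 -cpu 2 -np 4
total_mb, total_cpus = dmp_totals(1723, 2, 4)
print(total_mb, total_cpus)
```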
Closing Remarks
As we've seen, different solvers can be very different in how they understand the tuning parameters given to them by users.  Additionally, there is typically a significant amount to be learned before even solving a model by simply invoking a checkout run.  As we will see in the upcoming Nastran entry, checkout runs are unfortunately not as capable in MSC/MD Nastran.