Tips: What to do when a job's runtime exceeds the maximum queue time

Option 1) Get the answers faster:

A) Use the fastest library routines.
     The nodes have fast dense linear algebra routines. For example, if
     your code calls such routines to solve systems of linear equations,
     linking with the vendor-supplied versions may give a large increase in
     speed. Link with -lacml rather than with non-optimized libraries.

B) Change to a more efficient algorithm.
     This is the best option, since it gets you your answers sooner. AIT's
     HPC group can help with the numerical aspects and some algorithm
     choices, but you would need to supply the modeling knowledge.

C) Go parallel.
     The code can be parallelized in either of two ways:

     MPI: The program is rewritten to use MPI. This often takes a long
          time but usually gives the best performance.

     OpenMP: OpenMP directives are added so that portions of the program
          run in parallel, and the program is compiled to use all 16
          processors in a single node. The speedup is limited to 16x,
          though, and if not done well this can even slow a program down.
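
     As a rough illustration of the OpenMP route, the sketch below adds
     directives to two loops. The function names axpy and dot are
     hypothetical and stand in for loops in your own code. It assumes a
     compiler with OpenMP support (with GCC, the -fopenmp flag); without
     that flag the directives are ignored and the code still runs, serially.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: y[i] = a * x[i] + y[i].
 * Each iteration touches a different element, so the iterations are
 * independent and can safely be divided among threads. */
void axpy(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);

    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        y[i] += a * x[i];
}

/* Hypothetical kernel: dot product of x and y.
 * Every iteration updates the same variable, so a reduction clause is
 * needed: each thread keeps a private partial sum, and the partial sums
 * are combined when the loop ends. */
double dot(size_t n, const double *x, const double *y)
{
    double sum = 0.0;

    assert(x != NULL && y != NULL);

    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += x[i] * y[i];

    return sum;
}
```

     The number of threads is normally set at run time with the
     OMP_NUM_THREADS environment variable. Leaving out the reduction
     clause in dot would be one example of "not done well": the threads
     would race on sum and the result would be wrong.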

           

Option 2) Use checkpointing.

     Major production codes are checkpointed. In checkpointing, you
     periodically save the state of the program in a restart file. Each
     time the program starts, it reads the restart file and picks up from
     the last checkpoint. The advantage is that there is no limit on the
     total amount of time you can use: barring disk crashes or total loss
     of the machine, you just keep resubmitting the same job, and each run
     starts from where the previous one left off. The costs are the
     overhead of writing each checkpoint and the loss of any work done
     after the last checkpoint whenever the job is stopped.

     You may want to do this whatever else you do: as soon as a research
     code gets faster, the next step is to run a larger problem, which puts
     you back up against the queue time restriction.
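
     A minimal sketch of the restart-file pattern follows. Everything in
     it is illustrative: the State struct, the function names, and the
     binary file layout are hypothetical, and steps_this_job merely stands
     in for the queue time limit. It assumes a POSIX system, where
     rename() replaces an existing file, and a restart file read back on
     the same kind of machine that wrote it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical program state: everything needed to resume the run. */
typedef struct {
    long   step;   /* number of steps completed so far */
    double value;  /* stand-in for the real solution data */
} State;

/* Write the state to a temporary file, then rename it over the restart
 * file, so a job killed in mid-write never leaves a truncated restart
 * file behind. Returns 0 on success, nonzero on failure. */
int save_checkpoint(const char *path, const State *s)
{
    char tmp[4096];
    FILE *f;
    int ok;

    assert(path != NULL && s != NULL);
    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;                      /* path too long */
    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    ok = (fwrite(s, sizeof *s, 1, f) == 1);
    ok = (fclose(f) == 0) && ok;
    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path);
}

/* Read the restart file. Returns 0 on success; on failure (for example
 * on the first run, when no restart file exists) *s is left unchanged. */
int load_checkpoint(const char *path, State *s)
{
    State in;
    FILE *f;
    int ok;

    assert(path != NULL && s != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    ok = (fread(&in, sizeof in, 1, f) == 1);
    fclose(f);
    if (ok)
        *s = in;
    return ok ? 0 : -1;
}

/* One job submission: resume from the restart file if there is one, run
 * at most steps_this_job steps, and checkpoint every `interval` steps
 * and on completion. Returns the step reached when the job stops. */
long run_job(const char *path, long total_steps, long steps_this_job,
             long interval)
{
    State s = {0, 0.0};
    long stop;

    assert(interval > 0);
    load_checkpoint(path, &s);          /* fresh start if this fails */

    stop = s.step + steps_this_job;
    if (stop > total_steps)
        stop = total_steps;

    while (s.step < stop) {
        s.value += 1.0;                 /* placeholder for real work */
        s.step++;
        if (s.step % interval == 0 || s.step == total_steps)
            save_checkpoint(path, &s);
    }
    return s.step;
}
```

     With total_steps = 10, steps_this_job = 6 and interval = 4, the first
     job stops at step 6 but its last checkpoint is from step 4, so the
     second job repeats steps 5 and 6 before finishing. A longer interval
     means less checkpoint overhead but more repeated work after each
     restart.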