Tips: What to do when the job runtime exceeds max queue time

Option 1) Get the answers faster:

A) Use the fastest library routines.
     The nodes have fast dense linear algebra routines.  For example, if the
     code solves systems of linear equations, linking with the vendor-supplied
     routines in the MKL library rather than with non-optimized libraries may
     give a large increase in speed.

B) Change to a more efficient algorithm. 
     This is the best option, since you get your answers sooner.  AIT's HPC
     group can help you with the numerical aspects and some algorithm choices,
     but you would need to supply the modeling knowledge.

C) Go parallel.
     The code can be parallelized using either of the following:

     MPI: The program can be rewritten to use MPI.  This often takes a long
          time but usually gives the best performance.

     OpenMP: The program can be modified with OpenMP directives to run
          portions of the program in parallel, and compiled to use all
          16 processors in a single node.  This limits the speedup to at
          most 16x, though, and if not done well it can even slow the
          program down.
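
     The 16x figure is a ceiling that is reached only if the whole program
     runs in parallel.  As a rough illustration, the sketch below applies
     Amdahl's law, a standard idealized model that ignores threading
     overhead.  It is not a measurement of any particular code; the function
     name and the sample parallel fractions are assumptions chosen for the
     example.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Ideal speedup when only `parallel_fraction` of the runtime is parallel.

    Idealized model (Amdahl's law): the serial part takes the same time as
    before, and the parallel part is divided evenly among `workers`.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Sample parallel fractions are illustrative assumptions only.
    for fraction in (0.50, 0.90, 0.99, 1.00):
        speedup = amdahl_speedup(fraction, 16)
        print(f"parallel fraction {fraction:.2f}: {speedup:5.2f}x on 16 processors")
```

     Under this model, a program that spends 90% of its runtime in parallel
     regions speeds up by 6.4x on 16 processors, not 16x, so it is worth
     profiling to see how much of the runtime the directives actually cover.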

           

Option 2) Use checkpointing.

     Major production codes are checkpointed.
     In checkpointing, you periodically save the state of the program in
     a restart file.  Whenever you run your program, it reads the restart
     file and picks up from the last checkpoint.  The advantage is that there
     is no limit on the total amount of time you can use: barring disk crashes
     or total loss of the machine, you just keep resubmitting the same job and
     continue from where you left off.  The costs are the overhead of writing
     each checkpoint and the loss of any work done after the last checkpoint
     whenever the job is stopped.

     You may want to checkpoint in addition to whatever else you do: as soon
     as a research code gets faster, the next step is to run a larger problem,
     and you are back up against the queue time restriction.
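
     The checkpoint and restart cycle described above can be sketched as
     follows.  This is a minimal illustration, not a recipe for any specific
     code: the file name, step counts, and the toy "computation" (a running
     sum) are all assumptions, and a real code would save whatever arrays and
     counters define its state.

```python
import json
import os

CHECKPOINT_FILE = "restart.json"   # hypothetical restart-file name
TOTAL_STEPS = 1000                 # hypothetical problem size
CHECKPOINT_EVERY = 100             # hypothetical checkpoint interval, in steps


def load_state(path=CHECKPOINT_FILE):
    """Return the saved state, or a fresh one if no restart file exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0.0}


def save_state(state, path=CHECKPOINT_FILE):
    """Write the checkpoint to a temporary file, then rename it into place.

    The rename replaces the old checkpoint in one operation, so a job killed
    in the middle of a write does not leave a truncated restart file.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def run(max_steps_this_job, path=CHECKPOINT_FILE):
    """Resume from the last checkpoint and advance at most the given steps.

    `max_steps_this_job` stands in for the queue time limit: steps taken
    after the last checkpoint are not saved, mirroring a job that is stopped.
    """
    state = load_state(path)
    end = min(TOTAL_STEPS, state["step"] + max_steps_this_job)
    while state["step"] < end:
        state["total"] += state["step"]      # stand-in for the real work
        state["step"] += 1
        if state["step"] % CHECKPOINT_EVERY == 0 or state["step"] == TOTAL_STEPS:
            save_state(state, path)
    return state
```

     Each submission of the job would call run() once.  In this sketch, a job
     that manages 350 steps leaves a checkpoint at step 300, and the next job
     picks up from there; the 50 steps after the last checkpoint are redone.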