CyStorm - Using OpenMP or Auto-parallelism


Parallelism beyond a single node (8 CPUs per node on CyStorm) requires the use of MPI,
but MPI requires major changes to an existing program.  There is a simpler way to obtain
parallelism if you limit yourself to the processors on a single compute node.

That simpler way is OpenMP, a set of directives for expressing parallelism on a
shared memory machine.  Since each node on CyStorm is a shared memory machine with
8 processors, OpenMP can be used to obtain parallelism across those 8 processors.
It requires changes to the program, but far fewer than MPI does.  (The gains are
generally less than with MPI, but greater than with compilers that attempt
automatic parallelization.)

E.g.

  Placing the OpenMP directive

!$OMP PARALLEL DO

just before

do j=2,n-1
  do i=2,m-1
    a(i,j)=(b(i,j+1)+b(i,j-1)+b(i-1,j)+b(i+1,j)+4.d0*b(i,j))/6.d0
  enddo
enddo

signals to an OpenMP compiler that the j loop can be performed on multiple
processors.
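
For reference, a minimal self-contained program built around this loop might look
like the following sketch (the array sizes, initialization values, and program name
are illustrative only, not part of CyStorm's documentation):

program stencil
  implicit none
  integer, parameter :: m = 1000, n = 1000
  double precision :: a(m,n), b(m,n)
  integer :: i, j

  ! Illustrative initialization
  b = 1.d0
  a = 0.d0

  ! The j iterations below are divided among the OpenMP threads
!$OMP PARALLEL DO
  do j=2,n-1
    do i=2,m-1
      a(i,j)=(b(i,j+1)+b(i,j-1)+b(i-1,j)+b(i+1,j)+4.d0*b(i,j))/6.d0
    enddo
  enddo
!$OMP END PARALLEL DO

  print *, 'a(2,2) =', a(2,2)
end program stencil

In Fortran the loop variables i and j are private to each thread by default, while
the arrays a and b are shared.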

To run the program, issue

setenv OMP_NUM_THREADS 8
./a.out

(or export OMP_NUM_THREADS=8 under bash) and the program will run with 8 "threads",
one for each of the node's 8 processors.  Everything runs on a single thread until
the above directive is reached, at which point each thread performs roughly 1/8-th
of the work in the j loop.

Without the -fopenmp flag on the compilation step, the directive is treated as an
ordinary comment and ignored.
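
For example, with the GNU Fortran compiler (the compilers and modules actually
available on CyStorm may differ; stencil.f90 stands in for your own source file):

gfortran -fopenmp stencil.f90 -o a.out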

For C and C++, pragmas are used rather than comment directives; the equivalent of the
directive above is #pragma omp parallel for, placed immediately before the loop, and
the corresponding flag for gcc/g++ is also -fopenmp.

In general, OpenMP programs run fastest when most of the operations are on data that
is "private" rather than "shared".  See the OpenMP Specifications for the meaning of
private and shared data with regard to OpenMP.
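
As a sketch of how those clauses look in practice, the loop above could be written
with explicit data-sharing clauses (the scalar t is introduced here purely for
illustration and is assumed to be declared as double precision alongside i and j):

!$OMP PARALLEL DO PRIVATE(i,t) SHARED(a,b)
do j=2,n-1
  do i=2,m-1
    t=b(i,j+1)+b(i,j-1)+b(i-1,j)+b(i+1,j)
    a(i,j)=(t+4.d0*b(i,j))/6.d0
  enddo
enddo
!$OMP END PARALLEL DO

Each thread gets its own copy of the private variables i and t, while all threads
read the shared arrays a and b and write to disjoint parts of a.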