Parallelism beyond a single node (32 CPUs on Lightning3) requires MPI, but MPI usually requires major changes to an existing program. Within a single 32-CPU node, parallelism can be obtained in two ways: automatic parallelization (the -parallel compiler option) or OpenMP (the -openmp compiler option).

The simplest way to get parallel execution is to add -parallel to your (Intel) compile command. Then issue

    setenv OMP_NUM_THREADS 32
    ./a.out

Another simple way to obtain parallelism is OpenMP, which expresses parallelism on a shared memory machine. Since each node on Lightning3 is a shared memory machine with 32 processors, OpenMP can be used to run on up to 32 processors. It requires changes to the program, but not nearly as many as MPI. (The gains are generally smaller than for MPI, but greater than for automatic parallelization.)

For example, placing the OpenMP directive

    !$OMP PARALLEL DO

just before

    do j=2,n-1
       do i=2,m-1
          a(i,j)=(b(i,j+1)+b(i,j-1)+b(i-1,j)+b(i+1,j)+4.d0*b(i,j))/6.d0
       enddo
    enddo

signals to an OpenMP compiler that the iterations of the j loop can be performed on multiple processors. To run, issue

    setenv OMP_NUM_THREADS 32
    ./a.out

and the program will run with 32 "threads", each of which can run on one of the node's 32 processors. Everything runs on just one thread until the directive is reached, at which point each of the threads performs about 1/32 of the work in the j loop. Without the -openmp flag on the compilation step, the directive is treated as a comment and ignored.

For C and C++, pragmas are used rather than directives.

In general, OpenMP programs run fastest when most of the operations are on data that is "private" rather than "shared". See the standard for the meaning of private and shared data in OpenMP. The Intel compilers on Lightning3 implement version 3.1 of the OpenMP API.
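
To illustrate the remark about C and C++ pragmas, the Fortran loop above might be sketched in C roughly as follows. This is only a sketch under stated assumptions: the function name stencil_sweep and the flat array layout a[j*m + i] are hypothetical choices made for this example, not something taken from Lightning3 code. The loop bounds are the 0-based equivalents of the Fortran bounds.

```c
#include <stddef.h>

/* Hypothetical helper (name and array layout are assumptions for this
   sketch): apply the update from the Fortran example to the interior
   points of an m-by-n grid. Element (i,j) is stored at index j*m + i,
   so i is the fastest-varying index, as in the Fortran arrays. */
void stencil_sweep(int m, int n, const double *b, double *a)
{
    /* Counterpart of the !$OMP PARALLEL DO directive: the iterations of
       the j loop are divided among the threads. a and b are shared;
       j and i are private to each thread. */
    #pragma omp parallel for
    for (int j = 1; j < n - 1; j++) {
        for (int i = 1; i < m - 1; i++) {
            size_t c = (size_t)j * (size_t)m + (size_t)i;
            a[c] = (b[c + m] + b[c - m] + b[c - 1] + b[c + 1]
                    + 4.0 * b[c]) / 6.0;
        }
    }
}
```

As with the Fortran version, compiling without the -openmp flag leaves a correct serial loop, because the pragma is then ignored. When compiled with -openmp, the number of threads is again taken from OMP_NUM_THREADS.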