Last updates: Tue May 8 19:16:06 2001; Fri Nov 12 15:26:10 2004; Thu Nov 13 18:30:20 2008; Mon Mar 1 16:28:36 2010
OpenMP is a relatively new (1997) development in parallel computing. It is a language-independent specification of multithreading, and implementations are available from several vendors. OpenMP is implemented as comments or directives in Fortran, C, and C++ code, so that its presence is invisible to compilers lacking OpenMP support. Thus, you can develop code that will run everywhere, and, when OpenMP is available, will run even faster.
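As a small illustration (my own sketch, not code from any OpenMP distribution), the directive in the following C fragment is simply ignored by a compiler without OpenMP support, and the standard _OPENMP macro lets the program adapt when the directives are active:

    #include <stdio.h>

    #ifdef _OPENMP
    #include <omp.h>            /* include the OpenMP runtime header only when OpenMP is enabled */
    #endif

    int
    main(void)
    {
        long k;
        double sum = 0.0;

        /* Without OpenMP, this pragma is ignored and the loop runs serially;
           with OpenMP, the iterations are shared among threads and the
           per-thread partial sums are combined by the reduction clause. */
    #pragma omp parallel for reduction(+: sum)
        for (k = 1; k <= 10000000L; ++k)
            sum += 1.0 / (double)k;

    #ifdef _OPENMP
        printf("compiled with OpenMP: up to %d threads\n", omp_get_max_threads());
    #endif
        printf("partial harmonic sum = %.6f\n", sum);
        return (0);
    }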
The OpenMP Consortium maintains a very useful Web site at http://www.openmp.org/, with links to vendors and resources.
There is an excellent overview of the advantages of OpenMP over POSIX threads (pthreads) and PVM/MPI in the paper OpenMP: A Proposed Industry Standard API for Shared Memory Processors, also available in HTML and PDF. This is a must-read if you are getting started in parallel programming. It contains two simple examples programmed with OpenMP, pthreads, and MPI. The paper also gives a very convenient tabular comparison of OpenMP directives with Silicon Graphics parallelization directives.
OpenMP can be used on uniprocessor and multiprocessor systems with shared memory. It can also be used in programs that run on homogeneous or heterogeneous distributed memory environments, which are typically supported by systems like Linda, MPI, and PVM, although the OpenMP part of the code will only provide parallelization on those processors providing shared memory. In distributed memory environments, the programmer must manually partition data between processors, and make special library calls to move the data back and forth. While that kind of code can also be used in shared memory systems, OpenMP is much simpler to program. Thus, you can start parallelization of an application using OpenMP, and then later add MPI or PVM calls: the two forms of parallelization can peacefully coexist in your program.
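Here is a rough C sketch (my own, not from the paper) of that coexistence: MPI partitions the work among distributed-memory processes, while an OpenMP directive parallelizes each process's local loop across its shared-memory CPUs.

    #include <stdio.h>
    #include <mpi.h>

    int
    main(int argc, char *argv[])
    {
        int rank, size;
        long i, n = 100000000L;
        double local = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);                 /* distributed-memory setup */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each MPI process handles its own stride of the loop; OpenMP
           threads share that local work on the process's own CPUs. */
    #pragma omp parallel for reduction(+: local)
        for (i = rank; i < n; i += size)
            local += 1.0 / (double)(i + 1);

        /* Combine the per-process partial sums on process 0. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("sum of %ld terms = %.6f\n", n, total);

        MPI_Finalize();
        return (0);
    }

Such a program is typically built with the MPI wrapper compiler (for example, mpicc) together with the vendor's OpenMP option from the table farther down.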
An extensive bibliography on multithreading, including OpenMP, is available at http://www.math.utah.edu/pub/tex/bib/index-table-m.html#multithreading. MPI and PVM are covered in a separate bibliography: http://www.math.utah.edu/pub/tex/bib/index-table-p.html#pvm
OpenMP benchmark: computation of pi
This simple benchmark for the computation of pi is taken from the paper above. Its read statement has been modified to read from stdin instead of the non-redirectable /dev/tty, and an extra final print statement has been added to show an accurate value of pi.
Follow this link for the source code, a shell script to run the benchmark, a UNIX Makefile, and a small awk program to extract the timing results for inclusion in tables like the ones below.
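The distributed benchmark is in Fortran; purely as an illustration (the variable names and input handling here are my own, not those of the actual source), the same midpoint-rule approximation of pi can be written in C with OpenMP as follows:

    #include <stdio.h>

    int
    main(void)
    {
        long i, n;
        double h, x, sum, pi;

        /* Number of intervals in the midpoint rule, read from stdin so
           that it can be redirected from a file rather than /dev/tty. */
        if (scanf("%ld", &n) != 1 || n <= 0)
            return (1);

        h = 1.0 / (double)n;
        sum = 0.0;

        /* The only parallelization needed: divide the loop iterations
           among threads, keep x private to each thread, and combine the
           per-thread partial sums. */
    #pragma omp parallel for private(x) reduction(+: sum)
        for (i = 1; i <= n; ++i)
        {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x * x);
        }

        pi = h * sum;
        printf("computed pi = %.15f\n", pi);
        printf("accurate pi = %.15f\n", 3.141592653589793);
        return (0);
    }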
The following table lists the compiler options required to enable recognition of OpenMP directives during compilation:
Vendor | Compiler | OpenMP option |
Compaq/DEC | f90 | -omp |
Compaq/DEC | f95 | -omp |
IBM | xlf90_r | -qsmp=omp -qfixed |
IBM | xlf95_r | -qsmp=omp -qfixed |
PGI | pgf77 | -mp |
PGI | pgf90 | -mp |
PGI | pgcc | -mp |
PGI | pgCC | -mp |
SGI | f77 | -mp |
Once you have compiled with OpenMP
support, the
executable may still not run multithreaded, unless you
preset an environment variable that defines the number of
threads to use. On most of the above systems, this variable
is called OMP_NUM_THREADS. This has no effect on the
IBM systems; I'm still trying to find out what is expected
there.
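As a quick check (my own addition, not part of the benchmark), a few lines of C can report how many threads the OpenMP runtime will actually use, which makes it easy to verify that OMP_NUM_THREADS was picked up:

    #include <stdio.h>
    #include <omp.h>

    int
    main(void)
    {
        /* omp_get_max_threads() reflects OMP_NUM_THREADS (or the
           implementation default) before any parallel region starts. */
        printf("maximum threads: %d\n", omp_get_max_threads());

    #pragma omp parallel
        {
    #pragma omp master
            printf("threads in parallel region: %d\n", omp_get_num_threads());
        }
        return (0);
    }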
When the Compaq/DEC benchmark below was run, there was one other single-CPU-bound process on the machine, so we should expect to have only 3 available CPUs. As the number of threads increases beyond the number of available CPUs, we should expect a performance drop, unless those threads have idle time, such as from I/O activity. For this simple benchmark, the loop is completely CPU bound. Evidently, 3 threads make almost perfect use of the machine, at a cost of only two simple OpenMP directives added to the original scalar program. In the tables below, the speedup is the one-thread wallclock time divided by the wallclock time for the given number of threads.
Number of threads | Wallclock Time (sec) | Speedup |
1 | 8.310 | 1.000 |
2 | 4.030 | 2.062 |
3 | 2.780 | 2.989 |
4 | 2.130 | 3.901 |
5 | 3.470 | 2.395 |
6 | 2.930 | 2.836 |
7 | 2.520 | 3.298 |
8 | 2.280 | 3.645 |
Number of threads | Wallclock Time (sec) | Speedup |
1 | 6.210 | 1.000 |
2 | 3.110 | 1.997 |
3 | 4.000 | 1.552 |
4 | 4.390 | 1.415 |
Number of threads | Wallclock Time (sec) | Speedup |
1 | 28.61 | 1.000 |
2 | 14.33 | 1.997 |
3 | 9.61 | 2.977 |
4 | 7.63 | 3.750 |
5 | 9.79 | 2.922 |
6 | 9.80 | 2.919 |
7 | 9.85 | 2.905 |
8 | 13.15 | 2.176 |
The previous two systems were essentially idle when the benchmark was run, and, as expected, the optimal speedup is obtained when the thread count matches the number of CPUs.
The next one is a large shared system on which the load average was about 40 (that is, about 2/3 busy) when the benchmark was run. With a large number of CPUs, the work per thread is reduced, and eventually, communication and scheduling overhead dominates computation. Consequently, the number of iterations was tripled for this benchmark. Since large tables of numbers are less interesting, the speedup is shown graphically as well. At 100% efficiency, the speedup would be a 45-degree line in the plot. With a machine of this size, it is almost impossible to ever find it idle, though it would be interesting to see how well the benchmark would scale without competition from other users for the CPUs.
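To make the efficiency remark concrete (my own note, using numbers from the table that follows):

    speedup S(n) = T(1) / T(n)        parallel efficiency E(n) = S(n) / n

For example, at 22 threads the speedup is 32.651/1.791 = 18.2, so E(22) = 18.2/22, or about 83% of the ideal 45-degree line.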
Number of threads | Wallclock Time (sec) | Speedup |
1 | 32.651 | 1.000 |
2 | 16.348 | 1.997 |
3 | 10.943 | 2.984 |
4 | 8.272 | 3.947 |
5 | 7.178 | 4.549 |
6 | 5.794 | 5.635 |
7 | 4.927 | 6.627 |
8 | 4.446 | 7.344 |
9 | 4.021 | 8.120 |
10 | 3.577 | 9.128 |
11 | 3.409 | 9.578 |
12 | 3.021 | 10.808 |
13 | 2.928 | 11.151 |
14 | 2.645 | 12.344 |
15 | 2.493 | 13.097 |
16 | 2.414 | 13.526 |
17 | 2.208 | 14.788 |
18 | 2.170 | 15.047 |
19 | 2.051 | 15.920 |
20 | 2.051 | 15.920 |
21 | 2.082 | 15.683 |
22 | 1.791 | 18.231 |
23 | 1.824 | 17.901 |
24 | 2.457 | 13.289 |
25 | 2.586 | 12.626 |
26 | 3.134 | 10.418 |
27 | 5.200 | 6.279 |
28 | 5.454 | 5.987 |
29 | 3.431 | 9.516 |
30 | 2.427 | 13.453 |
31 | 3.021 | 10.808 |
32 | 2.418 | 13.503 |
33 | 5.092 | 6.412 |
34 | 7.601 | 4.296 |
35 | 8.790 | 3.715 |
36 | 6.369 | 5.127 |
37 | 6.232 | 5.239 |
38 | 5.588 | 5.843 |
39 | 6.470 | 5.047 |
40 | 7.166 | 4.556 |
41 | 6.218 | 5.251 |
42 | 7.450 | 4.383 |
43 | 6.298 | 5.184 |
44 | 6.475 | 5.043 |
45 | 15.411 | 2.119 |
46 | 7.466 | 4.373 |
47 | 8.293 | 3.937 |
48 | 6.872 | 4.751 |
49 | 8.884 | 3.675 |
50 | 8.006 | 4.078 |
51 | 9.614 | 3.396 |
52 | 25.223 | 1.294 |
53 | 10.789 | 3.026 |
54 | 32.958 | 0.991 |
55 | 35.816 | 0.912 |
56 | 36.213 | 0.902 |
57 | 8.301 | 3.933 |
58 | 11.487 | 2.842 |
59 | 71.526 | 0.456 |
60 | 10.361 | 3.151 |
61 | 52.518 | 0.622 |
62 | 33.081 | 0.987 |
63 | 32.493 | 1.005 |
64 | 95.322 | 0.343 |
(4 EV6 21264 CPUs, 500 MHz, 4GB RAM) OSF/1 4.0F
Number of threads | Wallclock Time (sec) | Speedup |
1 | 26.470 | 1.000 |
2 | 13.260 | 1.996 |
3 | 8.840 | 2.994 |
4 | 6.650 | 3.980 |
5 | 8.080 | 3.276 |
6 | 6.770 | 3.910 |
7 | 6.850 | 3.864 |
8 | 6.670 | 3.969 |
9 | 7.200 | 3.676 |
10 | 7.130 | 3.712 |
11 | 7.120 | 3.718 |
12 | 6.690 | 3.957 |
13 | 7.180 | 3.687 |
14 | 7.300 | 3.626 |
15 | 7.170 | 3.692 |
16 | 6.710 | 3.945 |
(32 EV6.7 21264A CPUs, 667 MHz, 8GB RAM)
Number of threads | Wallclock Time (sec) | Speedup |
1 | 2.500 | 1.000 |
2 | 1.600 | 1.562 |
3 | 1.300 | 1.923 |
4 | 1.500 | 1.667 |
5 | 2.000 | 1.250 |
6 | 2.000 | 1.250 |
7 | 1.800 | 1.389 |
8 | 1.200 | 2.083 |
9 | 1.500 | 1.667 |
10 | 1.900 | 1.316 |
11 | 1.900 | 1.316 |
12 | 1.900 | 1.316 |
13 | 3.200 | 0.781 |
14 | 2.400 | 1.042 |
15 | 1.900 | 1.316 |
16 | 2.200 | 1.136 |
17 | 1.900 | 1.316 |
18 | 1.800 | 1.389 |
19 | 2.100 | 1.190 |
20 | 1.600 | 1.562 |
21 | 2.600 | 0.962 |
22 | 1.500 | 1.667 |
23 | 1.800 | 1.389 |
24 | 1.600 | 1.562 |
25 | 1.500 | 1.667 |
26 | 2.100 | 1.190 |
27 | 1.800 | 1.389 |
28 | 1.700 | 1.471 |
29 | 2.200 | 1.136 |
30 | 2.400 | 1.042 |
31 | 2.100 | 1.190 |
32 | 2.500 | 1.000 |
33 | 2.500 | 1.000 |
34 | 1.900 | 1.316 |
35 | 1.800 | 1.389 |
36 | 2.500 | 1.000 |
37 | 1.600 | 1.562 |
38 | 1.600 | 1.562 |
39 | 2.200 | 1.136 |
40 | 2.500 | 1.000 |
41 | 2.200 | 1.136 |
42 | 1.500 | 1.667 |
43 | 3.100 | 0.806 |
44 | 2.400 | 1.042 |
45 | 2.500 | 1.000 |
46 | 2.400 | 1.042 |
47 | 2.500 | 1.000 |
48 | 1.600 | 1.562 |
49 | 3.300 | 0.758 |
50 | 2.200 | 1.136 |
51 | 2.600 | 0.962 |
52 | 3.200 | 0.781 |
53 | 2.400 | 1.042 |
54 | 1.800 | 1.389 |
55 | 3.000 | 0.833 |
56 | 4.900 | 0.510 |
57 | 1.800 | 1.389 |
58 | 2.700 | 0.926 |
59 | 3.100 | 0.806 |
60 | 2.700 | 0.926 |
61 | 3.600 | 0.694 |
62 | 3.000 | 0.833 |
63 | 2.300 | 1.087 |
64 | 3.700 | 0.676 |
(two 8-core CPUs, 128 threads, 1200 MHz UltraSPARC T2 Plus, 64GB RAM) Solaris 10
(4 CPUs, 16 threads/CPU) GNU/Linux