"Linux Gazette...making Linux just a little more fun!"


Linux Benchmarking - Concepts

by André D. Balsa andrewbalsa@usa.net

With corrections and contributions by Uwe F. Mayer mayer@math.vanderbilt.edu and David C. Niemi bench@wauug.erols.com


This is the first article in a series of 4 articles on Linux Benchmarking, to be published by the Linux Gazette. This article deals with the fundamental concepts in computer benchmarking, as they apply to the Linux OS. An example of a classic benchmark, "Whetstone", is analyzed in more detail.

1. Basic concepts and definitions

2. A variety of benchmarks

3. FPU tests: Whetstone and Sons, Ltd.

4. References


1. Basic concepts and definitions

1.1 Benchmark

A benchmark is a documented procedure that will measure the time needed by a computer system to execute a well-defined computing task. It is assumed that this time is related to the performance of the computer system and that somehow the same procedure can be applied to other systems, so that comparisons can be made between different hardware/software configurations.

1.2 Benchmark results

From the definition of a benchmark, one can easily deduce that there are two basic procedures for benchmarking:

  1. Measuring the time it takes for the system being examined to loop through a fixed number of iterations of a specific piece of code.
  2. Measuring the number of iterations of a specific piece of code executed by the system under examination in a fixed amount of time.

If a single iteration of our test code takes a long time to execute, procedure 1 will be preferred. On the other hand, if the system being tested is able to execute thousands of iterations of our test code per second, procedure 2 should be chosen.
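The two procedures can be sketched in C as follows. This is only a minimal illustration under stated assumptions, not code taken from any particular benchmark: work() is a hypothetical stand-in for the code under test, and now_seconds() is a helper name introduced here that wraps gettimeofday().

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds (helper name introduced for this sketch). */
static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical workload: one "iteration" of the code under test.
   The volatile variable keeps the compiler from removing the work. */
static volatile double sink = 1.0;
static void work(void)
{
    sink = sink * 0.999999 + 0.5;
}

/* Procedure 1: time a fixed number of iterations.
   Returns seconds per iteration. */
double seconds_per_iteration(long iterations)
{
    assert(iterations > 0);
    double start = now_seconds();
    for (long i = 0; i < iterations; i++)
        work();
    return (now_seconds() - start) / (double)iterations;
}

/* Procedure 2: count the iterations completed in a fixed amount of time.
   Returns iterations per second. */
double iterations_per_second(double duration)
{
    assert(duration > 0.0);
    long count = 0;
    double start = now_seconds();
    double elapsed = 0.0;
    while (elapsed < duration) {
        work();
        count++;
        elapsed = now_seconds() - start;
    }
    return (double)count / elapsed;
}
```

Note that this version of procedure 2 reads the clock after every iteration, so for a very short workload the clock call itself is part of what gets measured; a more careful version would check the clock only every so many iterations.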

Both procedures 1 and 2 will yield final results in the form "seconds/iteration" or "iterations/second" (these two forms are interchangeable). One could imagine other algorithms, e.g. self-modifying code or measuring the time needed to reach a steady state of some sort, but this would increase the complexity of the code and produce results that would probably be next to impossible to analyze and compare.

1.3 Index figures

Sometimes, figures obtained from standard benchmarks on a system being tested are compared with the results obtained on a reference machine. The reference machine's results are called the baseline results. If we divide the results of the system under examination by the baseline results, we obtain a performance index. Obviously, the performance index for the reference machine is 1.0. An index has no units; it is just a relative measurement.
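As a small worked example of the division described above, here is a sketch in C. The function name performance_index() is introduced here for illustration, and it assumes both results are expressed in the same "higher is better" units, such as iterations/second.

```c
#include <assert.h>

/* Performance index: the result of the system under test divided by the
   baseline result. Both values must use the same units; the index itself
   is unitless. */
double performance_index(double result, double baseline)
{
    assert(baseline > 0.0);
    return result / baseline;
}
```

With hypothetical figures, a system scoring 75 iterations/second against a baseline of 50 iterations/second gets an index of 1.5, and the reference machine compared with itself gets 1.0.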

1.4 Performance metrics

The final result of any benchmarking procedure is always a set of numerical results which we can call speed or performance (for that particular aspect of our system effectively tested by the piece of code).

Under certain conditions we can combine results from similar tests or various indices into a single figure, and the term metric will be used to describe the "units" of performance for this benchmarking mix.

1.5 Elapsed wall-clock time vs. CPU time

Time measurements for benchmarking purposes are usually taken by defining a starting time and an ending time, the difference between the two being the elapsed wall-clock time. Wall-clock means we are not considering just CPU time, but the "real" time usually provided by an internal asynchronous real-time clock source in the computer or an external clock source (your wrist-watch for example). Some tests, however, make use of CPU time: the time effectively spent by the CPU of the system being tested in running the specific benchmark, and not other OS routines.
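The difference between the two kinds of time is easy to see with a task that waits instead of computing. The following C sketch is an illustration only: it assumes gettimeofday() for wall-clock time and the standard clock() function for CPU time, and the names time_call() and idle_task() are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

struct timing {
    double wall;   /* elapsed wall-clock seconds */
    double cpu;    /* CPU seconds consumed by this process */
};

static double wall_now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

static double cpu_now(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Runs fn() once and reports both the wall-clock and the CPU time it took. */
struct timing time_call(void (*fn)(void))
{
    struct timing t;
    assert(fn != NULL);
    double wall_start = wall_now();
    double cpu_start = cpu_now();
    fn();
    t.cpu = cpu_now() - cpu_start;
    t.wall = wall_now() - wall_start;
    return t;
}

/* Hypothetical task that only waits: wall-clock time passes,
   but almost no CPU time is used. */
void idle_task(void)
{
    sleep(1);
}
```

Calling time_call(idle_task) should report about one second of wall-clock time and close to zero CPU time, whereas a compute-bound task on an otherwise idle machine would show the two figures close together.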

1.6 Resolution and precision

Resolution and precision both measure the information provided by a data point, but should not be confused.

Resolution is the minimum time interval that can be (easily) measured on a given system. In Linux running on i386 architectures I believe this is 1/100 of a second, provided by the GNU C system library function times (see /usr/include/time.h - not very clear, BTW). Another term used with the same meaning is "granularity". David C. Niemi has developed an interesting technique to lower granularity to very low (sub-millisecond) levels on Linux systems; I hope he will contribute an explanation of his algorithm in the next article.
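One way to check the resolution of times() on a given system is sketched below in C. This is an illustration, not David C. Niemi's technique: it assumes a glibc system where times() is declared in <sys/times.h> and the tick rate is available through sysconf(_SC_CLK_TCK); the function names are introduced here.

```c
#include <assert.h>
#include <sys/times.h>
#include <unistd.h>

/* Nominal resolution of times() in seconds: the length of one clock tick. */
double times_resolution(void)
{
    long ticks_per_second = sysconf(_SC_CLK_TCK);
    assert(ticks_per_second > 0);
    return 1.0 / (double)ticks_per_second;
}

/* Smallest step actually observed: busy-wait until the value returned by
   times() changes, then measure how far it jumps at the next change. */
double observed_times_step(void)
{
    struct tms unused;
    clock_t first = times(&unused);
    clock_t t0, t1;

    do {
        t0 = times(&unused);   /* align to the start of a tick */
    } while (t0 == first);
    do {
        t1 = times(&unused);   /* wait for the following tick */
    } while (t1 == t0);

    return (double)(t1 - t0) * times_resolution();
}
```

On a system with 100 ticks per second, times_resolution() gives 0.01, and observed_times_step() should normally return the same value unless the process was preempted while waiting.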

Precision is a measure of the total variability in the results for any given benchmark. Computers are deterministic systems and should always provide the same, identical benchmark results if running under identical conditions. However, since Linux is a multi-tasking, multi-user system, some tasks will be running in the background and will eventually influence the benchmark results.

This "random" error can be expressed as a time measurement (e.g. 20 seconds + or - 0.2 s) or as a percentage of the figure obtained by the benchmark considered (e.g. 20 seconds + or - 1%). Other terms sometimes used to describe variations in results are "variance", "noise", or "jitter".
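Both ways of expressing the error can be computed from a handful of repeated runs, as in the C sketch below. It is an illustration under a simplifying assumption: the spread is taken as half the range between the slowest and fastest run, a crude stand-in for a proper statistic such as the standard deviation. The names struct spread and summarize() are introduced here.

```c
#include <assert.h>
#include <stddef.h>

struct spread {
    double mean;        /* average of the results */
    double plus_minus;  /* half the range between slowest and fastest run */
    double percent;     /* plus_minus as a percentage of the mean */
};

/* Summarizes n repeated benchmark results (e.g. run times in seconds). */
struct spread summarize(const double *results, size_t n)
{
    struct spread s;
    assert(results != NULL && n > 0);

    double sum = 0.0;
    double min = results[0];
    double max = results[0];
    for (size_t i = 0; i < n; i++) {
        sum += results[i];
        if (results[i] < min) min = results[i];
        if (results[i] > max) max = results[i];
    }

    s.mean = sum / (double)n;
    s.plus_minus = (max - min) / 2.0;
    s.percent = (s.mean != 0.0) ? 100.0 * s.plus_minus / s.mean : 0.0;
    return s;
}
```

For three hypothetical runs of 19.8, 20.0 and 20.2 seconds, this gives a mean of 20 seconds with a spread of about 0.2 s, or about 1%.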

Note that whereas resolution is system dependent, precision is a characteristic of each benchmark. Ideally, a well-designed benchmark will have a precision smaller than or equal to the resolution of the system being tested. It is very important to identify the sources of noise for any particular benchmark, since this provides an indication of possibly erroneous results.

1.7 Synthetic benchmark

A program or program suite specifically designed to measure the performance of a subsystem (hardware, software, or a combination of both). Whetstone is an example of a synthetic benchmark.

1.8 Application benchmark

A commonly executed application is chosen and the time to execute a given task with this application is used as a benchmark. Application benchmarks try to measure the performance of computer systems for some category of real-world computing task. Measuring the time your Linux box takes to compile the kernel can be considered as a sort of application benchmark.

1.9 Relevance

A benchmark or its results are said to be irrelevant when they fail to effectively measure the performance characteristic the benchmark was designed for. Conversely, benchmark results are said to be relevant when they allow an accurate prediction of real-life performance or meaningful comparisons between different systems.



2. A variety of benchmarks

The performance of a Linux system may be measured by all sorts of different benchmarks:

  1. Kernel compilation performance.
  2. FPU performance.
  3. Integer math performance.
  4. Memory access performance.
  5. Disk I/O performance.
  6. Ethernet I/O performance.
  7. File I/O performance.
  8. Web server performance.
  9. Doom performance.
  10. Quake performance.
  11. X graphics performance.
  12. 3D rendering performance.
  13. SQL server performance.
  14. Real-time performance.
  15. Matrix performance.
  16. Vector performance.
  17. File server (NFS) performance.

Etc...


3. FPU tests: Whetstone and Sons, Ltd.

Floating-point (FP) instructions are among the least used while running Linux. They probably represent less than 0.001% of the instructions executed on an average Linux box, unless one deals with scientific computations. Besides, if you really want to know how well designed the FPU in your processor is, it's easier to have a look at its data sheet and check how many clock cycles it takes to execute a given FPU instruction. But there are more benchmarks that measure FPU performance than anything else. Why?

  1. RISC, pipelining, simultaneous issuing of instructions, speculative execution and various other CPU design tricks make CPU performance, especially FPU performance, difficult to measure directly and simply. The execution time of an FPU instruction varies depending on the data, and a continuous stream of FPU instructions will execute under special circumstances that make direct predictions of performance impossible in most cases. Simulations (synthetic benchmarks) are needed.
  2. FPU tests are easier to write than other benchmarks. Just put a bunch of FP instructions together and make a loop: voilà!
  3. The Whetstone benchmark is widely (and freely) available in Basic, C and Fortran versions, in case you don't want to write your own FPU test.
  4. FPU figures look good for marketing purposes. Here is what Dave Sill, the author of the comp.benchmarks FAQ, has to say about MFLOPS: "Millions of Floating Point Operations Per Second. Supposedly the rate at which the system can execute floating point instructions. Varies widely between different benchmarks and different configurations of the same benchmarks. Popular with marketing types because it's sounds like a "hard" value like miles per hour, and represents a simple concept."
  5. If you are going to buy a Cray, you'd better have an excuse for it.
  6. You can't get a data sheet for the Cray (or don't believe the numbers), but still want to know its FP performance.
  7. You want to keep your CPU busy doing all sorts of useless FP calculations, and want to check that the chip gets very hot.
  8. You want to discover the next big bug in the FPU of your processor, and get rich speculating with the manufacturer's shares.

Etc...

3.1 Whetstone history and general features

The original Whetstone benchmark was designed in the 60's by Brian Wichmann at the National Physical Laboratory, in England, as a test for an ALGOL 60 compiler for a hypothetical machine. The compilation system was named after the small town of Whetstone, where it was designed, and the name seems to have stuck to the benchmark itself.

The first practical implementation of the Whetstone benchmark was written by Harold Curnow in FORTRAN in 1972 (Curnow and Wichmann together published a paper on the Whetstone benchmark in 1976 for The Computer Journal). Historically it is the first major synthetic benchmark. It is designed to measure the execution speed of a variety of FP instructions (+, *, sin, cos, atan, sqrt, log, exp) on scalar and vector data, but also contains some integer code. Results are provided in MWIPS (Millions of Whetstone Instructions Per Second). The meaning of the expression "Whetstone Instructions" is not clear, though, at least after close examination of the C source code.

During the late 80's and early 90's it was recognized that Whetstone would not adequately measure the FP performance of parallel multiprocessor supercomputers (e.g. Cray and other mainframes dedicated to scientific computations). This spawned the development of various modern benchmarks, many of them with names like Fhoostone, as a humorous reference to Whetstone. Whetstone however is still widely used, because it provides a very reasonable metric as a measure of uniprocessor FP performance.

Whetstone has other interesting qualities for Linux users:

3.2 Getting the source and compiling it

Getting the standard C version by Roy Longbottom.

The version of the Whetstone benchmark that we are going to use for this example was slightly modified by Al Aburto and can be downloaded from his excellent FTP site dedicated to benchmarks. After downloading the file whets.c, you will have to edit the source slightly: a) Uncomment the "#define POSIX1" directive (this enables the Linux compatible timer routine). b) Uncomment the "#define DP" directive (since we are only interested in the Double Precision results).

Compiling

This benchmark is extremely sensitive to compiler optimization options. Here is the line I used to compile it:

  cc whets.c -o whets -O2 -fomit-frame-pointer -ffast-math -fforce-addr -fforce-mem -m486 -lm

Note that some optimization options are buggy in some versions of gcc. Most notably, combining any of -O, -O2, -O3, ... with -funroll-loops can cause gcc to emit incorrect code on a Linux box. You can test your gcc with a short test program available at Uwe Mayer's site. Of course, if your compiler is buggy, then any test results are not written in stone, to say the least (pun intended). In short, don't use -funroll-loops to compile this benchmark, and try to stick to the optimization options listed above.

3.3 Running Whetstone and gathering results

First runs

Just execute whets. Whetstone will display its results on standard output and also write a whets.res file if you give it the information it requests. Run it a few times to confirm that variations in the results are very small.

With L1, L2 or both L1 and L2 caches disabled

Some motherboards allow you to disable the L1 (internal) or L2 (external) caches through the BIOS configuration menus (take a look at the motherboard's manual; the ASUS P55T2P4 motherboard, for example, allows disabling both caches separately or together). You may want to experiment with these settings and/or main memory (DRAM) timing settings.

Without optimization

You can try to compile whets.c without any special optimization options, just to verify that compiler quality and compiler optimization options do influence benchmark results.

3.4 Examining the source code, the object code and interpreting the results

General program structure

The Whetstone benchmark main loop executes in a few milliseconds on an average modern machine, so its designers decided to provide a calibration procedure that will first execute 1 pass, then 5, then 25 passes, etc... until the calibration takes more than 2 seconds, and then guess a number of passes xtra that will result in an approximate running time of 100 seconds. It will then execute xtra passes of each one of the 8 sections of the main loop, measure the running time for each (for a total running time very near to 100 seconds) and calculate a rating in MWIPS, the Whetstone metric. This is an interesting variation on the two basic procedures described in Section 1.
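The calibration idea described above can be sketched in C roughly as follows. This is not the code from whets.c, only an illustration of the same scheme: one_pass() is a hypothetical stand-in for one pass of the main loop, and calibrate() takes the two thresholds (2 and 100 seconds in Whetstone) as parameters so that it can be tried with shorter times.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Hypothetical stand-in for one pass of the benchmark's main loop. */
static volatile double sink = 1.0;
static void one_pass(void)
{
    for (int i = 0; i < 1000; i++)
        sink = sink * 0.999999 + 0.5;
}

/* Wall-clock time needed to execute the given number of passes. */
static double time_passes(long passes)
{
    double start = now_seconds();
    for (long i = 0; i < passes; i++)
        one_pass();
    return now_seconds() - start;
}

/* Runs 1, 5, 25, ... passes until a run takes longer than min_seconds,
   then scales the pass count so that a full run should take roughly
   target_seconds. Returns that estimated number of passes. */
long calibrate(double min_seconds, double target_seconds)
{
    assert(min_seconds > 0.0 && target_seconds > 0.0);

    long passes = 1;
    double elapsed = time_passes(passes);
    while (elapsed <= min_seconds) {
        passes *= 5;
        elapsed = time_passes(passes);
    }

    long xtra = (long)((double)passes * target_seconds / elapsed);
    return (xtra < 1) ? 1 : xtra;
}
```

Calling calibrate(2.0, 100.0) mirrors the thresholds quoted above; smaller values such as calibrate(0.01, 0.05) are handy for a quick trial.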

Main loop

The main loop consists of 8 sections, each containing a mix of various instructions representative of some type of computational task. Each section is itself a very short, very small loop, and has its own timing calculation. The code that gets looped through for section 8, for example, is a single line of C code:

x = sqrt(exp(log(x)/t1)); where x = 0.75 and t1 = 0.50000025, both defined as doubles.
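As a worked check of what this line computes, reading it as sqrt applied to exp(log(x)/t1), the nested calls collapse to a single power of x. This is a derivation from the expression itself, not something stated in the benchmark's documentation:

```latex
\sqrt{\exp\!\left(\frac{\log x}{t_1}\right)}
  = \exp\!\left(\frac{\log x}{2\,t_1}\right)
  = x^{1/(2 t_1)},
\qquad
\frac{1}{2\,t_1} = \frac{1}{1.0000005} \approx 0.9999995
```

With the exponent just below 1 and x starting at 0.75, each pass moves x only very slightly toward 1, so the value stays between 0.75 and 1 however many passes are executed, while the three library calls still have to be evaluated every time.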

Executable code size, library calls

Compiling as specified above with gcc 2.7.2.1, the resulting ELF executable whets is 13,096 bytes long on my system. It calls libc and of course libm for the trigonometric and transcendental math functions, but these should get compiled to very short executable code sequences since all modern CPUs have FPUs with these functions wired in.

General comments

Now that we have an FPU performance figure for our machine, the next step is comparing it to other CPUs. Have you noticed all the data that whets.c asked you for after you had run it for the first time? Well, Al Aburto has collected Whetstone results for your convenience at his site; you may want to download the data file and have a look at it. This kind of benchmarking data repository is very important, because it allows comparisons between various different machines. More on this topic in one of my next articles.

Whetstone is not a Linux-specific test, nor even an OS-specific test, but it certainly is a good test for the FPU in your Linux box, and also gives an indication of compiler efficiency for specific kinds of applications that involve FP calculations.

I hope this gave you a taste of what benchmarking is all about.


4. References

Other references for benchmarking terminology:


Copyright © 1997, André D. Balsa
Published in Issue 22 of the Linux Gazette, October 1997