CPoorMansTimingAndProfiling
1.3.6
Instrument C apps simply to profile selected code paths.
|
The following code fragment illustrates the most basic use. A single stop watch is created. It is used to measure a single elapsed time. The example also shows the use of the sleep function. The sleep function pauses execution of the thread for the specified number of milliseconds.
An example console output is:
Time below should be approximately 40 msec. with 1 iteration. Results: delay measurement = 41101432.0 ns. total
This example shows the following:
Example console output:
Work time = 520350905.0 ns. total, (10 * 52035090.5 ns.), +/- 59541.61 ns., coef-of-variation=0.001, min/max 51874521.0/52082855.0 ns., max-min spread 208334.0 ns.
Stop watches can be used to track overall times with component sub-times. Stop watches are independent of each other. They start tracking time when started and stop accumulating time when stopped.
There is an option that will track the order of activations of stop watches and report times as a call graph. Call graph tracking is enabled with SWSetNestedTiming(1). It is not the nesting of method calls that is important. A stop watch time is considered to be a sub-time if it is started while another stop watch is still active.
A trivial example to illustrate this follows. Note that the start/stop nesting of the four stop watches does not follow the nesting of a(), b(), and c().
Console output from this example is:
b() = 209071084.0 ns. total, (10 * 20907108.4 ns.) c()-1 = 209939378.0 ns. total, (10 * 20993937.8 ns.) c()-2 = 309792590.0 ns. total, (10 * 30979259.0 ns.) Outer loop = 1147496720.0 ns. total, (10 * 114749672.0 ns.) ROOT 0x7fff72e70960 Outer loop = 1147496720.0 ns. total, (10 * 114749672.0 ns.) b() = 209071084.0 ns. total, (10 * 20907108.4 ns.) c()-1 = 209939378.0 ns. total, (10 * 20993937.8 ns.) c()-2 = 309792590.0 ns. total, (10 * 30979259.0 ns.)
One should be mindful of clock resolution. Either,
The following code snippet shows how to determine clock resolution and the start/stop software overhead. Overhead is determined internally by making repeated invocations of SWStart() followed by SWStop() in a loop with no other code in between.
On the other hand if one was very cautious then one would also want to know what typical jitter is for each start/stop measurement. The following snippet shows how this can be determined.
The console output from this second snippet on the authors development system is:
The resolution of the clock on this system = 17 ns. The start/stop overhead = 19.151236 ns. The start/stop jitter is +-33.768802 ns.
The library is compiled by default with support for POSIX threads. To compile without pthreads support remove the definition of _USE_PTHREADS from CFLAGS in the makefile.
With threading support enabled it is possible to track call sequences for each thread separately. The stop watch reflects all the start/stop invocations across all threads.
The following code will help illustrate this.
In the example there is a method, body() which is run by each of two threads. Thus the method is invoked a total of two times though only once from each thread. In one thread body() invokes foo() and bar() though in the other thread body() invokes only bar(). Thus foo() is invoked a total of two times while bar() is invoked just one time. The output from this example is:
A = 106314247.0 ns. total, [normalized 100.000 = 1 * 100.0000] Body = 63133710.0 ns. total, (2 * 31566855.0 ns.), [normalized 59.384 = 2 * 29.6920] A.1 = 21053226.0 ns. total, [normalized 19.803 = 1 * 19.8028] A.2 = 20656612.0 ns. total, [normalized 19.430 = 1 * 19.4298] Body-bar = 21062579.0 ns. total, +/- 0.00 ns., coef-of-variation=0.000, [normalized 19.812 = 1 * 19.8116] Body-foo = 42004972.0 ns. total, (2 * 21002486.0 ns.), +/- 64650.77 ns., coef-of-variation=0.003, [normalized 39.510 = 2 * 19.7551] counter = 0.0 ns. total, (3 * 0.0 ns.), +/- 0.00 ns., coef-of-variation=nan, [normalized 0.000 = 3 * 0.0000] ROOT 0x7fff72e70960 A = 106314247.0 ns. total, [normalized 100.000 = 1 * 100.0000] A.1 = 21053226.0 ns. total, [normalized 19.803 = 1 * 19.8028] A.2 = 20656612.0 ns. total, [normalized 19.430 = 1 * 19.4298] ROOT 0x104281000 Body = 63133710.0 ns. total, (2 * 31566855.0 ns.), [normalized 59.384 = 2 * 29.6920] Body-foo = 42004972.0 ns. total, (2 * 21002486.0 ns.), +/- 64650.77 ns., coef-of-variation=0.003, [normalized 39.510 = 2 * 19.7551] Body-bar = 21062579.0 ns. total, +/- 0.00 ns., coef-of-variation=0.000, [normalized 19.812 = 1 * 19.8116] ROOT 0x104304000 Body = 63133710.0 ns. total, (2 * 31566855.0 ns.), [normalized 59.384 = 2 * 29.6920] Body-foo = 42004972.0 ns. total, (2 * 21002486.0 ns.), +/- 64650.77 ns., coef-of-variation=0.003, [normalized 39.510 = 2 * 19.7551]
The following code shows how two timers can be created in C++. Direct use of the StopWatch constructor is discouraged. There are two macros provided to help ensure that the source file name and line number are properly passed to the actual constructor.
Two stop watches are created. They each measure a fixed time.
An example console output is:
Each time below should be approximately 40 msec. with 1 iteration. Results: timing 1 = 41062326.0 ns. total timing 2 = 41064048.0 ns. total
OpenCL allows you to collect timing data based on clEvent queuing, submission, start, and/or stop. OpenCL timing is based on nanoseconds which means that elapsed times can be used directly as the argument to SWAddDelta() (and StopWatch::addDelta()). However, convenience methods have been added to this library to facilitate easily adding execution timing to a stop watch. The following code fragment illustrates this:
Example console output is:
time on queue prior to start of write = 3.428 us. total (1 iter.) time to write an int array of size 32767 = 174.492 us. total (1 iter.)