CPoorMansTimingAndProfiling  1.3.6
Instrument C apps simply to profile selected code paths.
 All Data Structures Functions Typedefs Pages

Simple Example

The following code fragment illustrates the most basic use. A single stop watch is created. It is used to measure a single elapsed time. The example also shows the use of the sleep function. The sleep function pauses execution of the thread for the specified number of milliseconds.

#include <stdio.h>
#include <stdlib.h>
#include "timing.h"
int main( void )
{
stopWatch * sw = SWCreate("delay measurement");
SWStart(sw);
SWStop(sw);
printf("Time below should be approximately 40 msec. with 1 iteration.\n");
SWDumpTimings("Results:");
SWDestroy(sw);
return 0;
}

An example console output is:

Time below should be approximately 40 msec. with 1 iteration.
Results:
delay measurement = 41101432.0 ns. total

Timing Repeating Code Section

This example shows the following:

  • Using a conditional to disable timings for production.
  • Enabling the collection and computation of standard deviation and min/max.
  • Timing a repeating code section.
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "timing.h"
#define ENABLE_TIMING 1
void simulate_work() {
SWSleepMillis(51); //simulate elapsed time
}
int main( void )
{
stopWatch* sw;
if (ENABLE_TIMING){
sw = SWCreate("Work time");
}
int i;
for (i = 0; i < 10; i++) {
if (ENABLE_TIMING) SWStart(sw);
simulate_work();
if (ENABLE_TIMING) SWStop (sw);
}
if (ENABLE_TIMING) SWDumpTimings("");
return 0;
}

Example console output:

Work time = 520350905.0 ns. total, (10 * 52035090.5 ns.),  +/- 59541.61 ns.,  coef-of-variation=0.001, min/max 51874521.0/52082855.0 ns., max-min spread 208334.0 ns.

Nested Timing

Stop watches can be used to track overall times with component sub-times. Stop watches are independent of each other. They start tracking time when started and stop accumulating time when stopped.

There is an option that will track the order of activations of stop watches and report times as a call graph. Call graph tracking is enabled with SWSetNestedTiming(1). It is not the nesting of method calls that is important. A stop watch time is considered to be a sub-time if it is started while another stop watch is still active.

A trivial example to illustrate this follows. Note that the start/stop nesting of the four stop watches does not follow the nesting of a(), b(), and c().

#include <stdio.h>
#include <stdlib.h>
#include "timing.h"
stopWatch * sw1;
stopWatch * sw2a;
stopWatch * sw2b;
stopWatch * sw2c;
void c(){
// part 1 of c
SWStart(sw2b);
SWSleepMillis(20); // time is tallied to sw1 and sw2b
SWStop(sw2b);
// part 2 of c
SWStart(sw2c);
SWSleepMillis(30); // time is tallied to sw1 and sw2c
SWStop(sw2c);
SWSleepMillis(30); // time is tallied to sw1
}
void b(){
SWSleepMillis(10); // time is tallied to sw1
SWStart(sw2a);
SWSleepMillis(20); // time is tallied to sw1 and sw2a
SWStop(sw2a);
c();
}
void a(){
SWStart(sw1);
b();
SWStop(sw1);
}
int main( void )
{
sw1 = SWCreate("Outer loop");
sw2a = SWCreate("b()");
sw2b = SWCreate("c()-1");
sw2c = SWCreate("c()-2");
int i;
for (i = 0; i<10; i++){
a();
}
return 0;
}

Console output from this example is:

b() = 209071084.0 ns. total, (10 * 20907108.4 ns.)
c()-1 = 209939378.0 ns. total, (10 * 20993937.8 ns.)
c()-2 = 309792590.0 ns. total, (10 * 30979259.0 ns.)
Outer loop = 1147496720.0 ns. total, (10 * 114749672.0 ns.)
ROOT
  0x7fff72e70960
    Outer loop = 1147496720.0 ns. total, (10 * 114749672.0 ns.)
      b() = 209071084.0 ns. total, (10 * 20907108.4 ns.)
      c()-1 = 209939378.0 ns. total, (10 * 20993937.8 ns.)
      c()-2 = 309792590.0 ns. total, (10 * 30979259.0 ns.)

Clock Resolution, Overhead, Jitter

One should be mindful of clock resolution. Either,

  • The clock resolution is small compared with the duration of the code that one is timing, or,
  • Enough repetitive times are being collected to "average out" the effect of less accurate clocks.

The following code snippet shows how to determine clock resolution and the start/stop software overhead. Overhead is determined internally by making repeated invocations of SWStart() followed by SWStop() in a loop with no other code in between.

double overhead = SWEstimateStartStopOverhead();
long granularity = SWDetermineNanosecondGranularity();
printf("The resolution of the clock on this system = %ld ns.\n", granularity);
printf("The start/stop overhead = %f ns.\n", overhead);

On the other hand if one was very cautious then one would also want to know what typical jitter is for each start/stop measurement. The following snippet shows how this can be determined.

stopWatch *sw = SWCreate("sw");
double overhead = SWEstimateStartStopOverheadOfSW(sw);
long granularity = SWDetermineNanosecondGranularity();
double jitter = SWGetStdDev(sw);
printf("The resolution of the clock on this system = %ld ns.\n", granularity);
printf("The start/stop overhead = %f ns.\n", overhead);
printf("The start/stop jitter is +-%f ns.\n", jitter);

The console output from this second snippet on the authors development system is:

The resolution of the clock on this system = 17 ns.
The start/stop overhead = 19.151236 ns.
The start/stop jitter is +-33.768802 ns.

POSIX Threads

The library is compiled by default with support for POSIX threads. To compile without pthreads support remove the definition of _USE_PTHREADS from CFLAGS in the makefile.

With threading support enabled it is possible to track call sequences for each thread separately. The stop watch reflects all the start/stop invocations across all threads.

The following code will help illustrate this.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <timing.h>
#include <pthread.h>
#define assertEquals(s,exp,act,tolerance) {\
double _e = exp; \
double _a = act; \
if (fabs(_e-_a) > tolerance) { \
printf("ERROR: for %s the expected value of %f did not match the actual value of %f\n", s, _e, _a);\
}\
}
stopWatch *swBody;
stopWatch *swFoo;
stopWatch *swBar;
pthread_t thread1;
pthread_t thread2;
stopWatch *swCounter;
volatile int first = 1;
void delay()
{
SWSleepMillis(20); // 20 milliseconds sleep
}
void bar(){
SWStart(swBar);
delay();
SWStop(swBar);
SWAddDelta(swCounter, 0.0);
}
void foo(){
SWAddDelta(swCounter, 0.0);
SWStart(swFoo);
delay();
SWStop(swFoo);
}
void *body( void * unused){
SWStart(swBody);
foo();
if (first){
first = 0;
bar();
}
SWStop(swBody);
return NULL;
}
void test(){
printf("\nTest threaded nested timing\n");
stopWatch* t1 = SWCreate("A");
stopWatch* t2 = SWCreate("A.1");
stopWatch* t3 = SWCreate("A.2");
swBody = SWCreate("Body");
swFoo = SWCreateStatistic("Body-foo");
swBar = SWCreateStatistic("Body-bar");
swCounter = SWCreateStatistic("counter");
SWStart(t1);
delay(); // 20
int rc = pthread_create(&thread1, NULL, body, NULL);
if (rc){
printf("ERROR; return code from pthread_create() 1 is %d\n", rc);
exit(-1);
}
rc = pthread_create(&thread2, NULL, body, NULL);
if (rc){
printf("ERROR; return code from pthread_create() 2 is %d\n", rc);
exit(-1);
}
SWStart(t2);
delay(); // 40
SWStop(t2);
delay(); // 60
SWStart(t3);
delay(); // 80
SWStop(t3);
delay(); // 100
SWStop(t1);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
// internally time is in nanosecondts
// t2 & t3 should be approximately 20 ms
double a = SWGetCumulative(t1) / 1000000.0;
assertEquals("A", 100.0, a, 10);
double b = SWGetCumulative(t2) / 1000000.0;
assertEquals("A.1", 20.0, b, 10);
double c = SWGetCumulative(t3) / 1000000.0;
assertEquals("A.2", 20.0, b, 15);
// t1 should be approximately t2 + t3 + 60 ms
assertEquals("A should be approximately A.1 + A.2 + 60 ms", 60+ b +c , a, 15);
SWDumpTimings("timings");
// cleanup
SWDestroy(swBar);
SWDestroy(swFoo);
SWDestroy(swBody);
SWDestroy(t3);
SWDestroy(t2);
SWDestroy(t1);
}

In the example there is a method, body() which is run by each of two threads. Thus the method is invoked a total of two times though only once from each thread. In one thread body() invokes foo() and bar() though in the other thread body() invokes only bar(). Thus foo() is invoked a total of two times while bar() is invoked just one time. The output from this example is:

A = 106314247.0 ns. total, [normalized 100.000 = 1 * 100.0000]
Body = 63133710.0 ns. total, (2 * 31566855.0 ns.), [normalized 59.384 = 2 * 29.6920]
A.1 = 21053226.0 ns. total, [normalized 19.803 = 1 * 19.8028]
A.2 = 20656612.0 ns. total, [normalized 19.430 = 1 * 19.4298]
Body-bar = 21062579.0 ns. total,  +/- 0.00 ns.,  coef-of-variation=0.000, [normalized 19.812 = 1 * 19.8116]
Body-foo = 42004972.0 ns. total, (2 * 21002486.0 ns.),  +/- 64650.77 ns.,  coef-of-variation=0.003, [normalized 39.510 = 2 * 19.7551]
counter = 0.0 ns. total, (3 * 0.0 ns.),  +/- 0.00 ns.,  coef-of-variation=nan, [normalized 0.000 = 3 * 0.0000]

ROOT
  0x7fff72e70960
    A = 106314247.0 ns. total, [normalized 100.000 = 1 * 100.0000]
      A.1 = 21053226.0 ns. total, [normalized 19.803 = 1 * 19.8028]
      A.2 = 20656612.0 ns. total, [normalized 19.430 = 1 * 19.4298]
ROOT
  0x104281000
    Body = 63133710.0 ns. total, (2 * 31566855.0 ns.), [normalized 59.384 = 2 * 29.6920]
      Body-foo = 42004972.0 ns. total, (2 * 21002486.0 ns.),  +/- 64650.77 ns.,  coef-of-variation=0.003, [normalized 39.510 = 2 * 19.7551]
      Body-bar = 21062579.0 ns. total,  +/- 0.00 ns.,  coef-of-variation=0.000, [normalized 19.812 = 1 * 19.8116]
ROOT
  0x104304000
    Body = 63133710.0 ns. total, (2 * 31566855.0 ns.), [normalized 59.384 = 2 * 29.6920]
      Body-foo = 42004972.0 ns. total, (2 * 21002486.0 ns.),  +/- 64650.77 ns.,  coef-of-variation=0.003, [normalized 39.510 = 2 * 19.7551]

Simple C++ Example

The following code shows how two timers can be created in C++. Direct use of the StopWatch constructor is discouraged. There are two macros provided to help ensure that the source file name and line number are properly passed to the actual constructor.

Two stop watches are created. They each measure a fixed time.

#include <iostream>
using namespace std;
#include "timing.h"
int main( void )
{
StopWatch * sw = CREATE_STOP_WATCH("timing 1");
sw->start();
sw->stop();
DECLARE_STOP_WATCH(sw2, "timing 2");
sw2.start();
sw2.stop();
cout << "Each time below should be approximately 40 msec. with 1 iteration." << endl;
delete(sw);
return 0;
}

An example console output is:

Each time below should be approximately 40 msec. with 1 iteration.
Results:
timing 1 = 41062326.0 ns. total
timing 2 = 41064048.0 ns. total

OpenCL Profile Event Timing

OpenCL allows you to collect timing data based on clEvent queuing, submission, start, and/or stop. OpenCL timing is based on nanoseconds which means that elapsed times can be used directly as the argument to SWAddDelta() (and StopWatch::addDelta()). However, convenience methods have been added to this library to facilitate easily adding execution timing to a stop watch. The following code fragment illustrates this:

// prior to this OpenCL profiling was enabled
// queue_openCL = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);
stopWatch *sw1 = SWCreate("time on queue prior to start of write");
stopWatch *sw2 = SWCreate("time to write an int array of size 32767");
err = clEnqueueWriteBuffer ( queue_openCL, buf, CL_TRUE, 0, sizeof(int)*len, data, 0, NULL, &event);
SWAddCLEventStartEnd(sw2, event);
SWAddCLEventDeltas(sw1, event, CL_PROFILING_COMMAND_QUEUED, event, CL_PROFILING_COMMAND_START);

Example console output is:

time on queue prior to start of write = 3.428 us. total (1 iter.)
time to write an int array of size 32767 = 174.492 us. total (1 iter.)