Timer function to provide time in nanoseconds using C++


I wish to calculate the time it takes for an API to return a value. The time taken for such an action is on the order of nanoseconds. As the API is a C++ class/function, I am using clock() from <ctime> to calculate it:

  #include <ctime>
  #include <iostream>

  using namespace std;

  int main(int argc, char** argv) {

      clock_t start;
      double diff;
      start = clock();
      // ... call the API being timed here ...
      diff = ( std::clock() - start ) / (double)CLOCKS_PER_SEC;
      cout<<"printf: "<< diff <<'\n';

      return 0;
  }

The above code gives the time in seconds. How do I get the same in nanoseconds and with more precision?

7/14/2015 1:22:56 PM

Accepted Answer

What others have posted about running the function repeatedly in a loop is correct.

For Linux (and BSD) you want to use clock_gettime().

#include <time.h>   // declares clock_gettime() and timespec

int main()
{
   timespec ts;
   // clock_gettime(CLOCK_MONOTONIC, &ts); // Works on FreeBSD
   clock_gettime(CLOCK_REALTIME, &ts); // Works on Linux
}

For Windows you want to use QueryPerformanceCounter (QPC).

Apparently there is a known issue with QPC on some chipsets, so you may want to make sure you do not have one of those chipsets. Additionally, some dual-core AMD CPUs may also cause a problem. See the second post by sebbbi, where he states:

QueryPerformanceCounter() and QueryPerformanceFrequency() offer a bit better resolution, but have different issues. For example in Windows XP, all AMD Athlon X2 dual core CPUs return the PC of either of the cores "randomly" (the PC sometimes jumps a bit backwards), unless you specially install AMD dual core driver package to fix the issue. We haven't noticed any other dual+ core CPUs having similar issues (p4 dual, p4 ht, core2 dual, core2 quad, phenom quad).

EDIT 2013/07/16:

It looks like there is some controversy on the efficacy of QPC under certain circumstances as stated in http://msdn.microsoft.com/en-us/library/windows/desktop/ee417693(v=vs.85).aspx

...While QueryPerformanceCounter and QueryPerformanceFrequency typically adjust for multiple processors, bugs in the BIOS or drivers may result in these routines returning different values as the thread moves from one processor to another...

However this StackOverflow answer https://stackoverflow.com/a/4588605/34329 states that QPC should work fine on any MS OS after Win XP service pack 2.

This article shows that Windows 7 can determine if the processor(s) have an invariant TSC and falls back to an external timer if they don't. http://performancebydesign.blogspot.com/2012/03/high-resolution-clocks-and-timers-for.html Synchronizing across processors is still an issue.

See the comments for more details.

5/23/2017 12:32:26 PM

This new answer uses C++11's <chrono> facility. While there are other answers that show how to use <chrono>, none of them shows how to use <chrono> with the RDTSC facility mentioned in several of the other answers here. So I thought I would show how to use RDTSC with <chrono>. Additionally I'll demonstrate how you can templatize the testing code on the clock so that you can rapidly switch between RDTSC and your system's built-in clock facilities (which will likely be based on clock(), clock_gettime() and/or QueryPerformanceCounter).

Note that the RDTSC instruction is x86-specific. QueryPerformanceCounter is Windows only. And clock_gettime() is POSIX only. Below I introduce two new clocks: std::chrono::high_resolution_clock and std::chrono::system_clock, which, if you can assume C++11, are now cross-platform.

First, here is how you create a C++11-compatible clock out of the Intel rdtsc assembly instruction. I'll call it x::clock:

#include <chrono>

namespace x
{

struct clock
{
    typedef unsigned long long                 rep;
    // Digit separators need C++14; write 2800000000 for plain C++11.
    typedef std::ratio<1, 2'800'000'000>       period; // My machine is 2.8 GHz
    typedef std::chrono::duration<rep, period> duration;
    typedef std::chrono::time_point<clock>     time_point;
    static const bool is_steady =              true;

    // Read the 64-bit time stamp counter (returned in EDX:EAX).
    static time_point now() noexcept
    {
        unsigned lo, hi;
        asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return time_point(duration(static_cast<rep>(hi) << 32 | lo));
    }
};

}  // x

All this clock does is count CPU cycles and store the count in an unsigned 64-bit integer. You may need to tweak the assembly language syntax for your compiler. Or your compiler may offer an intrinsic you can use instead (e.g. now() {return __rdtsc();}).

To build a clock you have to give it the representation (storage type). You must also supply the clock period, which must be a compile time constant, even though your machine may change clock speed in different power modes. And from those you can easily define your clock's "native" time duration and time point in terms of these fundamentals.

If all you want to do is output the number of clock ticks, it doesn't really matter what number you give for the clock period. This constant only comes into play if you want to convert the number of clock ticks into some real-time unit such as nanoseconds. In that case, the more accurately you supply the clock speed, the more accurate the conversion to nanoseconds (or milliseconds, or whatever) will be.

Below is example code which shows how to use x::clock. Actually I've templated the code on the clock as I'd like to show how you can use many different clocks with the exact same syntax. This particular test is showing what the looping overhead is when running what you want to time under a loop:

#include <iostream>

template <class clock>
void
test_empty_loop()
{
    // Define real time units
    typedef std::chrono::duration<unsigned long long, std::pico> picoseconds;
    // or:
    // typedef std::chrono::nanoseconds nanoseconds;
    // Define double-based unit of clock tick
    typedef std::chrono::duration<double, typename clock::period> Cycle;
    using std::chrono::duration_cast;
    const int N = 100000000;
    // Do it
    auto t0 = clock::now();
    for (int j = 0; j < N; ++j)
        asm volatile("");  // empty statement the optimizer cannot remove
    auto t1 = clock::now();
    // Get the clock ticks per iteration
    auto ticks_per_iter = Cycle(t1-t0)/N;
    std::cout << ticks_per_iter.count() << " clock ticks per iteration\n";
    // Convert to real time units
    std::cout << duration_cast<picoseconds>(ticks_per_iter).count()
              << "ps per iteration\n";
}

The first thing this code does is create a "real time" unit to display the results in. I've chosen picoseconds, but you can choose any units you like, either integral or floating point based. As an example there is a pre-made std::chrono::nanoseconds unit I could have used.

As another example I want to print out the average number of clock cycles per iteration as a floating point, so I create another duration, based on double, that has the same units as the clock's tick does (called Cycle in the code).

The loop is timed with calls to clock::now() on either side. If you want to name the type returned from this function it is:

typename clock::time_point t0 = clock::now();

(as clearly shown in the x::clock example, and is also true of the system-supplied clocks).

To get a duration in terms of floating point clock ticks one merely subtracts the two time points, and to get the per iteration value, divide that duration by the number of iterations.

You can get the count in any duration by using the count() member function. This returns the internal representation. Finally I use std::chrono::duration_cast to convert the duration Cycle to the duration picoseconds and print that out.

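Using this code is simple:

int main()
{
    std::cout << "\nUsing rdtsc:\n";
    test_empty_loop<x::clock>();

    std::cout << "\nUsing std::chrono::high_resolution_clock:\n";
    test_empty_loop<std::chrono::high_resolution_clock>();

    std::cout << "\nUsing std::chrono::system_clock:\n";
    test_empty_loop<std::chrono::system_clock>();
}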

Above I exercise the test using our home-made x::clock, and compare those results with using two of the system-supplied clocks: std::chrono::high_resolution_clock and std::chrono::system_clock. For me this prints out:

Using rdtsc:
1.72632 clock ticks per iteration
616ps per iteration

Using std::chrono::high_resolution_clock:
0.620105 clock ticks per iteration
620ps per iteration

Using std::chrono::system_clock:
0.00062457 clock ticks per iteration
624ps per iteration

This shows that each of these clocks has a different tick period, as the ticks per iteration is vastly different for each clock. However when converted to a known unit of time (e.g. picoseconds), I get approximately the same result for each clock (your mileage may vary).

Note how my code is completely free of "magic conversion constants". Indeed, there are only two magic numbers in the entire example:

  1. The clock speed of my machine in order to define x::clock.
  2. The number of iterations to test over. If changing this number makes your results vary greatly, then you should probably make the number of iterations higher, or empty your computer of competing processes while testing.
