martinus/nanobench

Idea? Timing only particular sections of code

Andersama opened this issue · 7 comments

Recently I've been testing a few different sorting algorithms. The rough setup has a preallocated block of memory filled with random data. That means that after each sort I have to scramble the data or generate new random data for the next run, and this code is shared between all the benchmarks. So each benchmark really measures the cost of both things together, rather than the sorting algorithm on its own. Presumably the cost of the sort overwhelms the cost of generating new data, but it's difficult to gauge exactly how much generating data costs without running a benchmark of that part on its own.

It seems that with a few edits to add extra callbacks, it'd be possible to time parts of the code separately, or even ignore parts that aren't really part of the test. If a second callback doesn't make the measurement less stable or slower to run, it'd probably be a handy tool for cases like this.

Might look like:

    template <typename Start, typename Op, typename End>
    ANKERL_NANOBENCH(NOINLINE)
    Bench& run(std::string const& benchmarkName, Start&& start, Op&& op, End&& end);

The problem with that feature is that when the runtime of op is not significantly higher (~2000 times or so) than the measurement resolution, the result will be highly inaccurate.
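To illustrate the concern, here is a minimal sketch in plain `std::chrono` (not nanobench code; `timePerCall` and `timeBatched` are hypothetical names) contrasting a timer that is toggled around every call with a single timer around a whole batch. For a very cheap op, the per-call variant mostly measures the cost and granularity of `Clock::now()` itself.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Starts and stops the clock around every single call of op, then averages.
// Each sample includes clock overhead and is quantized to the clock resolution.
template <typename Op>
double timePerCall(std::uint64_t iters, Op&& op) {
    assert(iters > 0);
    Clock::duration total{};
    for (std::uint64_t i = 0; i < iters; ++i) {
        auto before = Clock::now();
        op();
        auto after = Clock::now();
        total += after - before;
    }
    return std::chrono::duration<double, std::nano>(total).count() /
           static_cast<double>(iters);
}

// Reads the clock only twice for the whole batch, then divides by the batch size.
template <typename Op>
double timeBatched(std::uint64_t iters, Op&& op) {
    assert(iters > 0);
    auto before = Clock::now();
    for (std::uint64_t i = 0; i < iters; ++i) {
        op();
    }
    auto after = Clock::now();
    return std::chrono::duration<double, std::nano>(after - before).count() /
           static_cast<double>(iters);
}
```

Both functions return average nanoseconds per call; comparing them for a trivial op (such as incrementing a `volatile` counter) shows how much the per-call timer distorts the result on a given machine.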

A simple solution is to do two benchmarks. One benchmark that measures Start + Op + End, and another benchmark that measures Start + End. The difference is the runtime for Op.
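The two-benchmark approach can be sketched as follows. This is only an illustration using `std::chrono` directly rather than nanobench; `nsPerCall`, `SplitTiming`, `measureSplit`, and `benchmarkSort` are hypothetical names, and the fixed iteration count stands in for the loop count a real benchmark library would choose.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Clock = std::chrono::steady_clock;

// Average wall-clock nanoseconds per call of fn over `iters` calls.
template <typename Fn>
double nsPerCall(std::uint64_t iters, Fn&& fn) {
    assert(iters > 0);
    auto before = Clock::now();
    for (std::uint64_t i = 0; i < iters; ++i) {
        fn();
    }
    auto after = Clock::now();
    return std::chrono::duration<double, std::nano>(after - before).count() /
           static_cast<double>(iters);
}

struct SplitTiming {
    double total;  // Start + Op + End, ns per call
    double setup;  // Start + End, ns per call
    // Estimate for Op alone; may come out negative when noise dominates.
    double op() const { return total - setup; }
};

// Runs the two benchmarks back to back with the same callbacks.
template <typename Start, typename Op, typename End>
SplitTiming measureSplit(std::uint64_t iters, Start&& start, Op&& op, End&& end) {
    SplitTiming t{};
    t.total = nsPerCall(iters, [&] { start(); op(); end(); });
    t.setup = nsPerCall(iters, [&] { start(); end(); });
    return t;
}

// The sorting case from this issue: Start reshuffles the buffer, Op sorts it.
SplitTiming benchmarkSort(std::size_t n, std::uint64_t iters) {
    assert(n > 0);
    std::vector<int> data(n);
    for (std::size_t i = 0; i < n; ++i) {
        data[i] = static_cast<int>(i);
    }
    std::mt19937 rng(12345);
    volatile int sink = 0;
    return measureSplit(
        iters,
        [&] { std::shuffle(data.begin(), data.end(), rng); },
        [&] { std::sort(data.begin(), data.end()); },
        [&] { sink = data[n / 2]; });  // read one element so the work stays observable
}
```

The subtraction is only meaningful when both runs are long enough to be measured reliably, and the estimate inherits the noise of both measurements.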

Having a feature that runs these two benchmarks automatically and calculates the statistics from them would be nice to have, though. But I really don't want a feature that enables/disables a timer on each run.

What's your gauge for highly inaccurate? I'm fairly confident your library is pretty good at what it's doing. It might have to do with whether a high-performance clock is available; I'm pretty sure your library is using one on my machine. In my benchmark runs I'm looking at some pretty fast functions, roughly a cycle or two at most, and your library has seemingly been able to test those accurately, with roughly 1-3% self-reported error. There's obviously a lot of variance in what the machine is doing at the time, but I'd rather have a real-world benchmark like that on my machine. Should I be re-evaluating those tests?

Nanobench is so accurate because it measures many loops of the operation, not just a single one. It determines how many loops it needs to run for the clock to be reliable, then runs Op, say, 10000 times, and divides the measurement result by 10000.

E.g. on my computer std::chrono::steady_clock has a resolution of about 30 ns, which is already pretty good. That means whatever you want to measure has to run for at least 30 microseconds (about 1000 times the resolution) to get relatively reliable measurements. nanobench tries to figure out the loop count needed to reach those 30 microseconds, and then does the division. So in short, it's not actually measuring each call of Op.
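The arithmetic behind this can be sketched as below. This is not nanobench's actual implementation; `estimateClockResolution` and `itersForReliableBatch` are hypothetical helpers, and the default factor of 1000 is simply the ratio between the 30 ns and 30 microsecond figures mentioned above.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Estimates the clock resolution as the smallest non-zero step observed
// between consecutive readings of the clock.
std::chrono::nanoseconds estimateClockResolution(int samples = 20) {
    auto best = std::chrono::nanoseconds::max();
    for (int i = 0; i < samples; ++i) {
        auto t0 = Clock::now();
        auto t1 = Clock::now();
        while (t1 == t0) {
            t1 = Clock::now();  // spin until the clock visibly advances
        }
        auto diff = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
        if (diff < best) {
            best = diff;
        }
    }
    return best;
}

// Smallest iteration count for which a batch lasts at least
// `factor` clock resolutions, given a rough per-iteration estimate.
std::uint64_t itersForReliableBatch(std::chrono::nanoseconds resolution,
                                    std::chrono::nanoseconds perIterEstimate,
                                    std::uint64_t factor = 1000) {
    assert(perIterEstimate.count() > 0);
    auto target = static_cast<std::uint64_t>(resolution.count()) * factor;
    auto perIter = static_cast<std::uint64_t>(perIterEstimate.count());
    return (target + perIter - 1) / perIter;  // integer division, rounded up
}
```

With a 30 ns resolution and an op that takes roughly 3 ns (an assumed figure for illustration), the batch would need 30000 / 3 = 10000 iterations, and the measured batch time is then divided by that count.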

Ah, OK, maybe I got lost in the weeds here. I sort of figured you were running multiple loops; the settings you have in the API give that away. I might play around then and see if I can implement this.

I'm not familiar enough with how you're going about things in the API, so this is just a skeleton:

template <typename Start, typename Op, typename End>
ANKERL_NANOBENCH_NO_SANITIZE("integer")
Bench& Bench::run(Start&& start, Op&& op, End&& end) {
    // It is important that this method is kept short so the compiler can do better optimizations/ inlining of op()
    detail::IterationLogic iterationLogic(*this);
    auto& pc = detail::performanceCounters();

    detail::IterationLogic iterationLogic2(*this);
    auto &pc2 = detail::performanceCounters();

    while (auto n = iterationLogic.numIters()) {
        pc.beginMeasure();
        Clock::time_point before = Clock::now();
        while (n-- > 0) {
            start();
            op();
            end();
        }
        Clock::time_point after = Clock::now();
        pc.endMeasure();
        pc.updateResults(iterationLogic.numIters());
        iterationLogic.add(after - before, pc);
    }
    // Ideally start() and end() are fast
    while (auto n = iterationLogic2.numIters()) {
        pc2.beginMeasure();
        Clock::time_point before = Clock::now();
        while (n-- > 0) {
            start();
            end();
        }
        Clock::time_point after = Clock::now();
        pc2.endMeasure();
        pc2.updateResults(iterationLogic2.numIters());
        iterationLogic2.add(after - before, pc2);
    }
    // Subtract the results of the second loop from the first?

    return *this;
}

Got a bit confused. I guess in your setup updateResults() or iterationLogic.add() is responsible for adding results? This version is just an automated run of the two callback combinations back to back. Not exactly pretty in the console output, but it works.

template <typename Start, typename Op, typename End>
ANKERL_NANOBENCH_NO_SANITIZE("integer")
Bench& Bench::run(Start&& start, Op&& op, End&& end) {
    // It is important that this method is kept short so the compiler can do better optimizations/ inlining of op()
    detail::IterationLogic iterationLogic(*this);
    auto& pc = detail::performanceCounters();

    detail::IterationLogic iterationLogic2(*this);
    auto &pc2 = detail::performanceCounters();

    while (auto n = iterationLogic.numIters()) {
        pc.beginMeasure();
        Clock::time_point before = Clock::now();
        while (n-- > 0) {
            start();
            op();
            end();
        }
        Clock::time_point after = Clock::now();
        pc.endMeasure();
        pc.updateResults(iterationLogic.numIters());
        iterationLogic.add(after - before, pc);
    }
    // Could probably do with fewer allocations
    std::string title_tmp = name();
    std::string tmp_title = title_tmp + " (setup cost)";
    name(tmp_title);
    // Ideally start() and end() are fast
    while (auto n = iterationLogic2.numIters()) {
        pc2.beginMeasure();
        Clock::time_point before = Clock::now();
        while (n-- > 0) {
            start();
            end();
        }
        Clock::time_point after = Clock::now();
        pc2.endMeasure();
        pc2.updateResults(iterationLogic2.numIters());
        iterationLogic2.add(after - before, pc2);
    }
    iterationLogic.moveResultTo(mResults);
    iterationLogic2.moveResultTo(mResults);
    name(title_tmp);

    return *this;
}

See #86