ccont: a tool that burns CPUs on different NUMA nodes and measures execution time.

Description:
    The goal is to measure cache contention across NUMA nodes by burning
    different sets of CPUs with different instructions and different load
    patterns. The three load patterns below were executed on a machine with
    2 NUMA nodes and 8 CPUs:

    o cpu-increase - on each iteration the number of burning CPUs is increased by one:

      # ./ccont --load cpu-increase --op cmpxchg
      Nodes  N0   N1  CPUs    operation       min       max       avg     stdev
       CPUs *--- ----    1      cmpxchg     8.938     8.938     8.938     0.000
       CPUs **-- ----    2      cmpxchg    36.114    36.119    36.117     0.004
       CPUs ***- ----    3      cmpxchg    54.270    54.272    54.271     0.001
       CPUs **** ----    4      cmpxchg    72.292    72.321    72.313     0.013
       CPUs **** *---    5      cmpxchg    61.691   108.060    98.782    20.735
       CPUs **** **--    6      cmpxchg   101.316   136.923   125.059    18.369
       CPUs **** ***-    7      cmpxchg   151.639   169.218   161.702     9.358
       CPUs **** ****    8      cmpxchg   192.281   196.250   194.281     2.098

    o node-cascade - on each iteration all CPUs of one node are burned, node by node:

      # ./ccont --load node-cascade --op cmpxchg
      Nodes  N0   N1  CPUs    operation       min       max       avg     stdev
       CPUs **** ----    4      cmpxchg    72.287    72.322    72.310     0.016
       CPUs ---- ****    4      cmpxchg    72.327    72.333    72.330     0.003

    o cpu-rollover - on each iteration one executor thread moves to a CPU on
    the next node, keeping the number of burning CPUs the same:

      # ./ccont --load cpu-rollover --op cmpxchg
      Nodes  N0   N1  CPUs    operation       min       max       avg     stdev
       CPUs **** ----    4      cmpxchg    48.769    48.774    48.772     0.002
       CPUs ***- *---    4      cmpxchg    85.506    97.754    94.683     6.118
       CPUs **-- **--    4      cmpxchg   116.803   121.450   119.108     2.658
       CPUs *--- ***-    4      cmpxchg    91.312   103.877   100.721     6.273
       CPUs ---- ****    4      cmpxchg    48.288    48.368    48.323     0.038
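
    The CPU maps printed in the tables above follow directly from the three
    pattern definitions. The following is a minimal C sketch of that mapping,
    assuming the example machine (2 nodes with 4 CPUs each). All function
    names are hypothetical and not taken from the ccont source, and
    cpu_rollover() is written for two nodes only:

```c
#include <assert.h>

enum { NODES = 2, CPUS_PER_NODE = 4 };  /* assumed: the example machine */

/* Render per-node burn counts as e.g. "***- *---".
 * `out` must hold NODES * (CPUS_PER_NODE + 1) bytes. */
static void render_mask(const int burned[NODES], char *out)
{
    for (int n = 0; n < NODES; n++) {
        for (int c = 0; c < CPUS_PER_NODE; c++)
            *out++ = c < burned[n] ? '*' : '-';
        *out++ = n + 1 < NODES ? ' ' : '\0';
    }
}

/* cpu-increase: step s (0-based) burns s+1 CPUs, filling node 0 first. */
static void cpu_increase(int step, int burned[NODES])
{
    assert(step >= 0 && step < NODES * CPUS_PER_NODE);
    int left = step + 1;
    for (int n = 0; n < NODES; n++) {
        burned[n] = left < CPUS_PER_NODE ? left : CPUS_PER_NODE;
        left -= burned[n];
    }
}

/* node-cascade: step s burns every CPU of node s and nothing else. */
static void node_cascade(int step, int burned[NODES])
{
    assert(step >= 0 && step < NODES);
    for (int n = 0; n < NODES; n++)
        burned[n] = n == step ? CPUS_PER_NODE : 0;
}

/* cpu-rollover (two nodes only): after step s, s threads have moved
 * from node 0 to node 1; the total stays at CPUS_PER_NODE. */
static void cpu_rollover(int step, int burned[NODES])
{
    assert(step >= 0 && step <= CPUS_PER_NODE);
    burned[0] = CPUS_PER_NODE - step;
    burned[1] = step;
}
```

    Calling a pattern function for each step and passing the result to
    render_mask() reproduces the "CPUs" column, e.g. cpu_increase(4, ...)
    corresponds to the row with 5 CPUs.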

    The memory chunk for each load is always allocated on node #0.

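    The results show that tasks scattered across NUMA nodes perform badly for
    the cmpxchg instruction (cpu-rollover pattern): with 4 CPUs the average
    rises from about 48.8 when all of them run on one node to about 119.1
    when they are split evenly. Execution on the remote node alone is not so
    bad, because of the L3 cache (node-cascade pattern: 72.310 vs 72.330).
    Increasing the number of CPUs can degrade performance by a factor of
    about 22 (average 8.938 on 1 CPU vs 194.281 on 8 CPUs) because of cache
    line contention (cpu-increase pattern).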
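
    The cache line contention effect can be illustrated with a small
    stand-alone C sketch. It is not the ccont implementation: the names
    (run_contended, burn_cmpxchg, MAX_THREADS) are hypothetical, a 64-byte
    cache line is assumed, and threads are neither pinned to CPUs nor is
    memory bound to a node, so it shows only the contention on one shared
    word, not the NUMA placement:

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <time.h>

#define MAX_THREADS 64  /* hypothetical upper bound for this sketch */

/* One shared word on its own cache line (64 bytes assumed). */
static struct { _Alignas(64) uint64_t var; } shared;

struct burner { pthread_t tid; long spins; };

/* Each thread increments the shared word `spins` times with a
 * compare-and-swap retry loop. */
static void *burn_cmpxchg(void *arg)
{
    struct burner *b = arg;
    for (long i = 0; i < b->spins; i++) {
        uint64_t old = __atomic_load_n(&shared.var, __ATOMIC_RELAXED);
        /* On failure `old` is refreshed with the current value; retry. */
        while (!__atomic_compare_exchange_n(&shared.var, &old, old + 1, 0,
                                            __ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
            ;
    }
    return NULL;
}

/* Run `nthreads` burners and return wall-clock seconds (thread start-up
 * included), or -1.0 if a thread could not be created. */
static double run_contended(int nthreads, long spins)
{
    struct burner b[MAX_THREADS];
    struct timespec t0, t1;
    int started = 0, err = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);
    shared.var = 0;
    timespec_get(&t0, TIME_UTC);
    for (; started < nthreads; started++) {
        b[started].spins = spins;
        if (pthread_create(&b[started].tid, NULL, burn_cmpxchg, &b[started]) != 0) {
            err = 1;
            break;
        }
    }
    for (int i = 0; i < started; i++)
        pthread_join(b[i].tid, NULL);
    timespec_get(&t1, TIME_UTC);
    if (err)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

    Calling run_contended() with a growing thread count and a fixed number
    of spins per thread gives a rough analogue of the cpu-increase pattern.
    Absolute numbers depend on the machine and are not comparable with the
    tables above.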

    The following burning operations are supported:

    o "idle" - idle loop:
          used just for calibrating.
              while (spins--)
                   ;

    o "memset64" - memset glibc call:
          memsets 64 bytes (usual cache line size).

    o "memset128" - memset glibc call:
          memsets 128 bytes.

    o "memset256" - memset glibc call:
          memsets 256 bytes.

    o "test_bit" - btl:
          testing a bit, used for test_bit() in Linux kernel.
              var & (1 << bit)

    o "set_bit" - bts:
          test and set bit, used for test_and_set_bit() in Linux kernel.
          "test_bit" - name in test results.
              res = var & (1 << bit)
              var |= (1 << bit)

    o "inc" - lock inc:
          increment, used for atomic_inc() in Linux kernel.
              var += 1

    o "xadd" - lock xadd:
          exchange and add, used for __sync_fetch_and_add()
          and similar gcc atomic builtins.
              tmp = src + dst;
              src = dst;
              dst = tmp;

    o "cmpxchg" - lock cmpxchg:
          compare and exchange operands, used for cmpxchg() for all sorts of
          atomic exchanges in Linux kernel.
              res = var
              if (res == old)
                  var = new

    o "mfence" - mfence:
          memory barrier for loads and stores, used for mb() in Linux kernel.

    o "sfence" - sfence:
          memory barrier for stores, used for wmb() in Linux kernel.

    o "lfence" - lfence:
          memory barrier for loads, used for rmb() in Linux kernel.
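
    The semantics of the bit and atomic operations listed above can be
    sketched in C with the gcc __sync builtins. This is an illustration
    only: the op_* names are hypothetical, ccont itself may use inline
    assembly, and the instruction the compiler actually emits for a builtin
    may differ from the one named in the list:

```c
#include <assert.h>
#include <stdint.h>

/* "test_bit": non-zero if `bit` is set in *var (plain load). */
static int op_test_bit(const uint64_t *var, unsigned bit)
{
    assert(bit < 64);
    return (int)((*var >> bit) & 1);
}

/* "set_bit": atomically set `bit` and return its previous value. */
static int op_set_bit(uint64_t *var, unsigned bit)
{
    assert(bit < 64);
    uint64_t mask = UINT64_C(1) << bit;
    return (__sync_fetch_and_or(var, mask) & mask) != 0;
}

/* "inc": atomic increment, old value not needed. */
static void op_inc(uint64_t *var)
{
    __sync_fetch_and_add(var, 1);
}

/* "xadd": atomic add that returns the value seen before the add. */
static uint64_t op_xadd(uint64_t *var, uint64_t add)
{
    return __sync_fetch_and_add(var, add);
}

/* "cmpxchg": store `new_val` only if *var == old; return the value seen. */
static uint64_t op_cmpxchg(uint64_t *var, uint64_t old, uint64_t new_val)
{
    return __sync_val_compare_and_swap(var, old, new_val);
}

/* Full barrier for loads and stores, the role "mfence" plays. */
static void op_mfence(void)
{
    __sync_synchronize();
}
```

    A compare-and-swap succeeded if the returned value equals `old`; a
    failed call leaves the variable unchanged.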