Why are core-to-core latencies stochastic when the executable is run twice in succession but not stochastic within a run?
1 vote · 0 answers · 79 views
I'm benchmarking core-to-core latency on my server to find good core affinities. I set the core affinities of two threads to different CPUs and time the latency of messages passed between the threads. The message is passed via a std::atomic. Runtimes are computed with the harness at https://github.com/fuatu/core-latency-atomic
Core affinities are assigned (POSIX) via
void set_affinity(long cpu_num) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_num, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}
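Note that the return value of pthread_setaffinity_np is discarded above, so a rejected mask would go unnoticed. A minimal checked sketch, assuming glibc so that sched_getcpu is available (set_affinity_checked is a name of my choosing):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE 1 // pthread_setaffinity_np and sched_getcpu are glibc extensions
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

static bool set_affinity_checked(unsigned int cpu_num)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_num, &cpuset);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
    if (rc != 0)
    {
        std::fprintf(stderr, "pthread_setaffinity_np(%u) failed: %d\n", cpu_num, rc);
        return false;
    }
    // After a successful call the thread is confined to cpu_num,
    // so sched_getcpu() should report that CPU.
    std::fprintf(stderr, "thread now running on CPU %d\n", sched_getcpu());
    return true;
}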
Runtimes are measured via an atomic accessed by two threads:
enum State
{
    Preparing,
    Ready,
    Ping,
    Pong,
    Finish,
};

class Sync
{
public:
    // Spin while the state still equals wait_state; return the new state.
    State wait_as_long_as(State wait_state)
    {
        State loaded_state = state.load();
        while (loaded_state == wait_state)
            loaded_state = state.load();
        return loaded_state;
    }

    // Spin until the state equals expected_state.
    void wait_until(State expected_state)
    {
        while (state.load() != expected_state)
        {
        }
    }

    // CAS rather than a plain store, so a stale transition cannot
    // overwrite a state the other thread has already advanced past.
    void set(State new_state, State expected_state)
    {
        //state.store(new_state);
        state.compare_exchange_strong(expected_state, new_state);
    }

private:
    std::atomic<State> state{Preparing};
};

static void set_affinity(unsigned int cpu_num)
{
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu_num, &cpuset);
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

struct LatencyBench
{
    LatencyBench(long first_cpu_, long second_cpu_)
        : first_cpu{first_cpu_}
        , second_cpu{second_cpu_}
    {
    }

    void operator()(nonius::chronometer meter) const
    {
        Sync sync;
        set_affinity(first_cpu);
        std::thread t([&] {
            set_affinity(second_cpu);
            sync.set(Ready, Preparing);
            State state = sync.wait_as_long_as(Ready);
            while (state != Finish)
            {
                //if (state == Ping)
                sync.set(Pong, Ping);
                state = sync.wait_as_long_as(Pong);
            }
        });
        sync.wait_until(Ready);
        // start timer
        sync.set(Ping, Ready);
        sync.wait_until(Pong);
        // stop timer
        sync.set(Finish, Pong);
        t.join();
    }

    const long first_cpu;
    const long second_cpu;
};
Runtimes are only measured during the thread communication; they do not include the time to start ./a.out or the time to start the second thread.
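For reference, the elided "start timer" / "stop timer" section could be filled in without nonius by timing one Ping/Pong round trip with std::chrono::steady_clock. This is only a sketch reusing Sync and set_affinity from the listing above (measure_round_trip_ns is a hypothetical name), not the harness's actual measurement code:

#include <chrono>
#include <thread>

// Sketch only: time a single Ping -> Pong round trip with steady_clock.
static long long measure_round_trip_ns(long first_cpu, long second_cpu)
{
    Sync sync;
    set_affinity(first_cpu);
    std::thread t([&] {
        set_affinity(second_cpu);
        sync.set(Ready, Preparing);
        State state = sync.wait_as_long_as(Ready);
        while (state != Finish)
        {
            sync.set(Pong, Ping);
            state = sync.wait_as_long_as(Pong);
        }
    });
    sync.wait_until(Ready);
    auto start = std::chrono::steady_clock::now(); // start timer
    sync.set(Ping, Ready);
    sync.wait_until(Pong);
    auto stop = std::chrono::steady_clock::now();  // stop timer
    sync.set(Finish, Pong);
    t.join();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

A round trip measured this way includes the cache-line handoff in both directions, so the one-way latency is roughly half the returned value.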
The measured latencies are robust across replicate trials within the same program: if I sleep in main and then run the analysis a second time, the latencies are still within roughly 1 ns of the first measurement. But if I run ./a.out again, the measured latency can change by roughly 40 ns, which can also change which core pair has the best latency. Do you know what could be behind these stochastic latency changes when running the same executable twice?
**Additional details:**
Using numactl -m 0 -N 0 ./a.out and focusing on core pairs within NUMA node 0 doesn't alleviate the problem. Using a server with sub-NUMA nodes configured and staying within one sub-NUMA node likewise doesn't change this. The variability in latency has been replicated on both Xeon and EPYC processors.
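In case it matters, here is roughly what that numactl invocation binds, expressed in-process via libnuma (assuming libnuma is installed and the program is linked with -lnuma; bind_to_node0 is a name of my choosing):

#include <numa.h> // libnuma; link with -lnuma

// Rough in-process equivalent of `numactl -m 0 -N 0`.
static bool bind_to_node0()
{
    if (numa_available() < 0)
        return false; // kernel or libc without NUMA support
    struct bitmask *node0 = numa_allocate_nodemask();
    numa_bitmask_setbit(node0, 0);
    numa_set_membind(node0);         // -m 0: allocate memory on node 0 only
    numa_free_nodemask(node0);
    return numa_run_on_node(0) == 0; // -N 0: run only on node 0's CPUs
}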
I'd expect the repeated analyses within main to likewise suggest it is not a caching issue, since the second analysis in main would likely have everything cached.
Asked by souser (11 rep) on Dec 26, 2023, 08:14 PM
Last activity: Dec 28, 2023, 05:53 AM