Question

Repeated short identical parallel jobs

I have an algorithm that requires many parallel reruns of the same code. The code is short and finishes in less than a microsecond. This will be run millions of times which creates some problems when using threads. I'm trying to make use of POSIX threads, but since the overhead of creating new threads is not an option, I need to use some kind of spinning worker thread that is constantly running instead.

Below is a simplified case and illustrates how I'm trying to run two equally long works, calc_x and calc_y. I need to be able to run them in parallel so that the total time taken is like running just one (or very close).

#include <pthread.h>
#include <time.h>
#include <stdatomic.h>
#include <stdio.h>

static int nloops = 100;
static int x;
static int y;
static pthread_t tid;
static atomic_bool trun;
static atomic_bool texit;
static int cnt = 0;
static const int MILLION = 1000000;


void calc_x() {
    x = 0;
    for (int i = 0; i < nloops; ++i) {
        x += i;
    }
}

void calc_y() {
    y = 0;
    for (int i = 0; i < nloops; ++i) {
        y += i;
    }
}

void *worker() {
    while (1) {
        if (trun) {
            calc_x();
            trun = false;
        }
        if (texit) {
            break;
        }
    }
    return NULL;
}

float run_x() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < MILLION; ++i) {
        calc_x();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)*0.000000001;
}

float run_y() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < MILLION; ++i) {
        calc_y();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)*0.000000001;
}

float run_xy() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < MILLION; ++i) {
        calc_x();
        calc_y();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)*0.000000001;
}

float run_xy_parallel() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < MILLION; ++i) {
        trun = true;
        calc_y();
        while(trun) ++cnt;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec)*0.000000001;
}

int main() {
    trun = false;
    texit = false;
    pthread_create(&tid, NULL, &worker, NULL);
    printf("run_x: %.4f s\n", run_x());
    printf("run_y: %.4f s\n", run_y());
    printf("run_xy: %.4f s\n", run_xy());
    printf("run_xy_parallel: %.4f s\n", run_xy_parallel());
    printf("average nr of waiting loops: %.1f\n", (float)cnt/MILLION);
    texit = true;
    pthread_join(tid, NULL);
    return 0;
}

The results are:

run_x: 0.3049 s
run_y: 0.2559 s
run_xy: 0.5074 s
run_xy_parallel: 0.4744 s
average nr of waiting loops: 34.9

As you can see, there is practically no gain from using the parallel worker. The count (cnt) I'm using to try to measure how long time is spent on waiting suggests that 35/nloops = 35% of the time of calc_x is spend waiting for the other thread to finish, which seems right since there is a time difference between run_x and run_y. So I changed nloops to 1000 to see if anything changed. I got these results:

run_x: 2.4921 s
run_y: 2.4965 s
run_xy: 5.0950 s
run_xy_parallel: 3.3477 s
average nr of waiting loops: 44.8

Now the waiting count is less than 5% of calc_x. But still it's 1/3 slower than than what I expect.

It seems to me that setting trun takes alot of time, but it doesn't make sense that that time increases with nloops. Can anyone please explain this to me?

2 134 2

1 Jan 1970

Solution

I have done some investigation. I tried changing MILLION - it didn't help. Then I read about false sharing, and tried to work with local copies of x and y. new code of calc_x, calc_y :


void calc_x() {
    int xloc = 0;
    for (int i = 0; i < nloops; ++i) {
        xloc += i;
    }
    x = xloc;
}

void calc_y() {
    y = 0;
    int yloc = 0;
    for (int i = 0; i < nloops; ++i) {
        yloc += i;
    }
    y = yloc;
}

And problem completely disappeared! Here are times:

# non local version (MILLION is 10000 nloops = 100000)

run_x: 6.1838 s
run_y: 6.1812 s
run_xy: 12.3375 s
run_xy_parallel: 7.8648 s
average nr of waiting loops: 2741.4

# static version with x and y divided by buf[100]

static int x;
static int buf[100];
static int y;

... Result times: ...

run_x: 1.8734 s
run_y: 1.8513 s
run_xy: 3.7081 s
run_xy_parallel: 1.8359 s
average nr of waiting loops: 143.8


# MILLION is 10000 nloops = 100000

run_x: 1.7826 s
run_y: 1.7460 s
run_xy: 3.5052 s
run_xy_parallel: 1.7342 s
average nr of waiting loops: 242.5

# MILLION is 100000 nloops = 10000

run_x: 1.7939 s
run_y: 1.7632 s
run_xy: 3.5415 s
run_xy_parallel: 1.7667 s
average nr of waiting loops: 52.2

program sped up about 5 times (may be local variable can be easily moved to register???), and parallel version runs exactly 2 times faster, as it should. Almost the same speed up and parallel efficiency was achieved by putting 400 bytes buffer between x and y, so they can be cached independently. May be in your version static x and y near each other really prevented caching, as some people here wrote. And generally to gain with parallel computations you should hold programs as independent as possible.

2024-07-15

mishagam

Solution

Not sure why your question is getting downvoted that much. I've had to do this kind of performance analysis and it's both tricky and necessary for certain kinds of applications (e.g. financial services).

My best guess pinning threads to cores. Most decent-size servers have multiple CPU sockets, so you should taskset your process to ensure that it will run on cores restricted to a single NUMA zone. Even if your server only has a single CPU socket, Linux likes to migrate threads across cores, which prevents optimal cache usage. Ideally, you would explicitly pin the threads to cores chosen to be as close as possible, cache-wise. I have found that even on a single CPU system, the choice of cores to pin to has a significant impact on performance. Try different ones till you find a pair that gives you the best performance.

I think "false sharing" was a good guess. X86 processors all have 64-byte cachlines, so you should definitely add a char array of (64-sizeof(int)) between each of the globals (or at least the "hot" ones). In fact, I'm not sure C guarantees globals to be packed in a predictable way, so you might want to put them in a struct.

And your idea of increasing nloops was a good one. Synchronizing the threads after only 100 loops isn't enough.

A lot of my work has been related to network performance using kernel bypass drivers like Onload. Hardware configuration plays into it, trying to make sure that all memory accesses, including the NIC, are local. Interrupts also play a role, stealing cycles from the CPU you're trying to monopolize. There are any number of hacks you can do to isolate cores from the Linux scheduler and interrupt processing, although this can be quite a rabbit hole. Here are some links from several years ago when I did a bunch of research:

Here's another wonderful resource for seeing the consequences of various coding approaches: https://godbolt.org/

2024-07-15

Streve Ford