IntelligentSoftwareSystems/Galois

OpenMP + Galois

Yang-Yihang opened this issue · 9 comments

I have some C++ code that uses two external libraries. The first uses OpenMP for parallelization, while the second uses Galois. After creating a galois::SharedMemSys object in the main function:

int num_of_thread = 6;
galois::SharedMemSys G;
galois::setActiveThreads(num_of_thread);

I observed that the OpenMP code runs sequentially.

This is my first time using Galois, so I may not have configured it correctly. What is the correct way to use OpenMP alongside Galois?

Hello.

  1. Are you setting the number of OMP threads elsewhere? Galois's setActiveThreads call does not set the number of OMP threads.
  2. Are you running the OMP parallel region concurrently with a Galois parallel region?

Thanks,
Loc

Hi, thank you for your response.

  1. To answer the first question, I ran the following test repeatedly. The full test program is at the end of this comment. At the beginning of the main function there are two code sections:
int main() {
    // section A for Galois
    if (USE_GALOIS) {
        int num_of_thread_galois = 6;
        galois::SharedMemSys G;
        galois::setActiveThreads(num_of_thread_galois);
    }
    // end of section A

    // section B for OpenMP
    int num_of_thread_openmp = USE_SINGLE_OPENMP ? 1 : 6;
    omp_set_num_threads(num_of_thread_openmp);
    // end of section B

   ....
}

This test was performed on a 6-core Ubuntu machine with hyperthreading disabled.
If section A is disabled, the library using OpenMP can utilize all 6 threads, and the 6-thread runtime is shorter than the single-thread runtime.
The result for 1 thread:

$ ./cg
Building triplets time: 0.435237
Setting matrix time: 2.92087
Start solving
Thread used: 1
Solve matrix time: 9.42129

The result for 6 threads:

$ ./cg
Building triplets time: 0.471058
Setting matrix time: 2.82994
Start solving
Thread used: 6
Solve matrix time: 3.74979

If section A is enabled, setting num_of_thread_openmp larger than 1 leads to a slowdown. Setting num_of_thread_openmp to 1 gives the normal single-thread runtime.
With Galois initialized, the result for using one OMP thread:

$ ./cg
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL
Building triplets time: 0.500948
Setting matrix time: 2.8253
Start solving
Thread used: 1
Solve matrix time: 9.45835

and for 6 OMP threads:

$ ./cg
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL
Building triplets time: 0.517053
Setting matrix time: 2.93068
Start solving
Thread used: 6
Solve matrix time: 14.7703
  2. The external library using Galois has not been integrated yet. Once it is integrated, the OMP parallel regions and Galois parallel regions are supposed to run sequentially, not concurrently.

Testing code:

#include <Eigen/Core>
#include <Eigen/Sparse>
#include <iostream>
#include <vector>
#include <sys/time.h>
#include <omp.h>
#include <galois/Galois.h>

using namespace Eigen;
using namespace std;
// Use RowMajor to make use of multi-threading
typedef SparseMatrix<double, RowMajor> SpMat;
typedef Triplet<double> T;
// Assemble a sparse matrix from a list of triplets

double get_wall_time() {
  timeval time;
  if (gettimeofday(&time, nullptr)) {
    //  Handle error
    return 0;
  }
  return (double) time.tv_sec + (double) time.tv_usec * 0.000001;
}

void buildProblem(vector<T>& coefficients, VectorXd& b, int n) {
    b.setZero();
    
    for (int i=0; i<n; ++i) {
        b[i] = 1;
        coefficients.push_back(T(i,i,1));
        for (int j=0; j<5; ++j) {
            int id = random() % n;
            //if (id >= n || id < 0) cout << id << "\n";
            //cout << id << "\n";
            coefficients.push_back(T(i,id,-0.1));
            coefficients.push_back(T(id,i,-0.1));
        }
    }

}

#define USE_GALOIS false
#define USE_SINGLE_OPENMP false

int main() {
    // section A for Galois
    if (USE_GALOIS) {
        int num_of_thread_galois = 6;
        galois::SharedMemSys G;
        galois::setActiveThreads(num_of_thread_galois);
    }
    // end of section A

    // section B for OpenMP
    int num_of_thread_openmp = USE_SINGLE_OPENMP ? 1 : 6;
    omp_set_num_threads(num_of_thread_openmp);
    // end of section B

    int n = 2000000; // size, 2M
    // Assembly:
    double wall_time = get_wall_time();
    vector<T> coefficients; // list of non-zeros coefficients
    VectorXd b(n); // the right hand side-vector resulting from the constraints
    buildProblem(coefficients, b, n);
    wall_time = get_wall_time() - wall_time;
    cout << "Building triplets time: " << wall_time << endl;


    SpMat A(n,n);
    wall_time = get_wall_time();
    A.setFromTriplets(coefficients.begin(), coefficients.end());
    wall_time = get_wall_time() - wall_time;
    cout << "Setting matrix time: " << wall_time << endl;
    // Solving:
    // Use ConjugateGradient with Lower|Upper as the UpLo template parameter to make use of multi-threading
    cout << "Start solving\n";
    wall_time = get_wall_time();
    ConjugateGradient<SpMat, Lower|Upper> solver;
    solver.compute(A);
    printf("Thread used: %d\n", nbThreads());
    VectorXd x = solver.solve(b); // use the factorization to solve for the given right hand side
    wall_time = get_wall_time() - wall_time;
    cout << "Solve matrix time: " << wall_time << endl;
    return 0;
}

Compile command:

g++ -std=c++17 cg.cpp -O3 -I./eigen -I$GALOIS_INSTALL_DIR/include -L$GALOIS_INSTALL_DIR/lib -lgalois_shmem -lnuma  -fopenmp -o cg

The C++17 standard is required by Galois. The program includes the Eigen (3.3.7) and Galois headers, and links against galois_shmem and libnuma.

Is there anything wrong or inappropriate here that might explain why Galois + OpenMP is slower than OpenMP alone?

Hi.

The way your code is written, the Galois runtime is actually destroyed before execution reaches the OMP part of your code (the SharedMemSys object in section A goes out of scope at the end of the if block). This is quite interesting: in theory, the threads that Galois takes control of should be freed up for use, but your runtimes show that doesn't seem to be the case.

I will test your program locally and see if I find anything interesting.

Hi.

I created my own test program based on what you told me and was able to replicate the behavior.

Could you prefix your runs with GALOIS_DO_NOT_BIND_THREADS=1 and try your test again? It should fix the issue you are seeing.

Please let me know how it works.

Hi Loc,

Thank you for your time and help. That fixes the issue, and it works regardless of when the Galois runtime is destroyed.
Destroyed before the OMP code:

$ GALOIS_DO_NOT_BIND_THREADS=1 ./cg
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL
Building triplets time: 0.497852
Setting matrix time: 2.86269
Start solving
Thread used: 6
Solve matrix time: 3.68029

Destroyed after the OMP code:

$ GALOIS_DO_NOT_BIND_THREADS=1 ./cg
Building triplets time: 0.527362
Setting matrix time: 2.89404
Start solving
Thread used: 6
Solve matrix time: 3.79623
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL

I am wondering if I can set/check this GALOIS_DO_NOT_BIND_THREADS environment variable using a Galois function.

Best regards
Yihang

This isn't exactly a fix but more of a workaround. I don't expect your use case to work if Galois isn't destroyed, since the threads will remain bound; but once it is destroyed, DO_NOT_BIND_THREADS should not be necessary. Note that thread binding does have a performance impact, so this isn't an ideal workaround. We'll let you know when we get to fixing this.

The runtime checks for an environment variable, so any method of setting an environment var in code should work. You would have to do it before Galois is initialized so that it picks it up.

Got it. Many thanks for your explanation.
Because in my project the Galois and OMP parallel regions will run sequentially, never concurrently, what I actually need is for the Galois runtime to allow OMP to freely use all cores/threads.
I will do more testing to see if there are any other issues. Please keep me posted. Thank you again.

Random idea here: given that the Galois and OpenMP thread pools are distinct, it'd probably work to just use GALOIS_DO_NOT_BIND_MAIN_THREAD=1 instead. I suspect what's happening is that OpenMP is detecting that you have the affinity set for the main thread and trying to respect that configuration choice.

Maybe it would make sense for us to restore the affinity of the main thread to whatever it was before when the SharedMemSys object gets destroyed? You could probably do that manually as a workaround for now.

Hi Ian,

Sorry for the late reply.
I made a small change in the above cg example because I actually want the SharedMemSys to get destroyed right before exiting the main function so that in the future I can add some Galois parallel regions. The snippet of code now becomes

int main() {
    // section A for Galois
#if USE_GALOIS
    int num_of_thread_galois = 6;
    galois::SharedMemSys G;
    galois::setActiveThreads(num_of_thread_galois);
#endif
    // end of section A

For now I just create a SharedMemSys object without using any Galois iterators yet. I can see that both GALOIS_DO_NOT_BIND_THREADS=1 and GALOIS_DO_NOT_BIND_MAIN_THREAD=1 allow the OpenMP code to use multiple cores.

Here are the results.
With the default Galois settings, all 6 OMP threads run on one core, leading to a slowdown.

$ ./cg
Building triplets time: 0.514851
Setting matrix time: 2.80922
Start solving
Thread used: 6
Solve matrix time: 13.7807
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL

Setting GALOIS_DO_NOT_BIND_THREADS=1 gets the OpenMP code back to normal. Since I have no Galois code running yet, any side effects on Galois are unknown.

$ GALOIS_DO_NOT_BIND_THREADS=1 ./cg
Building triplets time: 0.507869
Setting matrix time: 2.78096
Start solving
Thread used: 6
Solve matrix time: 3.77377
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL

Setting GALOIS_DO_NOT_BIND_MAIN_THREAD=1 also gets the OpenMP code back to normal.

$ GALOIS_DO_NOT_BIND_MAIN_THREAD=1 ./cg
Building triplets time: 0.499516
Setting matrix time: 2.8956
Start solving
Thread used: 6
Solve matrix time: 3.69182
STAT_TYPE, REGION, CATEGORY, TOTAL_TYPE, TOTAL

Maybe I will just set GALOIS_DO_NOT_BIND_MAIN_THREAD=1 for now, and I will keep you posted once I have some Galois code integrated with the OpenMP code.