[RFE] Support linking multiple Thrust versions: Add hooks that wrap the `thrust::` namespace in a custom namespace

Question

[RFE] Support linking multiple Thrust versions: Add hooks that wrap the `thrust::` namespace in a custom namespace

nv-dlasalle opened this issue 3 years ago · 10 comments

Problem

Cub allows itself to place into a namespace via CUB_NS_PREFIX and CUB_NS_POSTFIX, such that multiple shared libraries can each utilize their own copy of it (and thus different versions can safely coexist). Static variables used for caching could otherwise cause problems (e.g., https://github.com/NVIDIA/cub/blob/main/cub/util_device.cuh#L212).

Thrust however depends on cub and requires it to not be in another namespace, so users cannot have CUB_NS_PREFIX defined. This means if two libraries use two different versions of thrust (or cub), issues with the caching variables inside of cub can occur.

Possible solutions

A solution would be to add THRUST_NS_PREFIX and THRUST_NS_POSTFIX to allow each library to place the version of thrust it's compiling against within in it's namespace, and either utilize the version of cub in the global namespace, or utilize the version of cub within the same namespace by defining CUB_NS_PREFIX as well.

Another solution, would be to allow users to define something like THRUST_CUB_NS, to tell thrust which namespace to look for cub in:

#define CUB_NS_PREFIX=namespace foobar {
#define CUB_NS_POSTFIX=}
...
#define THRUST_CUB_NS=foobar
#include "thrust/sort.h"

Answer 1 · 2021-03-19T18:28:43.000Z

Updated the title mention the work that needs to be done.

I think we can nicely solve this with these macro sets:

THRUST_CUB_WRAPPED_NAMESPACE
- Set this to a namespace name that will wrap thrust:: AND cub::.
- Preferred method for most usecases.
THRUST_WRAPPED_NAMESPACE / CUB_WRAPPED_NAMESPACE
- Set these to a namespace name that will wrap thrust:: OR cub::.
- Available for odd usecases.
- Overrides THRUST_CUB_WRAPPED_NAMESPACE.
[THRUST|CUB]_NS_[PREFIX|POSTFIX]
- Implementation details, but may be overridden for backwards compat.
- Overrides any overlapping *_WRAPPED_NAMESPACE definitions.
- Macros containing the actual code that implements the namespace wrappings.

@nv-dlasalle Does this sound reasonable for your needs?

Answer 2 · 2021-03-19T23:53:20.000Z

@allisonvacanti This sounds like it would work perfectly for us. Thanks!

Answer 3 · 2021-03-24T10:42:46.000Z

PyTorch is experiencing the same issues pytorch/pytorch#54245. Probably CUB should avoid using static variable for cacheing in the template function

Answer 4 · 2021-03-24T14:40:46.000Z

Some context:

Those caches were added a while back to avoid overhead from some expensive CUDA API calls. Users were seeing a significant impact from these calls under certain workloads, and the caches were necessary for good performance in some critical applications.

I agree that using statics in a header is a fragile solution and, well, generally not a good idea. But we don't really have a lot of other options -- Thrust/CUB are header-only, so we can't place the cache in a library component. C++17 inline variables may provide a nicer workaround eventually, but we can't rely on them yet.

For now, the namespace workaround will be the preferred solution, but I just wanted to share that we're aware of the issue and want to move to a more robust solution when one becomes available.

Answer 5 · 2021-03-24T16:04:59.000Z

Thanks! I found it should be fine using static variable inside a non-template non-inline function. In those cases, gcc won't compile these symbols as UNIQUE. And if every library uses RTLD_LOCAL without UNIQUE symbols, they can only see their own static variables, which avoids the conflict problems.

However, the UNIQUE symbol breaks the RTLD_LOCAL setting, that later library loaded won't instantiate its own static variable. This causes the conflict

Answer 6 · 2021-03-24T16:40:31.000Z

If each library indeed saw the same static variable, that would be fine - the values that are cached are supposed to be the same for all libraries. But in case of pytorch/pytorch#52663 and pytorch/pytorch#54245 a new static is allocated, but its constructor is not called, so there are 0 devices instead of correctly cached number of devices (same would happen for other cached attributes).
That said, separate namespaces for multiple thrust/cub versions sound good.

Answer 7 · 2021-03-25T06:40:40.000Z

Separate namespace could solve this issue, which is also the solution DGL team've adapted dmlc/dgl#2758. However, I still prefer to share my investigation here for people to understand the root cause and avoid similar issue in the future. It took us about one week to figure out the root cause.

I believe the cause is the UNIQUE symbol, however I found there's limited resources explaining this. I'm not an expertise in C++ so my statement here could be wrong.

Some solutions I found

passing -Xcompiler=-fno-gnu-unique to the nvcc compiler, which force compiler not to create any unique symbol.
Second possible solution I found is to make the symbol and the function as hidden when exporting the symbol, this will also change the symbol from UNIQUE to LOCL (but I haven't tried this yet)
Third solution is to set the function as static, which has similar effect as the second solution

Using template function with static variable will result in some unusual behavior. I hope my investigation here can help people find a better solution.

Reference:

A simple gist for the UNIQUE symbol: dmlc/dgl#2758

Answer 8 · 2021-06-21T16:51:02.000Z

#1464 and NVIDIA/cub#326 provide the new namespace customization hooks. With those applied, defining THRUST_CUB_WRAPPED_NAMESPACE="foo" before including any Thrust/CUB headers will move the thrust:: and cub:: namespaces to foo::thrust:: and foo::cub::. By specifying different namespaces for different dynamic libraries, collisions and ambiguity can be avoided.

Alternatively, if THRUST_CUB_USE_ANON_NAMESPACE defined, all of thrust:: and cub:: will be placed in anonymous namespaces. Similar to the suggestion of using static functions, this should also address the issue, though it will likely bloat binary size as symbols can no longer be reused by different TUs in each library.

I'm wrapping up testing and reviews, but these are passing initial tests. If anyone gets a chance to try these out and see if they fix their dynamic linking problems, please let me know.

Answer 9 · 2021-07-19T19:35:03.000Z

The fix for this has landed. Define THRUST_CUB_WRAPPED_NAMESPACE to a unique name for each library linked together, and all of thrust:: and cub:: will be placed in the requested outer namespace. This will avoid symbol collisions between libraries.

For more info:

Thrust implementation: https://github.com/NVIDIA/thrust/blob/main/thrust/detail/config/namespace.h
CUB implementation: https://github.com/NVIDIA/cub/blob/main/cub/util_namespace.cuh

Answer 10 · 2021-07-19T19:35:49.000Z

(Note that the anonymous namespace macros have been removed. These were interacting badly with nvcc, and will not be implemented in the forseeable future.)