libcu++ should not define host/device/forceinline (or any other runtime macros)

In our __config we have the following:

libcudacxx/include/cuda/std/detail/__config

Lines 14 to 39 in b8b37d6

    
           #ifdef __CUDACC__ 
        
               #if defined(__clang__) 
        
                   #include <cuda_fp16.h> 
        
                   #define __fp16 __half 
        
               #endif 
        
               #if defined(__FLT16_MANT_DIG__) 
        
                   #include <cuda_fp16.h> 
        
                   #define _Float16 __half 
        
               #endif 
        
               #define _LIBCUDACXX_CUDACC_VER_MAJOR __CUDACC_VER_MAJOR__ 
        
               #define _LIBCUDACXX_CUDACC_VER_MINOR __CUDACC_VER_MINOR__ 
        
               #define _LIBCUDACXX_CUDACC_VER_BUILD __CUDACC_VER_BUILD__ 
        
               #define _LIBCUDACXX_CUDACC_VER                                                  \ 
        
                   _LIBCUDACXX_CUDACC_VER_MAJOR * 100000 + _LIBCUDACXX_CUDACC_VER_MINOR * 1000 + \ 
        
                   _LIBCUDACXX_CUDACC_VER_BUILD 
        
           #else 
        
               #ifndef __host__ 
        
                   #define __host__ 
        
               #endif 
        
               #ifndef __device__ 
        
                   #define __device__ 
        
               #endif 
        
               #ifndef __forceinline__ 
        
                   #define __forceinline__ 
        
               #endif 
        
           #endif

this will define things like __host__ to be empty when we detect that we're not compiling with nvcc in order to silence any errors about unrecognized symbols that would occur from host compilers that don't understand what __host__ is.

That is a reasonable thing in many cases, but there are legitimate use cases where this can break. For example, the following file compiled with a host compiler will fail to compile:

#include <cuda/std/type_traits>
#include <cuda_runtime_api.h>

This will fail due to a redefinition error of __host__ and friends because cuda/std/type_traits will detect we're not compiling with nvcc and define __host__ to empty, but then cuda_runtime_api.h will also define __host__, causing the redefinition error.

One could reasonably argue that the runtime headers should better guard against this situation by undefining __host__ before defining it. However, it is most expedient for us to fix this in libcu++.

The only robust solution I can think of is to avoid ever defining __host__/__device__/__forceinline__ ourselves and instead use our own macros like:

#if defined(__CUDACC__)
  #define _LIBCUDACXX_HOST __host__
#else 
  #define _LIBCUDACXX_HOST
#endif

And use those in place of any where we'd use __host__ (or other macros).

I'd expect we should rarely be using host/device annotations directly anyways. Those should come from __LIBCUDACXX_INLINE_VISIBILITY or just _LIBCUDACXX_EXECUTION_SPACE_SPECIFIER.

It looks like the number of places we use __host__ or __device__ directly are pretty small, and they are almost always together.

We got another two related bug reports, so I have opened #476

	#ifdef __CUDACC__
	#if defined(__clang__)
	#include <cuda_fp16.h>
	#define __fp16 __half
	#endif
	#if defined(__FLT16_MANT_DIG__)
	#include <cuda_fp16.h>
	#define _Float16 __half
	#endif
	#define _LIBCUDACXX_CUDACC_VER_MAJOR __CUDACC_VER_MAJOR__
	#define _LIBCUDACXX_CUDACC_VER_MINOR __CUDACC_VER_MINOR__
	#define _LIBCUDACXX_CUDACC_VER_BUILD __CUDACC_VER_BUILD__
	#define _LIBCUDACXX_CUDACC_VER \
	_LIBCUDACXX_CUDACC_VER_MAJOR * 100000 + _LIBCUDACXX_CUDACC_VER_MINOR * 1000 + \
	_LIBCUDACXX_CUDACC_VER_BUILD
	#else
	#ifndef __host__
	#define __host__
	#endif
	#ifndef __device__
	#define __device__
	#endif
	#ifndef __forceinline__
	#define __forceinline__
	#endif
	#endif