intel/opencl-intercept-layer

In API Trace Syntactic Consistency

trbauer opened this issue · 6 comments

Observed Behavior

For example clGetPlatformIDs omits arguments and uses , separator instead of : like all other API calls

>>>> clGetPlatformIDs, EnqueueCounter: 1
                     ^ uses comma
Host Time for call 1: clGetPlatformIDs = 1114623933
<<<< clGetPlatformIDs
....
>>>> clGetDeviceIDs: platform = [ Intel(R) OpenCL ], device_type = CL_DEVICE_TYPE_GPU (4), EnqueueCounter: 1
Host Time for call 3: clGetDeviceIDs = 292
<<<< clGetDeviceIDs

Probably the , belongs to the EnqueueCounter suffix,

  1. Another way to view this is clGetPlatformIDs omits its arguments.
  2. Perhaps suggest we should change the API trace to a more consistent syntax so that it's more consumable by tools? See below for a suggestion.

Desired Behavior

Perhaps we could use a syntax like the following.

<API-CALL> ::= <API-CALLING> <API-BODY> <API-CALL-RETURN>
<API-BODY> ::= anything but '<<<<'
<API-CALLING> ::=  '>>>>'  <THREAD-ID-ETC>? <API_NAME> ':' <ARG-LIST> <EXT-INFO>
<API-CALL-RETURN> ::= '<<<<' <THREAD-ID-ETC>? <API-NAME> ' returned '<RETURN-VALUE> (':' <CALL-RETURN-VALUES>)?
<ARG-LIST> ::=  <ARG-NAME> ' = ' <ARG-VALUE>  (',' <ARG-NAME> ' = ' <ARG-VALUE>)| 
<CALL-RETURN-VALUES> :: ':' <ARG-LIST> // values returned via pointer or return value
<EXT-INFO> ::=<empty-string> |  ';' <OTHER-STUFF-LIKE-ENQ-COUNTER>
<ARG-VALUE> ::= <ARG_SCALAR> | <ARG_PTR>
<ARG_SCALAR> = <HEX-VAL> |  '{' <HEX-VAL> (',' <HEX-VAL>)* '}' // structs for things like float4's
<ARG_PTR> = '[' <HEX-VAL> ']' // means the value returned is indirect

This would give us.

>>>> clGetPlatformIds: num_entries = 10, platforms = [0x12345...], num_platforms = [0x22345...]; EnqueueCounter:1
....
<<<< clGetPlatformIds returned CL_SUCCESS: platforms[] = {...platform_id's...}, *num_platforms = 2
// note the return by ptr values are decoded to a minimal extent using API domain specific knowledge (e.g. num_platforms is a single int)

Note, I am not tied to the exact syntax, but am just proposing something consistent for tool consumability. Probably there are a lot of improvements we could apply to the above.

Regarding API's that return a pointer (e.g. clCreateProgram) and return the status via pointer, we might consider untangling them logically to always present the cl_int (status) as the "returned" value and the new object allocated (e.g. program) as a pseudo argument like new_program. Or we just leave it alone and decode the status as a regular "return by pointer" argument (but decode it). I see benefits either way.

Steps to Reproduce

Enable the API trace (with enqueue counters) and observe clintercept_report.txt calls.

An alternate idea is to favor more C'ish syntax and list raw values with optional decoding as

<ARG-VAL-PTR> ::= <HEX-PTR> <DECODING>?
<DECODING> ::= '[' (<DECODING-BITSET> | <DECODING-PTR-LIST> | ... others) ']'

For example

 >>>> clCreateBuffer(0x0000029F50FAD910, 0xF3 [CL_MEM_READ_WRITE|CL_ALLOC_HOST_POINTER|0xF0], 0x400, 0x0000029F555AFG10, 0x1234566)
  • The first param cl_context doesn't have a "friendly" mapping so we just leave it a raw pointer.
  • The second parameter is a bit set decoding as an ORed' together. Any bits that we don't have a mapping for goes at the end as a raw hex value
  • The errcode_ret (final parameter) is a raw pointer 0x1234566 on input since we can only report it at call return. On the return statement we can report the same line, but only include values returned by pointer
    <<<< clCreateBuffer(....,0x1234566 [CL_SUCCESS]) returned 0x0000029F512513B0

The disadvantage to this is that it relies on the programmer knowing the argument name ordering or they have to take the step to look it up (not a hard step).

Yeah, although we try to maintain consistent output for API tracing / call logging, it's all very ad-hoc right now. The "real" way to fix this IMHO is to create a machine readable grammar file for OpenCL APIs and then to automatically generate the API tracing functionality, which would guarantee consistent output, but we aren't quite there yet.

In the meantime, a low hanging fruit change would seem to be to terminate the API trace with some other delimiter (say, ";") before appending any other call logging information, such as the enqueue counter. This is part of your proposed (formal) syntax:

<EXT-INFO> ::=<empty-string> |  ';' <OTHER-STUFF-LIKE-ENQ-COUNTER>

This is a trivial change. Should we start there, and then investigate a way to guarantee consistent output as part of a longer-term improvement?

Yeah. A generative process would be great and extensible to other API intercept projects. But since our API set is just OpenCL we could use an embedded-DSL approach as well. That is we could describe the API declaratively via regular C++ constructs within the intercept method and implement the behavior once for consistency.

E.g. for an intercept of: cl_int void foo(cl_kernel, int *, int)

void on_intercept_FOO(cl_kernel kernel , int *output, int *output_len, int scalar)
{
    return
            api_trace->intercept("foo").
              addArg<cl_kernel>("kernel").
              addOutputPtrArg<int>("output",lengthIs(output_len)).
              withReturn(
                    regular_intercept(k, output, output_len, scalar)
              );
    ... the above is a transparent call to regular_intercept if API logging is off
    ... if logging is on, then the API spec emits the necessary stuff on call entrance and call return
    ... we could use more template features to offload more of the implementation to compile-time
}

I haven't looked to how this fits into the project practice. I just point out that the approach can be embedded instead of having to have some tool generate everything. The OCL header file lacks some domain knowledge (e.g. what's an input pointer and what's an output pointer). So we'll need some annotation anyways.

Here's a little more concrete idea on how to embed this.

Here's an example API intercept with the adjustments.

CL_API_ENTRY cl_int CL_API_CALL CLIRN(clGetPlatformInfo)(
    cl_platform_id platform,
    cl_platform_info param_name,
    size_t param_value_size,
    void* param_value,
    size_t* param_value_size_ret )
{
    CLIntercept*    pIntercept = GetIntercept();

    if( pIntercept )
    {

        return logCall(
          "clGetPlatformInfo",
          [&] () {
            CPU_PERFORMANCE_TIMING_START();

            cl_int  retVal = CL_SUCCESS;

            if( pIntercept->overrideGetPlatformInfo(
                    param_name,
                    param_value_size,
                    param_value,
                    param_value_size_ret,
                    retVal ) == false )
            {
                retVal = pIntercept->dispatch().clGetPlatformInfo(
                    platform,
                    param_name,
                    param_value_size,
                    param_value,
                    param_value_size_ret );
            }

            CPU_PERFORMANCE_TIMING_END();

            // CHECK_ERROR( retVal ); // to be handled by LOG_RETURN_AS_STATUS

            return retVal;
          },
           LOG_PARAM_WITH(platform, formatPlatformId),       // formatPlatformId(platform)     => "platform = 0x1235 [ Intel (R) OpenCL ]"
           LOG_PARAM_WITH(param_name, formatPlatformInfo)    // formatPlatformInfo(param_name) => "param_name = 0x0900 [ CL_PLATFORM_PROFILE ]
           LOG_PARAM(param_value_size),                      // format(param_value_size)       => "param_value_size = 64"
           LOG_OUTPUT_PARAM_WITH3(param_value,formatParamValue,param_name,param_value_size_ret), // formatParamValue(param_value,param_name,param_value_size_ret) => "param_value = \"FULL_PROFILE\""
           LOG_RETURN_AS_STATUS                              // formatErrorCode(return value)  => "returned CL_SUCCESS"
          );
    }
    else
    {
        return dummyDispatch.clGetPlatformInfo(
            platform,
            param_name,
            param_value_size,
            param_value,
            param_value_size_ret );
    }
}

Note, I piloted the above intercept implementation for several other calls (e.g. clGetDeviceInfo, clCreateImage2D, clSetKernelArgument, to ensure the idea holds up consistently. I omit those for brevity.

This all leverages a single templated method logCall.

template <typename RET,PARAMS...>
RET CLIntercept::logCall(
           const char *apiName,
           std::function<R()> apiCall,
           PARAMS...>);

It uses variadic templates to process each parameter. The LOG_PARAM* macros will construct some data structure that describes how each parameter is to be captured. For the pedestrian parameters we can fallback on function overload semantics. A generic (overloaded) format function/method can be defined for all the types we care about (e.g. cl_int, cl_kernel (which emits extra info about the kernel pointer). For the special cases we can use LOG_PARAM_WITH to explicitly define the formatter function to be used. This is necessary for overlapped type uses. E.g. bunches of things use a cl_bitfield which is a cl_uint I think. So we have to nudge the overloading process in the right direction. Another example are error code values (both as parameters and return values). They are a cl_int, but we extra decoding to the enum symbol.

The API call argument is a std::function. Newer compilers should solve this since lambdas are becoming more and more common. (At least that's my working assumption.) Extra work that traces other things (e.g. host times) can be buried into this. Note, we could lift that effort into the logCall function as well though.

SINGLE LINE OUTPUT:
Though not strictly necessary for this same refactor I would proposes adjusting the call log syntax to be one line per call (except special calls that emit more information). We might have

API_CALL ::= IDENTIFIER '(' PARAMS? ')'
PARAMS ::= PARAM ',' PARAMS | PARAM
PARAM ::= IDENTIFIER ' = ' PARAM_VALUE
PARAM_VALUE ::= whatever is returned by the formatter chosen

Example output would be

>>>> clGetPlatformInfo(platform = 0x12345 [Intel (R) OpenCL], param_name = 0x0900 [ CL_PLATFORM_PROFILE ], param_value_size = 64, param_value = "FULL_PROFILE", param_value_size_ret = 13) => CL_SUCCESS

We could omit the parameter names for brevity if preferable

>>>> clGetPlatformInfo(0x12345 [Intel (R) OpenCL], 0x0900 [ CL_PLATFORM_PROFILE ], 64, "FULL_PROFILE",13) => CL_SUCCESS

This "single line" output has two implications.

  1. All parameters in OpenCL are input or output. None are both (inout in MS nomenclature). Otherwise, we'd have to emit before and after values. I think this holds true for OCL and we are safe here. Even if it did have one, it's a solvable problem with some sort of diffing syntax.

  2. The other implication is that we can't emit the API call until it returns (since we don't know the values to decode until on the way out). So if something bad happens (crash in the victim driver), the log will have incomplete information. An easy solution is to emit API call name (flush the output) and then delegate to the real API only after then. Then if it blows up the log at least has the function name (though not the parameters). A more sophisticated extension would be to chain SIGSEGV/SIGBUS handlers on Posix and use Windows structured exception handling (SEH), and emit the input parameters only upon crash (re-raising the signal after processing). Note, SEH doesn't work on MinGW, but maybe that's not a big deal.

Any information the apiCall lambda wants to log (to be emitted right after the call) could be written to a std::ostream passed into the lambda. This would enable additional information pertinent to the call to be listed after the API log part is done.

E.g. for clBuildProgram

 >>>> clBuildProgram(....) returned ...
 wrote file foo/bar/baz/CLI_0000_1AFFEF_0000.cl

The second line presumably comes from the lambda body.

 logCall(
     "clBuildProgram",
     [&] (std::ostream &os) {
        ....
                cl_int  retVal = pIntercept->dispatch().clBuildProgram(
                    program,
                    num_devices,
                    device_list,
                    options,
                    pfn_notify,
                    user_data );

              os << "wrote file " .... // <<<<< HERE
              ....

              return retVal;             
     }

Just to follow up on this (NOTE: we decided against this "functional" approach for the moment, but I wanted to record the gist of the idea). My experiments demonstrated a way to remove a lot of boilerplate code in the intercept calls. As a full example we could have clCreateKernel as something like the following.

CL_API_ENTRY cl_kernel CL_API_CALL CLIRN(clCreateKernel)(
    cl_program program,
    const char* kernel_name,
    cl_int* errcode_ret )
{
    INTERCEPT_OBJECT_CREATION(
        clCreateKernel, (program, kernel_name, errcode_ret ),
        CALL_LOG_PARAM( program ),
        CALL_LOG_PARAM( kernel_name ),
        CALL_LOG_PARAM_STATUS( errcode_ret ),
        CALL_LOG_BODY {
            // CPU_PERFORMANCE_TIMING_START();    automated (wrap the inner lambda), but this does then absorb the cost of everything in this block
            cl_kernel   retVal = NULL;

            if( pIntercept->config().OverrideBuiltinKernels )
            {
                retVal = pIntercept->createBuiltinKernel(
                    program,
                    kernel_name,
                    errcode_ret );
            }

            if( retVal == NULL )
            {
                retVal = pIntercept->dispatch().clCreateKernel(
                  program,
                  kernel_name,
                  errcode_ret );
            }

            // could be moved to a generic onObjectCreate(cl_kernel) handler (which would be called by interceptCreate (deduced by via return type inference)
            if( retVal != NULL )
            {
                pIntercept->addKernelName(
                    retVal,
                    kernel_name );
                if( pIntercept->config().PreferredWorkGroupSizeMultipleLogging )
                {
                    pIntercept->logPreferredWorkGroupSizeMultiple(
                        &retVal,
                        1 );
                }
            }

            // ADD_OBJECT_ALLOCATION( retVal ); // automated interceptCreate can infer this

            // CPU_PERFORMANCE_TIMING_END();     automated (wrap the inner lambda)

            return retVal;
        }
    )

An implementation of the macro INTERCEPT_OBJECT_CREATION could be the following.

#define INTERCEPT_OBJECT_CREATION( FUNCTION, ACTUAL_PARAMETERS, ... ) \
   CLIntercept*    pIntercept = GetIntercept(); \
   if ( pIntercept ) \
   { \
       CHECK_ERROR_INIT( errcode_ret ); \
       auto retVal = pIntercept->interceptCreate< decltype(FUNCTION ACTUAL_PARAMETERS) > ( \
           #FUNCTION, \
           __VA_ARGS__); \
       CHECK_ERROR( errcode_ret[0] ); \
       return retVal; \
   } \
   else \
   { \
       return dummyDispatch. FUNCTION ACTUAL_PARAMETERS ; \
   }

In the end we decided against the above judging it to be too invasive. This specific enhancement is trying to formalize the call log syntax, not remove boilerplate code or simplify the implementation in any way. In addition, we would have to be very careful we wouldn't break the current semantics. Maybe we can revisit this later.

A more conservative approach is to retain the type driven features of the above and simply edit calls to CALL_LOGGING_ENTER and CALL_LOGGING_EXIT. The output syntax will then be determined uniformly by parameter type only.

  1. Scalar types (non-pointer) are just reported as value. Signed will be reported base 10 and unsigned in base 16. We may make a special exception for char * and emit it as a string.
  2. Special built-in OpenCL types like cl_kernel and cl_program can get addition decoding treatment.
  3. Array values passed in are linked to the pertinent values that tell the API their supported and actual lengths.
  4. We also have "escape hatches" for exceptional situations (e.g. where a cl_int represents a status code or a void* to clSetKernelArgument needs additional decoding logic).