jhauberg/comply

Lack of pre-processing can trigger violations in unexpected ways

Closed this issue · 3 comments

An example for too-many-params:

screen shot 2018-04-12 at 12 07 22

Occurs for this piece of code:

#define DOUBLE_ROUND(v0,v1,v2,v3)		\
	HALF_ROUND(v0,v1,v2,v3,13,16);		\
	HALF_ROUND(v2,v1,v0,v3,17,21);		\
	HALF_ROUND(v0,v1,v2,v3,13,16);		\
	HALF_ROUND(v2,v1,v0,v3,17,21);

The match doesn't stop until it reaches the first semi-colon, so it thinks there's more parameters than there actually is (in the macro). I think the better solution is to just avoid checking stuff that is pre-processed (at least in this particular case).

It may be worth adding another stripping step that removes #defines. Alternatively, making this rule ignore matches inside #define directives.

However... Anything related to pre-processed code is just real complicated. It's not as simple as "just ignoring pre-processor directives", because those macros can be used elsewhere in all kinds of ways that do not fit with what our patterns expect for proper C.

Here's a complicated example:

screen shot 2018-04-12 at 12 15 27

#define DEFINE_GETTERS(name, action, ...) \
int lily_##name##_boolean(__VA_ARGS__) \
{ return source  action->value.integer; } \
uint8_t lily_##name##_byte(__VA_ARGS__) \
{ return source  action->value.integer; } \
lily_bytestring_val *lily_##name##_bytestring(__VA_ARGS__) \
{ return (lily_bytestring_val *)source  action->value.string; } \
lily_container_val *lily_##name##_container(__VA_ARGS__) \
{ return source  action->value.container; } \
lily_coroutine_val *lily_##name##_coroutine(__VA_ARGS__) \
{ return source  action->value.coroutine; } \
double lily_##name##_double(__VA_ARGS__) \
{ return source  action->value.doubleval; } \
lily_file_val *lily_##name##_file(__VA_ARGS__) \
{ return source  action->value.file; } \
FILE *lily_##name##_file_raw(__VA_ARGS__) \
{ return source  action->value.file->inner_file; } \
lily_function_val *lily_##name##_function(__VA_ARGS__) \
{ return source  action->value.function; } \
lily_hash_val *lily_##name##_hash(__VA_ARGS__) \
{ return source  action->value.hash; } \
lily_generic_val *lily_##name##_generic(__VA_ARGS__) \
{ return source  action->value.generic; } \
int64_t lily_##name##_integer(__VA_ARGS__) \
{ return source  action->value.integer; } \
lily_string_val *lily_##name##_string(__VA_ARGS__) \
{ return source  action->value.string; } \
char *lily_##name##_string_raw(__VA_ARGS__) \
{ return source  action->value.string->string; } \
lily_value *lily_##name##_value(__VA_ARGS__) \
{ return source  action; } \

DEFINE_GETTERS(arg, ->call_chain->start[index], lily_state *source, int index)
DEFINE_GETTERS(as, , lily_value *source)

Here's a regex that might work for stripping pre-processor directives like #define:

\#(?:[^\\\n]|\\[^\n]|\\\n)+\n

source: https://www.lemoda.net/c/c-regex/

However, i'm not sure just stripping these would really be all that useful. In the above example, the usage of the DEFINE_GETTERS macro is actually the bigger issue.

Additionally, stuff might hide inside the pre-processed block that is of interest to parse (e.g. prefer-stdint).

Particularly rules using function patterns could just be improved to match prototypes and signatures better.

Though it still fails some cases, here's an improved regex for finding function signatures and prototypes. It assumes that the input source has been stripped of function bodies to weed out most false-positives:

(?P<name>\b\w*)\s*\((?P<params>[^&%^?#@!/<>=+\-{};]*)\)(?=\s*[{;])

Problems:

  • Matches if, for and similar expressions
    • mitigated by having stripped function bodies- would still match on outer code, though
    • pattern could be extended to exclude these keywords:
(?P<name>\b\w*)(?<!if)\s*\((?P<params>[^&%^?#@!/<>=+\-{};]*)\)(?=\s*[{;])

  • Matches macros in some cases, for example:

screen shot 2018-04-18 at 13 30 36

This particular case is pretty bad, because it results in missing the actually interesting function, rather than just having an additional bad hit.

  • We want to allow macros in function definitions (e.g. void func(int param_a, MY_MACRO(param_b));), and the occasional const enum suit (* const suits)[PLAYER_SUITS] parameter so we can't just add () to the excluded characters of the params group- that would otherwise fix it:
(?P<name>\b\w*)\s*\((?P<params>[^&%^?#@!/<>=+\-{}();]*)\)(?=\s*[{;])

UPDATE:

This pattern seems successful in only matching signatures and prototypes, given the rule that both of those are always preceded by a type or *, and are not preceded by a return:

(?<!return)(?:(?<=[\w*])\s+|\*)(?P<name>\w+)\s*\((?P<params>[^&%^?#@!/<>=+\-{};]*)\)(?=\s*[{;])

(This pattern shouldn't even need function bodies to be stripped first)

Closing for now. It will likely come up again when introducing new rules.