logpai/Drain3

Previously trained messages cannot always be matched

Impelon opened this issue · 7 comments

Observe the following MWE (using the current version 0.9.7):

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("training4Model start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'training4Model start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_template_changed', 'cluster_id': 1, 'cluster_size': 2, 'template_mined': '<*> start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel stop', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
False

Since loadModel start was previously passed to parser.add_log_message, it should be possible to match it later in "inference" mode.
I would expect previously trained messages to always match later on.
As illustrated above, this cannot be guaranteed for arbitrary messages.


I've analyzed what is happening by looking through the source code and arrived at the following explanation:

  • When parser.match("loadModel start") is called, there are 2 clusters present with the following templates:
    1. <*> start: This is the cluster that includes the first occurrence of loadModel start.
    2. loadModel stop
  • Thus, when tree_search is called, loadModel matches the prefix of the 2nd template, and the 1st template is completely disregarded. (line 131)
  • Obviously loadModel start does NOT exactly match loadModel stop, so no match is found (line 134); a sketch of this lookup follows below.
    https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L121-L135
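
For illustration only, here is a minimal sketch of that lookup as I read drain.py; the nested dicts below stand in for Drain's prefix tree and are not the library's actual code or data structure:

# Minimal sketch (not Drain3's actual code) of the lookup described above.
# Cluster 1 was filed under the wildcard key because "training4Model" contains
# a digit; cluster 2 sits under the literal first token "loadModel".
tree = {
    2: {                                  # first level: token count
        "<*>": ["<*> start"],             # cluster 1
        "loadModel": ["loadModel stop"],  # cluster 2
    }
}

def sketch_match(message):
    tokens = message.split()
    branch = tree.get(len(tokens), {})
    templates = branch.get(tokens[0])
    if templates is None:
        # the wildcard branch is consulted only when no exact-token branch exists
        templates = branch.get("<*>", [])
    # match() requires every non-wildcard token to be identical
    for template in templates:
        if all(t == "<*>" or t == m for t, m in zip(template.split(), tokens)):
            return template
    return None

print(sketch_match("loadModel start"))  # None - the "<*> start" cluster is never reached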

The reason the first template was modified to include a wildcard (which later prevents the log message from matching) is that training4Model includes a number and will thus be replaced by a wildcard if no other prefix matches.
See: https://github.com/IBM/Drain3/blob/6b724e7651bca2ac418dcbad01602a523a70b678/drain3/drain.py#L196-L199
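
For reference, the linked check boils down to testing whether the token contains any digit; a quick sketch of the idea:

def has_numbers(token: str) -> bool:
    # any digit anywhere in the token marks it as a parameter candidate
    return any(ch.isdigit() for ch in token)

has_numbers("training4Model")  # True  -> filed under the wildcard branch
has_numbers("trainingModel")   # False -> filed under the literal token
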
The following example therefore works perfectly fine:

>>> from drain3 import *
>>> parser = TemplateMiner()
>>> parser.add_log_message("trainingModel start")
{'change_type': 'cluster_created', 'cluster_id': 1, 'cluster_size': 1, 'template_mined': 'trainingModel start', 'cluster_count': 1}
>>> parser.add_log_message("loadModel start")
{'change_type': 'cluster_created', 'cluster_id': 2, 'cluster_size': 1, 'template_mined': 'loadModel start', 'cluster_count': 2}
>>> parser.add_log_message("loadModel stop")
{'change_type': 'cluster_template_changed', 'cluster_id': 2, 'cluster_size': 2, 'template_mined': 'loadModel <*>', 'cluster_count': 2}
>>> parser.match("loadModel start") is not None # This message was seen during training, so it should match.
True

I'm not sure how this should be optimally handled.

Indeed an interesting edge case. I am not sure how acute it is, since another call to parser.add_log_message("loadModel start") would fix it too, and mostly you get many logs of the same type, so you could say that match() is eventually consistent. But I guess it depends on the use case.

I committed two possible ways to solve it:

  1. Option to disable treating numeric tokens as parameters. This will make Drain treat training4Model as a non-parametric token, just like trainingModel. But this is only a partial fix, because the same issue can also occur if the masker decides to parameterize some token. Nevertheless, I think parametrize_numeric_tokens should be disabled if you have good masking instruction coverage.
  2. A more complete solution - added a strategy argument to the match() function, to fall back to a full cluster search in case of no match. This may cost some match() performance, but is good for cases in which match accuracy is super important. It's possible to implement a recursive tree_search (first try an exact token match, then fall back to a wildcard match). This would be both accurate and more efficient than the linear full search, but I am unable to implement it now due to lack of time. (A usage sketch of both options follows below.)
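
For illustration, a minimal usage sketch of both options; it assumes the numeric-token switch is exposed as TemplateMinerConfig.parametrize_numeric_tokens and that match() accepts a full_search_strategy argument ("never"/"fallback"/"always") - please check the README for the authoritative names and defaults:

from drain3 import TemplateMiner
from drain3.template_miner_config import TemplateMinerConfig

config = TemplateMinerConfig()
# Option 1: keep tokens containing digits (e.g. "training4Model") as literal tokens
config.parametrize_numeric_tokens = False

miner = TemplateMiner(config=config)
miner.add_log_message("training4Model start")
miner.add_log_message("loadModel start")
miner.add_log_message("loadModel stop")

# Option 2: fall back to a full cluster scan only when the tree search finds nothing
cluster = miner.match("loadModel start", full_search_strategy="fallback")
print(cluster is not None)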

WDYT?

Thank you for the quick response and improvements made!
It's great to have more flexibility.

  1. Option to disable treating numeric tokens as parameters. [...] Nevertheless, I think parametrize_numeric_tokens should be disabled if you have good masking instruction coverage.

I believe this would indeed solve the problem in my case.

  2. A more complete solution - added a strategy argument to the match() function, to fall back to a full cluster search in case of no match. This may cost some match() performance, but is good for cases in which match accuracy is super important.

In my case, it is important that I don't modify existing templates after the training phase. The most I could do in case match() fails is to add a new template. For me, the new fallback strategy would be a good fit, just as an extra precaution in case setting parametrize_numeric_tokens = False does not resolve everything.

Just as an extra note, I think the current documentation for the always strategy is a bit misleading:
https://github.com/IBM/Drain3/blob/8c1e94d74b682b38d18b8091480c58f762acd491/drain3/drain.py#L357-L359
As far as I understand, it suffers from the same problem as the fallback strategy, in the sense that a

non-optimal match with more wildcard parameters than necessary

can be found, and this is even more likely here since a restricted search was not run before.

Another thing to note is that fast_match does not necessarily require a list of IDs, just a sequence.
So just passing self.id_to_cluster.keys() will be more memory-efficient and slightly faster because no list is created, but I suspect there are few situations where this actually makes a difference.
https://github.com/IBM/Drain3/blob/8c1e94d74b682b38d18b8091480c58f762acd491/drain3/drain.py#L373-L376
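
A tiny illustration of that point: dict.keys() is already an iterable view, so it can be passed along directly instead of materializing a new list on every call:

id_to_cluster = {1: "cluster-1", 2: "cluster-2", 3: "cluster-3"}

ids_view = id_to_cluster.keys()   # no copy, stays in sync with the dict
ids_list = list(id_to_cluster)    # extra O(n) allocation each time

for cluster_id in ids_view:       # any code that only iterates the IDs works the same
    _ = id_to_cluster[cluster_id]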

always should always return the best match, as opposed to fallback: when the tree search does find a match, fallback returns the best match only among the clusters in that leaf node, whereas always compares all clusters with the same sequence length, starting from the root. Tried to clarify that in the docs.
Thanks for the comment regarding list cast - removed it.

I see now I have a bug in always, working on a fix.

Fixed.

always should always return the best match, as opposed to fallback: when the tree search does find a match, fallback returns the best match only among the clusters in that leaf node, whereas always compares all clusters with the same sequence length, starting from the root. Tried to clarify that in the docs.

Got it now! Thanks again for the quick fixes.

Changes are available at v0.9.8, released today.