logpai/logparser

How to represent variable log input format

haraldott opened this issue · 1 comment

Dear authors,

First of all, thanks for your work.
This is more of a question than an issue, but here goes:

I'm trying to parse Cloud Foundry logs using, for example, Drain or Spell.
I want to adapt your implementation to be as general as possible for my Master's thesis and use it for the parsing step.
Unfortunately, the experiments you conducted use only 2,000 lines of log output, which bears little relation to a realistic use case with 100k to 1M lines.

I am using the log format that you provide in the benchmark files for Cloud Foundry:
<Logrecord> <Date> <Time> <Pid> <Level> <Component> \[<ADDR>\] <Content>
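
For readers unfamiliar with this notation, a format string like the one above is typically expanded into a regular expression with one named group per `<Field>`. The following is a minimal sketch of that idea, not necessarily the library's exact code; the function name `format_to_regex` and the sample log line are hypothetical.

```python
import re

def format_to_regex(log_format):
    """Turn '<Date> <Time> ...' into a regex with one named group per field."""
    headers = []
    pattern = ''
    # Split on <Field> tokens; the capture group keeps them in the result,
    # so even indices are literal text and odd indices are field tokens.
    for i, part in enumerate(re.split(r'(<[^<>]+>)', log_format)):
        if i % 2 == 0:
            # Literal text between fields: let any run of spaces match.
            pattern += re.sub(r' +', r'\\s+', part)
        else:
            name = part.strip('<>')
            headers.append(name)
            pattern += '(?P<%s>.*?)' % name
    return headers, re.compile('^' + pattern + '$')

LOG_FORMAT = r'<Logrecord> <Date> <Time> <Pid> <Level> <Component> \[<ADDR>\] <Content>'
headers, line_re = format_to_regex(LOG_FORMAT)

# Hypothetical log line, made up for illustration only.
sample = ('nova-api.log 2017-05-16 00:00:00.008 25746 INFO nova.osapi '
          '[req-1 - - -] GET /v2/servers/detail status: 200')
match = line_re.match(sample)
fields = match.groupdict() if match else {}
```

Only the `Content` field is then passed on to template extraction, so the regex list discussed in this thread applies to that field alone under this sketch.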
The regex list provided there does not work well and produces a badly bloated template file.
So I'm using the following list, to which I added some patterns of my own:

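regex = [
    r'((\d+\.){3}\d+,?)+',  # one or more IPv4-like addresses, optionally comma-terminated
    r'/.+?\s',  # from a slash up to the next whitespace character
    r'\d+',  # any run of digits
    r'\[.*?\]',  # shortest bracketed group
    r'\[.*\]',  # longest bracketed span (greedy)
    r'\[.*\] \[.*\]',  # two bracketed spans separated by a space
    r'(/|)([0-9]+\.){3}[0-9]+(:[0-9]+|)(:|)',  # IP
    r'(?<=[^A-Za-z0-9])(\-?\+?\d+)(?=[^A-Za-z0-9])|[0-9]+$',  # Numbers
    r'\(\/.*\)',  # parenthesized text starting with a slash
]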
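
Assuming these patterns are applied one after another, with each match replaced by the wildcard `<*>` before the parser sees the line, both the order and the greediness of the patterns matter. The sketch below illustrates that assumption with a hypothetical helper `mask` and made-up log lines; it is not the library's code.

```python
import re

def mask(line, patterns, token='<*>'):
    """Apply each pattern in order, replacing every match with the token."""
    for pattern in patterns:
        line = re.sub(pattern, token, line)
    return line

# Specific pattern first, generic pattern second: the address is masked
# as one unit before the bare-digit pattern runs.
ordered = mask('Connected to 10.11.10.1:8080 after 3 retries',
               [r'(\d+\.){3}\d+(:\d+)?', r'\d+'])

# Greedy versus lazy bracket matching on the same (made-up) line.
greedy = mask('a [x] b [y] c', [r'\[.*\]'])
lazy = mask('a [x] b [y] c', [r'\[.*?\]'])

assert ordered == 'Connected to <*> after <*> retries'
assert greedy == 'a <*> c'          # everything between the outer brackets is lost
assert lazy == 'a <*> b <*> c'      # each bracketed group is masked separately
```

Under this assumption, a pattern that runs after `\d+` or `\[.*?\]` only sees text in which digits or bracketed groups have already been replaced, so some of the later entries in the list above may never match.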

I end up with a template file like the one attached:
openstack_val_normal_n2_templates.csv.zip

Still, as you can see for EventIds 1, 2 and 3, for example (I've replaced your MD5 EventIds with ascending numbers), the directory is not parsed well, so if my model later receives a log file with different VM instance names, this won't work.
Of course, I could add a regex for directories, but I'm not sure that's the right approach.

Do you have any suggestions on how to improve the situation?

How do you actually use your model in a more realistic use case? Do you have a larger collection of regexes?

Help would be much appreciated.

Sorry for my late reply.

I think having a regex for directories is a reasonable solution. In practice, we also use regexes a lot when we find that they solve the problem easily. Unfortunately, we haven't maintained a regex list for reuse, but we will consider doing so in the future. Thanks for your questions!
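
As one possible starting point for such a directory regex, the sketch below masks absolute Unix-style paths so that lines differing only in instance or file names collapse to the same text. The pattern `PATH_RE`, the helper `mask_paths`, and the sample lines are assumptions for illustration and would need tuning against real logs.

```python
import re

# One or more "/segment" groups. The lookbehind skips slashes that sit
# inside a word, such as the one in "HTTP/1.1".
PATH_RE = re.compile(r'(?<![\w.])(?:/[\w.\-]+)+')

def mask_paths(line, token='<*>'):
    """Replace every absolute path in the line with the wildcard token."""
    return PATH_RE.sub(token, line)

# Two made-up lines that differ only in the last path segment.
a = mask_paths('Removing base file: /var/lib/nova/instances/_base/a1b2c3')
b = mask_paths('Removing base file: /var/lib/nova/instances/_base/d4e5f6')
assert a == b == 'Removing base file: <*>'

# The protocol version is left alone because its slash follows a letter.
c = mask_paths('GET /v2/servers HTTP/1.1')
assert c == 'GET <*> HTTP/1.1'
```

A pattern like this could be placed near the top of the regex list from the question, so that paths are masked as a whole before more generic patterns such as `\d+` break them apart.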