y-scope/clp

Questions about intermediate results

Closed this issue · 2 comments

Request

Great work! I have compressed and decompressed lots of logs successfully, but I have had some trouble understanding the code in depth.

  1. After a file is compressed, how do I obtain the intermediate result generated during the compression, such as the extracted template?

  2. I read logtype.dict and var.dict myself, but I found that they were all numbers, and I didn't know what they meant. In my understanding, logtypes and templates should be in string form, but I can't get these string templates at the moment.

  3. I'm very interested in figuring out what all these staging files (logtype.dict, logtype.segindex, metadata, metadata.db, var.dict, var.segindex) mean, and what the files 0, 1, 2, 3, and so on in folder s are.

Possible implementation

Can you provide some scripts to read the aforementioned files?
Also, some scripts to generate intermediate results.

Hi @liuruijin17,

Thanks for your kind words. Apologies for the delay in replying! Here are some answers to your questions:

After a file is compressed, how do I obtain the intermediate result generated during the compression, such as the extracted template?

The extracted templates (what we call logtypes) are stored in the logtype dictionary (logtype.dict).
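
To make the idea of a logtype concrete, here is a toy sketch of splitting one message into a template and its variables. It is only an illustration: the `<VAR>` placeholder and the "any token containing a digit is a variable" rule are assumptions made for this example, not CLP's actual parsing rules or on-disk encoding.

```python
import re

# Hypothetical placeholder used only in this sketch; CLP's real
# placeholders in logtype.dict are different.
VAR_PLACEHOLDER = "<VAR>"

# Toy rule (an assumption): a whitespace-delimited token that contains
# at least one digit is treated as a variable.
_TOKEN_WITH_DIGIT = re.compile(r"\S*\d\S*")


def split_message(message: str) -> tuple[str, list[str]]:
    """Return (logtype, variables) for a single log message."""
    variables: list[str] = []

    def _replace(match: re.Match) -> str:
        # Remember the variable value and leave a placeholder in the template.
        variables.append(match.group(0))
        return VAR_PLACEHOLDER

    logtype = _TOKEN_WITH_DIGIT.sub(_replace, message)
    return logtype, variables


if __name__ == "__main__":
    logtype, variables = split_message("Task 12 finished in 3.5 seconds")
    print(logtype)    # Task <VAR> finished in <VAR> seconds
    print(variables)  # ['12', '3.5']
```

Many messages share one logtype, which is why storing each logtype once in a dictionary and referring to it by ID saves space.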

I read logtype.dict and var.dict myself, but I found that they were all numbers, and I didn't know what they meant. In my understanding, logtypes and templates should be in string form, but I can't get these string templates at the moment.

The two dictionaries use a binary format and are also compressed using Zstd. We have a tool, make-dictionaries-readable, that can decode them into a human-readable format.

I'm very interested in figuring out what all these staging files (logtype.dict, logtype.segindex, metadata, metadata.db, var.dict, var.segindex) mean, and what the files 0, 1, 2, 3, and so on in folder s are.

Below is a brief description of the files in each archive. In case you haven't checked it out already, our research paper has more details about how CLP converts a log message into the format in the archives.

  • metadata - Archive-level metadata, such as the timestamp range of events in the archive.
  • metadata.db - A SQLite database containing metadata (e.g., timestamp range, original file path, etc.) about files and empty directories compressed in this archive.
  • logtype.dict - A dictionary of the logtypes for events in this archive.
  • var.dict - A dictionary of the (unencodable) variables from events in this archive.
  • All files in the s directory are segments. A segment is a Zstd-compressed file containing encoded files back-to-back.
    • An encoded file is the tuple of the timestamp column, logtype ID column, and variables (IDs and encoded variables) column, written in that order.
  • logtype.segindex - A list of the unique logtype IDs in each segment. We use this to build an in-memory index of logtype ID -> segment that can be used to improve search performance.
  • var.segindex - Similar to logtype.segindex but for dictionary variables.
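
The logtype ID -> segment index described above can be sketched roughly as follows. This is a simplified in-memory model, not CLP's implementation: the `EncodedFile` class, the function names, and the way the input is supplied are all hypothetical, and the sketch does not parse the real segment or segindex files.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EncodedFile:
    """Hypothetical model of one encoded file: three columns, in stored order."""
    timestamps: list[int] = field(default_factory=list)
    logtype_ids: list[int] = field(default_factory=list)
    variables: list[int] = field(default_factory=list)  # dictionary IDs and encoded values


def build_logtype_index(segments: dict[int, list[EncodedFile]]) -> dict[int, set[int]]:
    """Map each logtype ID to the set of segment IDs that contain it."""
    index: dict[int, set[int]] = defaultdict(set)
    for segment_id, encoded_files in segments.items():
        for encoded_file in encoded_files:
            for logtype_id in encoded_file.logtype_ids:
                index[logtype_id].add(segment_id)
    return dict(index)


def segments_to_search(index: dict[int, set[int]], logtype_ids: list[int]) -> set[int]:
    """Return only the segments that can contain any of the given logtype IDs."""
    result: set[int] = set()
    for logtype_id in logtype_ids:
        result |= index.get(logtype_id, set())
    return result
```

With an index like this, a search that matches only a few logtypes can skip decompressing every segment that does not appear in the result set. The same idea applies to var.segindex with dictionary variable IDs in place of logtype IDs.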

Can you provide some scripts to read the aforementioned files?

Assuming make-dictionaries-readable works for you, the only file with a custom format would be metadata. We can look into making this human-readable.
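
Since metadata.db is a plain SQLite database, it can be inspected with any SQLite client. As a minimal sketch, the snippet below lists the tables and columns it finds using Python's standard sqlite3 module. It makes no assumption about the schema, and the function name `list_tables` is just for this example.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> dict[str, list[str]]:
    """Map each table name in a SQLite file to its list of column names."""
    # Open read-only so that inspecting an archive never modifies it.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        return {
            name: [col[1] for col in conn.execute(f'PRAGMA table_info("{name}")')]
            for name in names
        }
    finally:
        conn.close()
```

For example, `list_tables("path/to/archive/metadata.db")` (a placeholder path) returns the table and column names, which you can then query with ordinary SELECT statements.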

Also, some scripts to generate intermediate results.

Besides the logtype, are there any other intermediate results you're missing?

Thank you very much for your careful answers. I found that most of the problems were caused by my not reading your code carefully.

I'm sorry for only replying today; I was away for the Chinese National Day holiday. Thank you again for your excellent code and comprehensive support. I will close this issue.