Prepostprocessor for a new dataset

Question

hallavar opened this issue 2 years ago · 1 comments

Hello,

I'm trying to adapt your solution to a new dataset based on the CICFlowMeter feature extractor.

I would like to know what the expected outputs of the preprocess and postprocess steps are.

I already have my data in the format required by DoppelGANger, so I was thinking about just loading my data in these functions.

In the output of preprocess, before it is passed to the training step, should all my attributes be continuous, or should I keep some discrete attributes that will be managed by word2vec? If so, where do I indicate which attributes are continuous and which are discrete?

Also, in your Zeek example, you didn't change any arguments in the PrePostProcessor config field of the config file:

  "pre_post_processor": {
      "class": "ZeeklogPrePostProcessor",
      "config": {
          "norm_option": 0,
          "split_name": "multichunk_dep_v2",
          "df2chunks": "fixed_time",
          "full_IP_header": true,
          "encode_IP": "bit"
      }
  }

Wouldn't that be a problem somewhere else down the pipeline?

If so, could you please provide some info on what each argument does? If changing them is not mandatory, we can just keep them as they are.

Thanks in advance.

Answer 1 · 2023-07-27T02:13:34.000Z

Hi, thanks for your interest in our work. Not sure if you noticed but we recently refactored our code to PyTorch and are still actively working on that -- feel free to try it out.

Regarding your questions,

Here is a brief guide on handling different types of variables (e.g., continuous or discrete). Here is a more detailed example applied to a concrete dataset.
Using the default config should be fine for most cases. We will try to add more info on the different arguments in a future README.
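
To make the continuous/discrete distinction concrete, here is a rough, generic sketch of what a preprocess/postprocess pair might look like. It is not NetShare's actual API or config schema: the field names (`flow_duration`, `total_fwd_packets`, `protocol`), the `FIELD_TYPES` mapping, and both functions are hypothetical, and discrete values are mapped to plain integer ids as a stand-in for an embedding such as word2vec.

```python
# Hypothetical field spec for illustration only; not NetShare's config schema.
FIELD_TYPES = {
    "flow_duration": "continuous",
    "total_fwd_packets": "continuous",
    "protocol": "discrete",
}


def preprocess(rows, field_types):
    """Scale continuous fields to [0, 1] and map discrete values to integer ids."""
    stats = {}
    for name, kind in field_types.items():
        values = [row[name] for row in rows]
        if kind == "continuous":
            stats[name] = (min(values), max(values))  # kept for inversion
        else:
            stats[name] = sorted(set(values))  # vocabulary, index = id

    encoded = []
    for row in rows:
        out = {}
        for name, kind in field_types.items():
            if kind == "continuous":
                lo, hi = stats[name]
                # Constant columns would divide by zero, so map them to 0.0.
                out[name] = 0.0 if hi == lo else (row[name] - lo) / (hi - lo)
            else:
                out[name] = stats[name].index(row[name])
        encoded.append(out)
    return encoded, stats


def postprocess(encoded, field_types, stats):
    """Invert preprocess: rescale continuous fields and look up discrete ids."""
    decoded = []
    for row in encoded:
        out = {}
        for name, kind in field_types.items():
            if kind == "continuous":
                lo, hi = stats[name]
                out[name] = lo + row[name] * (hi - lo)
            else:
                out[name] = stats[name][int(row[name])]
        decoded.append(out)
    return decoded


rows = [
    {"flow_duration": 10.0, "total_fwd_packets": 2, "protocol": "tcp"},
    {"flow_duration": 30.0, "total_fwd_packets": 6, "protocol": "udp"},
]
encoded, stats = preprocess(rows, FIELD_TYPES)
restored = postprocess(encoded, FIELD_TYPES, stats)
```

The point of the sketch is only that the type of each field is declared once and both directions read from the same declaration, so postprocess can undo exactly what preprocess did. Where that declaration lives in the real pipeline is covered by the guide mentioned above.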

Let us know if you have further questions and reopen the issue if necessary.