CESNET/ipfixcol2

Plugin to merge IPFIX data

cicciob95 opened this issue · 8 comments

Is there a plugin that merges the IPFIX statistics of different packets related to the same flow but coming from different observation points?

Hi,

unfortunately, there is no plugin for this.

What you are trying to achieve is not entirely trivial to implement generically on the collector side. It would require implementing a separate flow cache in the collector and instructing the collector how to merge the individual flows. Since the form of the flow records exported by different exporters may vary, the collector cannot generally know which flow aggregation key was used on the probe (unless it is the standard 5-tuple), nor which flow fields should be merged and how.
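To illustrate what such a general implementation would involve, here is a minimal, hypothetical sketch of a collector-side flow cache. It assumes every exporter aggregates on the standard 5-tuple, that records arrive as JSON with `iana:`-prefixed field names and numeric (unix) timestamps, and that summing counters and widening the time window are valid merge rules. None of these assumptions holds in general, which is exactly why a universal plugin is hard:

```python
# Hypothetical collector-side flow cache (illustration only, not an existing plugin).
# Assumes: 5-tuple aggregation key, iana:-prefixed JSON field names, numeric (unix)
# timestamps, and that summing counters is a valid merge rule for every exporter.

class FlowCache:
    def __init__(self):
        self.cache = {}  # 5-tuple key -> merged flow record (dict)

    def merge(self, rec):
        key = (rec["iana:sourceIPv4Address"], rec["iana:destinationIPv4Address"],
               rec["iana:sourceTransportPort"], rec["iana:destinationTransportPort"],
               rec["iana:protocolIdentifier"])
        cached = self.cache.get(key)
        if cached is None:
            self.cache[key] = dict(rec)  # first record for this flow, keep a copy
            return
        # The merge rules below (sum counters, widen the time window) are the
        # simplest possible policy; other fields would need per-field rules.
        cached["iana:octetDeltaCount"] += rec["iana:octetDeltaCount"]
        cached["iana:packetDeltaCount"] += rec["iana:packetDeltaCount"]
        cached["iana:flowStartMilliseconds"] = min(cached["iana:flowStartMilliseconds"],
                                                   rec["iana:flowStartMilliseconds"])
        cached["iana:flowEndMilliseconds"] = max(cached["iana:flowEndMilliseconds"],
                                                 rec["iana:flowEndMilliseconds"])
```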

Lukas

Perfect @Lukas955
Thanks for the precise explanation. Then I will try to implement a plugin that will correctly merge the collected statistics. Is there any wiki on how to implement a plugin for ipfixcol?

Unfortunately, there is no manual on how to write an intermediate plugin at the moment.

Before we start describing how to write a new module, I'm interested in your use case: how do you use the collector, and why do you need to merge flows in this way? Isn't it possible to do this later, when querying over the flows?

The idea is to merge the statistics collected by different switches in various ways (e.g. extracting a time series of statistics for each flow) in order to analyze the flows with an ML/DL model. With pull-based logic it might be possible to do this when the flows are requested, but I prefer push-based logic so that only a small amount of data is transferred, which can then be preprocessed and analyzed directly. Furthermore, since I generate both normal and malicious traffic in a testbed to actually test the models, this second approach would also let me save the records directly with another output plugin, so that if the models do not perform well I can retrain or improve them.

Sorry for the late reply.

If I understand it correctly, what you need is flow deduplication, i.e. the same flow record reported by multiple switches should appear at the collector output only once. Am I right? Do you perhaps also need to extract and merge some fields from the individual flows?

By the way, how do you plan to work with the data? Are you using e.g. JSON output or something else?

Hi @Lukas955,
you understood correctly, but I don't need to extract and merge fields from individual flows; rather, I want to analyze individual flows independently of the switches that handle them.

As for the output, I still haven't decided which format to use, as it is currently of secondary importance. Since the ML models will likely be consumed by a REST service, the processed data will probably be sent to the service in a JSON body.

From my point of view, the easiest (and fastest) approach at the moment is probably to make a prototype flow deduplicator in Python that receives flow records in JSON form.

The JSON output of the collector can be configured (see the detailedInfo option) to include information about the exporter, i.e. its IP address, Observation Domain ID (ODID), etc. The deduplicator tool would then pass on only unique flow records. The algorithm could store the flow key (i.e. src/dst IP, src/dst port, and protocol) in a dictionary, remembering where each record came from (exporter + ODID) and when it was seen (i.e. the start and end timestamps of the record). If a record with the same key arrives from a different exporter (switch) within the same timeframe, it is likely a duplicate.
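A minimal sketch of such a deduplicator is below. It reads one JSON record per line from stdin and prints only the first occurrence of each flow. The exact field names (the `iana:` prefix, and `ipfix:srcAddr` / `ipfix:odid` from detailedInfo), numeric unix timestamps, and IPv4-only keys are assumptions made for brevity; adjust them to the actual records your collector emits:

```python
#!/usr/bin/env python3
# Flow-deduplicator sketch: reads one JSON flow record per line on stdin
# (e.g. piped from the collector's JSON output) and prints only the first
# occurrence of each flow. Field names and unix timestamps are assumptions
# about the configured JSON output; IPv4-only keys are used for brevity.
import json
import sys

TOLERANCE_MS = 1000  # allowed start/end slack between exporters seeing the same flow

seen = {}  # flow key -> list of (exporter identity, start_ms, end_ms)

def flow_key(rec):
    return (rec["iana:sourceIPv4Address"], rec["iana:destinationIPv4Address"],
            rec["iana:sourceTransportPort"], rec["iana:destinationTransportPort"],
            rec["iana:protocolIdentifier"])

def is_duplicate(rec, key):
    start = rec["iana:flowStartMilliseconds"]
    end = rec["iana:flowEndMilliseconds"]
    src = (rec.get("ipfix:srcAddr"), rec.get("ipfix:odid"))  # present with detailedInfo
    for other_src, other_start, other_end in seen.get(key, []):
        same_window = (abs(start - other_start) <= TOLERANCE_MS
                       and abs(end - other_end) <= TOLERANCE_MS)
        if same_window and other_src != src:
            return True  # same flow already reported by a different exporter
    seen.setdefault(key, []).append((src, start, end))
    return False

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    rec = json.loads(line)
    if not is_duplicate(rec, flow_key(rec)):
        print(line)
```

Note that a real implementation would also have to expire old entries from `seen` (e.g. once a flow's end timestamp is sufficiently far in the past), otherwise memory grows without bound.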

Hi @Lukas955
Thanks for the advice. Now I will consider how to proceed.