CN-TU/go-flows

How to properly use the CSV input feature

dcferreira opened this issue · 1 comment

I'm having some problems with labeled data using go-flows.

I have some pcaps, and a CSV file which states the existing attacks in the pcaps, along with their timestamp/IPs/ports/protocol.
I wanted to use the __label feature of go-flows to label the flows in the pcaps.

Here's what I did:

  1. Run go-flows in per-packet mode, so that I get a list of packets with enough identifying features to find the matching attack in the attack CSV file.
    To do this, I ran:
go-flows offline -stats -perpacket -filter "tcp or udp or icmp" features /features.json export csv /shared/exported.csv input INPUT_PCAPS

which generates an exported.csv file with one packet per line.

  2. Run a script which, for each packet in exported.csv, looks up whether it belongs to an attack in the original CSV file, and prints either "Normal" or its attack class.
    Save this output to a new CSV file, out.csv.
    This file has one class per line, plus a header line.
    It has exactly as many lines as exported.csv, meaning one line per packet.

  3. Run go-flows again, with the same pcaps as in the beginning and out.csv as CSV input for the labels (features.json here includes a feature that uses __label):

go-flows offline -stats -filter "tcp or udp or icmp" features /features.json export csv /shared/exported_flows.csv input /out.csv INPUT_PCAPS
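
For reference, the labeling script (the step between the two go-flows runs) could look roughly like the sketch below. This is only an illustration: every column name (src_ip, dst_ip, src_port, dst_port, proto, timestamp, start, end, label) is a hypothetical placeholder, timestamps are assumed to be plain numbers, and the "label" header written to the output is an assumption, not something taken from the go-flows documentation.

```python
import csv
import io

# All column names below are hypothetical; adapt them to the real files.
KEY_FIELDS = ("src_ip", "dst_ip", "src_port", "dst_port", "proto")


def load_attacks(attack_file):
    """Index attacks by 5-tuple; each key maps to (start, end, label) entries."""
    attacks = {}
    for row in csv.DictReader(attack_file):
        key = tuple(row[f] for f in KEY_FIELDS)
        interval = (float(row["start"]), float(row["end"]), row["label"])
        attacks.setdefault(key, []).append(interval)
    return attacks


def label_packets(packet_file, attacks, out_file):
    """Write a header plus one label per packet, in the order the packets are read."""
    writer = csv.writer(out_file)
    writer.writerow(["label"])  # assumed header name
    for row in csv.DictReader(packet_file):
        key = tuple(row[f] for f in KEY_FIELDS)
        ts = float(row["timestamp"])
        label = "Normal"
        for start, end, name in attacks.get(key, []):
            if start <= ts <= end:
                label = name
                break
        writer.writerow([label])


# Tiny in-memory example with made-up data.
attack_csv = io.StringIO(
    "src_ip,dst_ip,src_port,dst_port,proto,start,end,label\n"
    "10.0.0.1,10.0.0.2,1234,80,6,100,200,DoS\n"
)
packet_csv = io.StringIO(
    "timestamp,src_ip,dst_ip,src_port,dst_port,proto\n"
    "150,10.0.0.1,10.0.0.2,1234,80,6\n"
    "250,10.0.0.1,10.0.0.2,1234,80,6\n"
)
out_csv = io.StringIO()
label_packets(packet_csv, load_attacks(attack_csv), out_csv)
labels = out_csv.getvalue().splitlines()
```

Note that the labels come out in the same order as the rows of the packet file, so they only line up with the second go-flows run if that row order matches the order in which packets are read from the pcaps.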

I expected exported_flows.csv to contain flows with correct labels, but they all seem to be incorrect.

Is this behavior expected?
Is there some randomness that I'm not accounting for, or did I misunderstand something in how the labels in CSV file work?

notti commented

Beware that the output order of the exporter is not deterministic.
-> Before the second run, you have to sort the CSV file.
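
A minimal sketch of such a sorting step, assuming the per-packet export contains a numeric timestamp column (the column name "timestamp" is hypothetical) and that the packets in the pcaps are themselves in timestamp order. The sort has to happen on the per-packet file before the labels are generated, so that label lines follow packet order.

```python
import csv
import io


def sort_csv_by_column(in_file, out_file, column="timestamp"):
    """Rewrite a CSV with its data rows sorted numerically by one column."""
    reader = csv.DictReader(in_file)
    fieldnames = reader.fieldnames  # reads the header line
    if fieldnames is None:  # empty input, nothing to write
        return
    # sorted() is stable: rows with equal timestamps keep their export order,
    # which may still differ from the pcap order.
    rows = sorted(reader, key=lambda r: float(r[column]))
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Tiny in-memory example with made-up data.
unsorted_csv = io.StringIO("timestamp,src_ip\n3.0,c\n1.5,a\n2.0,b\n")
sorted_csv = io.StringIO()
sort_csv_by_column(unsorted_csv, sorted_csv)
sorted_lines = sorted_csv.getvalue().splitlines()
```

If several packets share the same timestamp, sorting on the timestamp alone cannot restore their original order, so a finer-grained or additional sort key may be needed.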