How to properly use the CSV input feature
dcferreira opened this issue · 1 comments
I'm having some problems with labeled data using go-flows.
I have some pcaps, and a CSV file which states the existing attacks in the pcaps, along with their timestamp/IPs/ports/protocol.
I wanted to use the __label feature of go-flows to label the flows in the pcaps.
Here's what I did:
- Run go-flows in per-packet mode, so that I have a list of packets with enough identifying features to find the matching attack in the CSV file of attacks.
To do this, I ran
go-flows offline -stats -perpacket -filter "tcp or udp or icmp" features /features.json export csv /shared/exported.csv input INPUT_PCAPS
which generates an exported.csv file with one packet per line.
- Run a script which, for each packet in exported.csv, looks up whether it is an attack in the original CSV file, and prints either "Normal" or its attack class (depending on whether or not it is found in the original CSV).
Save this output to a new CSV file, out.csv.
This just has one class per line, plus a header line.
It has exactly as many lines as exported.csv, meaning one line per packet.
- Run go-flows again, with the same pcaps as in the beginning, and out.csv as the CSV input for the labels (features.json here includes some feature that uses __label):
go-flows offline -stats -filter "tcp or udp or icmp" features /features.json export csv /shared/exported_flows.csv input /out.csv INPUT_PCAPS
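The labeling script from the second step could look roughly like the sketch below. It is only an illustration under stated assumptions: the column names passed as key_fields and class_field, and the "label" header, are hypothetical, and it matches packets to attacks by exact equality on the key columns, which is a simplification if the attack CSV describes time ranges rather than single packets.

```python
import csv


def label_rows(packet_rows, attack_rows, key_fields, class_field):
    """Return one label per packet row, in the same order as packet_rows.

    key_fields and class_field are hypothetical column names; adapt them
    to the columns actually present in both CSV files.
    """
    # Map each attack's identifying tuple to its attack class.
    attack_class = {
        tuple(row[f] for f in key_fields): row[class_field]
        for row in attack_rows
    }
    # Packets without a matching attack record are labeled "Normal".
    return [
        attack_class.get(tuple(row[f] for f in key_fields), "Normal")
        for row in packet_rows
    ]


def write_labels(labels, out_file, header="label"):
    """Write a header line followed by one label per line."""
    writer = csv.writer(out_file)
    writer.writerow([header])  # header name is an assumption
    for label in labels:
        writer.writerow([label])
```

The rows can come from csv.DictReader over exported.csv and the attack CSV. Because the labels are written in the order of the packet rows, the order of exported.csv directly determines which packet each label is applied to in the second run.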
I expected exported_flows.csv to contain flows with correct labels, but they all seem to be incorrect.
Is this behavior expected?
Is there some randomness that I'm not accounting for, or did I misunderstand something in how the labels in CSV file work?
Beware that the output order of the exporter is non-deterministic.
-> Before the second run, you have to sort the CSV file.
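A minimal sketch of that sorting step, assuming exported.csv contains a numeric per-packet timestamp column (time_field below is a hypothetical name) and that the packets in the pcaps are themselves in time order. The sort would need to happen before the labels are generated, so that out.csv follows the packet order of the input.

```python
import csv
from decimal import Decimal


def sort_csv(in_file, out_file, time_field):
    """Copy a CSV from in_file to out_file with rows sorted by time_field.

    time_field is a hypothetical column name; it is assumed to hold a
    numeric timestamp. Decimal avoids losing precision on large
    nanosecond values.
    """
    reader = csv.DictReader(in_file)
    rows = list(reader)
    # sorted() is stable: rows with equal timestamps keep their relative order.
    rows = sorted(rows, key=lambda row: Decimal(row[time_field]))
    writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames or [])
    writer.writeheader()
    writer.writerows(rows)
```

Packets that share an identical timestamp cannot be disambiguated by this sort alone, so their relative order may still differ from the order in the pcap.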