This software enables the creation of a network intrusion dataset in CSV format. You can run it on a local server to create your own dataset or use this to read a PCAP from another source and convert that to CSV format based on the attributes you pick.
This program accepts a network capture (pcap) and creates summary statistics using a tumbling window that moves through the packet stream. The resulting CSV file contains one row of data for each time segment.
This runs as a multi-processing application with 4 Python processes plus tshark.
Stage | Python Module | Explanation |
---|---|---|
Ethernet interface or pcap/pcapng | | data source for packet data |
tshark | not Python | converts to one line per packet json-sh format |
subprocess pipe | | communication between tshark and the Python program |
PacketCapture | capture.py | reads from tshark output, massages labels |
sharedQ | | communication Queue |
PacketAnalyze | detectors.py | filters using protocol detectors and protocol statistics |
servicesQ | | communication Queue |
ServiceIdentity | services.py | higher level TCP and UDP service counts |
timesQ | | communication Queue |
TimesAndCounts | counts.py | time windowing and file writer |
csv file | | feature file for model training |
tshark
captures live data or replays data from a pcap/pcapng file. It emits each packet as a line of text in its ek format. I chose that format because each record is on a single line, so no multi-line JSON assembly is required. The Python processes launch tshark and listen to its standard output.
PacketCapture
is a Python process that reads the tshark output and then transforms the data to make it more consumable. It converts the ek output to true JSON and massages some of the label styles to the JSON standard. The final text is pushed into a message queue.
PacketAnalyze
accepts the dictionary from the queue. It creates a node pair identifier, identifies the protocol, and forwards the original data, the id, and the protocol to the next stage via a queue. PacketAnalyze also captures aggregated statistics across the run. Nothing is done with those at this time and they are lost when the program exits.
ServiceIdentity
reads an id, protocol, and packet data structure. It analyzes the packet to identify the higher-level service type of the message. Examples include DNS, SMTP, FTP, TLS, HTTP, SMB, SMB2, etc. The service list is added to the incoming data set and sent to a topic.
TimesAndCounts
manages the time windows, calculates the time bucket/window statistics, and writes them to output. It reads from the inbound topic and aggregates statistics across a set of incoming packets. The statistics are retained for a single time window and are written to the CSV file, one record for each time window.
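As a rough illustration of the first stage, the sketch below launches tshark with `-T ek` and turns each output line into a dictionary. It is not the project's `capture.py`. The function names and the sample lines are made up for illustration, and the ek field names vary between Wireshark versions, so treat the keys shown as assumptions.

```python
import json
import subprocess


def parse_ek_lines(lines):
    """Yield the 'layers' dict of each packet from tshark `-T ek` style lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # ek output interleaves bulk-index action lines with packet lines;
        # the action lines carry no packet data, so skip them.
        if "index" in record:
            continue
        yield record.get("layers", {})


def read_pcap(path, tshark="tshark"):
    """Hypothetical helper: replay a capture file and yield parsed packets."""
    cmd = [tshark, "-r", path, "-T", "ek"]
    with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
        yield from parse_ek_lines(proc.stdout)


# Made-up sample lines in the general shape of ek output (field names assumed).
sample = [
    '{"index":{"_index":"packets-2021-01-15"}}',
    '{"timestamp":"1610700000000","layers":{"ip":{"ip_ip_len":"60"}}}',
]
packets = list(parse_ek_lines(sample))
```

Because every record sits on one line, the reader never has to buffer partial JSON documents, which is the property the tshark description above relies on.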
The program creates a series of adjacent, non-overlapping, Tumbling Windows. Windows are defined by their maximum time span or their maximum event count. Each packet is included in just one window.
Each window starts at the `start_time` and spans for some period of time or for some number of packets. This means windows are bound by time or bound by event counts.
Time bound windows run from `start_time` until but not including `start_time + window_width`. The `end_time` is the time of the last packet in the window.
`start_time` <= packet time < `start_time + window_width`
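The half-open interval rule for time bound windows can be written as a small predicate. This is a minimal sketch of the rule only. The function names are hypothetical and the times are assumed to be plain numbers in the same unit (for example milliseconds).

```python
def in_window(packet_time, start_time, window_width):
    """True when the packet falls in [start_time, start_time + window_width)."""
    return start_time <= packet_time < start_time + window_width


def window_index(packet_time, first_start_time, window_width):
    """Zero-based index of the time bound window that contains packet_time."""
    return int((packet_time - first_start_time) // window_width)
```

A packet that arrives exactly at `start_time + window_width` belongs to the next window, which is what keeps the windows non-overlapping.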
- A packet can be flagged as more than one service. Services like SSDP are implemented using HTTP, so SSDP traffic is currently counted as both. This means you can see an HTTP count with no TCP.
- IPv6 traffic does not have an `ip.len` field. This means that the `tcp_ip_length` value in the result set only includes IPv4 traffic.
- Only a subset of IP protocols (http://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml) are picked up by the detectors and passed on to the counts module: `TCP, UDP, IGMP`. Packets for others are dropped.
- Only a subset of non-IP protocols are picked up and passed on to the counts module: `ARP`. Packets for others are dropped.
- Runs as a multi-processing application because Python does not support parallel concurrent threads.
  - Was: This application has multiple concurrent threads but does not execute as parallel operations due to limitations in Python and the GIL.
- NBNS, SMB and SMB2 service counts have not been vetted. They may be correct or may overcount.
- Does not work with multiprocessing type `spawn`. Only works with `fork`. Code adjusted to force `fork` in order to run on Mac with Python 3.8 or later. `spawn` is the Mac default for 3.8+.
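The `fork` requirement and the queue-connected stages can be illustrated with the standard `multiprocessing` module. This is a simplified sketch with a single hypothetical stage, not the project's code. The `analyze` function, the `None` sentinel, and the hard-coded protocol are assumptions made for the example.

```python
import multiprocessing as mp


def analyze(in_q, out_q):
    """Hypothetical stage: tag each item, stop on the None sentinel."""
    while True:
        item = in_q.get()
        if item is None:
            out_q.put(None)  # pass the sentinel downstream
            break
        out_q.put({"packet": item, "protocol": "tcp"})  # placeholder protocol


def run_pipeline(packets):
    # Ask for the fork start method explicitly instead of relying on the
    # platform default. Raises ValueError where fork is unavailable.
    ctx = mp.get_context("fork")
    in_q, out_q = ctx.Queue(), ctx.Queue()
    worker = ctx.Process(target=analyze, args=(in_q, out_q))
    worker.start()
    for packet in packets:
        in_q.put(packet)
    in_q.put(None)
    results = []
    while True:
        result = out_q.get()
        if result is None:
            break
        results.append(result)
    worker.join()  # join only after the output queue has been drained
    return results
```

Using a context object keeps the choice local to the pipeline. Calling `multiprocessing.set_start_method("fork")` once at program start is the global alternative.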
tcp_frame_length | tcp_ip_length | tcp_length | udp_frame_length | udp_ip_length | udp_length | arp_frame_length | num_tls | num_http | num_ftp | num_ssh | num_smtp | num_dhcp | num_dns | num_nbns | num_smb | num_smb2 | num_pnrp | num_wsdd | num_ssdp | num_tcp | num_udp | num_arp | num_igmp | num_connection_pairs | num_ports | num_packets | window_start_time | window_end_time |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 2006 | 1084 | 1118 | 210 | 0 | 2 | 0 | 0 | 0 | 0 | 16 | 4 | 0 | 0 | 0 | 0 | 1 | 0 | 22 | 5 | 18 | 8 | 14 | 46 | 14806 | 19806 |
0 | 0 | 0 | 3479 | 2699 | 2487 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 15 | 2 | 0 | 0 | 0 | 2 | 0 | 28 | 0 | 6 | 4 | 8 | 34 | 19806 | 24806 |
0 | 0 | 0 | 16524 | 2781 | 14822 | 0 | 0 | 17 | 0 | 0 | 0 | 3 | 4 | 0 | 1 | 0 | 0 | 6 | 16 | 0 | 33 | 0 | 9 | 5 | 13 | 42 | 24806 | 29806 |
0 | 0 | 0 | 9798 | 1810 | 8636 | 84 | 0 | 18 | 0 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 18 | 0 | 23 | 2 | 2 | 5 | 7 | 27 | 29806 | 34806 |
0 | 0 | 0 | 16843 | 5915 | 15239 | 420 | 0 | 10 | 0 | 0 | 0 | 0 | 12 | 4 | 0 | 0 | 0 | 6 | 7 | 0 | 36 | 10 | 20 | 10 | 14 | 66 | 34806 | 39806 |
0 | 0 | 0 | 14842 | 7344 | 12918 | 168 | 0 | 33 | 0 | 0 | 0 | 1 | 10 | 2 | 0 | 0 | 0 | 0 | 15 | 0 | 46 | 4 | 6 | 8 | 12 | 56 | 39806 | 44806 |
0 | 0 | 0 | 8476 | 4324 | 7168 | 0 | 0 | 22 | 0 | 0 | 0 | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 11 | 0 | 32 | 0 | 0 | 4 | 7 | 32 | 44806 | 49806 |
0 | 0 | 0 | 5126 | 2956 | 4244 | 0 | 0 | 5 | 0 | 0 | 0 | 0 | 6 | 6 | 2 | 0 | 0 | 2 | 3 | 0 | 23 | 0 | 0 | 4 | 11 | 23 | 49806 | 54806 |
0 | 0 | 0 | 2602 | 1535 | 1924 | 210 | 0 | 6 | 0 | 0 | 0 | 1 | 2 | 4 | 4 | 0 | 0 | 0 | 4 | 0 | 17 | 5 | 0 | 6 | 10 | 22 | 54806 | 59806 |
0 | 0 | 0 | 4914 | 2800 | 4168 | 84 | 0 | 6 | 0 | 0 | 0 | 0 | 3 | 3 | 3 | 0 | 0 | 2 | 3 | 0 | 19 | 2 | 0 | 5 | 12 | 21 | 59806 | 64806 |
6857 | 6615 | 6111 | 18677 | 9873 | 16171 | 504 | 0 | 16 | 0 | 0 | 0 | 7 | 21 | 2 | 4 | 0 | 2 | 4 | 7 | 13 | 59 | 12 | 21 | 16 | 33 | 105 | 64806 | 69806 |
6929 | 6747 | 6203 | 34439 | 17134 | 29359 | 420 | 0 | 31 | 0 | 0 | 0 | 5 | 23 | 30 | 2 | 0 | 15 | 6 | 14 | 13 | 120 | 10 | 24 | 15 | 36 | 167 | 69806 | 74806 |
29150 | 14857 | 26074 | 15555 | 8969 | 12973 | 0 | 0 | 17 | 0 | 0 | 0 | 0 | 13 | 17 | 4 | 0 | 5 | 2 | 7 | 46 | 63 | 0 | 4 | 11 | 24 | 113 | 74806 | 79806 |
If you are using this for research purposes please cite the publication listed below. The bibtex is as follows.
@INPROCEEDINGS{Raja1805:INSecS,
AUTHOR="Nadun Rajasinghe and Jagath Samarabandu and Xianbin Wang",
TITLE="{INSecS-DCS:} A Highly Customizable Network Intrusion Dataset Creation
Framework",
BOOKTITLE="2018 IEEE Canadian Conference on Electrical \& Computer Engineering (CCECE)
(CCECE 2018)",
ADDRESS="Quebec City, Canada",
DAYS=13,
MONTH=may,
YEAR=2018,
KEYWORDS="Network Intrusion Detection; Dataset creation; Security",
ABSTRACT="One critical challenge in design and operation of network intrusion
detection systems (IDS) is the limited datasets used for IDS training and
its impact on the system performance. If the training dataset is not
updated or lacks necessary attributes, it will affect the performance of
the IDS. To overcome this challenge, we propose a highly customizable
software framework capable of generating labeled network intrusion datasets
on demand. In addition to the capability to customize attributes, it
accepts two modes of data input and output. One input method is to collect
real-time data by running the software at a chosen network node and the
other is to get Raw PCAP files from another data provider. The output can
be either Raw PCAP with selected attributes per packet or a processed
dataset with customized attributes related to both individual packet
features and overall traffic behavior within a time window. The abilities
of this software are compared with a product which has similar intentions
and notable novelties and capabilities of the proposed system have been
noted."
}
You can find the original research paper on ResearchGate and related papers at the University of Western Ontario.
- Migrated from print() statements to logging. Logging levels and formats are configured in `logging_config.yaml`
- Added IGMP counts
- Added num_smb, num_smb2, num_pnrp, num_wsdd, num_ssdp
- Added column that shows when that row ends
- Eliminated global variables other than the shared memory queues. This probably means this only works with `fork`
- Unified pcap and live tshark into single set of classes
- Added command line options
- Added IPv6 to one of the detectors. Can't remember which one
- Migrated from multi-threaded to multi-process to make use of multiple cores. This is a way to get around the GIL
- Added support for count based tumbling window. Now supports both time and count.
- Added command line option support for -wt or -wp.
- Supports either or both time based or count based window boundaries.
- The window behavior must be specified as a parameter in order to support one or both window parameters.
- Force multiprocessing to run in `fork` mode. Linux does this natively. Mac Python 3.8 and later use `spawn`
- Added unit tests
- Minimal requirements.txt added back
- Now counts non IP packets that were not analyzed separately from IP packets not analyzed
- Time and Count based Tumbling Windows for Network Packet Statistics https://youtu.be/6xa0fqRYpZM
- Tumbling time windows for network analysis https://www.youtube.com/watch?v=b3MaxbAAdDw
- Using Python to implement tumbling time windows for network analysis https://www.youtube.com/watch?v=jKgGh5a5gFA
- Wireshark/tshark (`tshark`) is installed, reachable, and on the PATH. Installation varies depending on your OS. Ubuntu install: `sudo apt install tshark`
- This software is written in Python 3, so you will need to install Python 3. Most up-to-date Linux distributions already have it installed. Install it the way you wish. These were my notes.
  sudo apt-get update
  sudo apt-get install python3.8.5
  sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.6 1
  sudo update-alternatives --install /usr/bin/python python /usr/bin/python3.8 2
  sudo update-alternatives --config python
  or if you are running anaconda
  conda update --prefix /home/joe/anaconda3 anaconda
- The `requirements.txt` file has been re-added to the project without version numbers.
  pip3 install -r requirements.txt
- You can manually install the dependencies
  pip3 install pyyaml
  pip3 install pytest
- Running in live capture mode may require sudo privileges to access the network in promiscuous mode. You will be prompted for a password at execution time.
  cmd = "sudo tshark -r /path/filename -V -T json"
- Mac Python 3, cpython, requires a yaml install.
  pip3 install pyyaml
- pypy is slower than cpython as of 2022/10. If running pypy then you need to install pyyaml:
  pypy3 -mpip install pyyaml
Sample pcap files for testing can be found at https://wiki.wireshark.org/SampleCaptures
This project now has minimal unit tests to verify tumbling window behavior. The unit tests require pytest.
You must install the testing dependencies if you wish to run the tests.
python3 -m pip install pytest
pytest
You should see a variant of
============================================================================= test session starts ==============================================================================
platform linux -- Python 3.8.10, pytest-7.2.0, pluggy-1.0.0
rootdir: /home/joe/Network-intrusion-dataset-creator
collected 7 items
tests/tumblingwindow_test.py ..... [100%]
============================================================================== 5 passed in 0.01s ===============================================================================
- You can see the command line options
python3 main.py --help
$ python3 main.py --help
usage: main.py [-h] [-s SOURCEFILE] [-i INTERFACE] [-l HOWLONG] [-o OUTFILE] [-wt WINDOWTIME] [-wp WINDOWPACKETS] [-t TSHARK]

Create time/count window statistics for pcap/pcapng stream or file

optional arguments:
  -h, --help            show this help message and exit
  -s SOURCEFILE, --sourcefile SOURCEFILE
                        provide a pcap input file name instead of reading live stream
  -i INTERFACE, --interface INTERFACE
                        use an interface. [eth0]
  -l HOWLONG, --howlong HOWLONG
                        number of seconds to run live mode. [120]
  -o OUTFILE, --outfile OUTFILE
                        change the name of the output file [dataset.csv]
  -wt WINDOW, --windowtime WINDOW
                        time window in msec [5000]
  -wp COUNT, --windowpackets COUNT
                        max packets in window [None]
  -t TSHARK, --tshark TSHARK
                        tshark command [tshark]
- The system needs to know the windowing parameters. Tumbling window behavior is specified with either the window time or the window packet count. At least one must be specified.
- The default behavior is to work off of live tshark output. You can change this by setting `--sourcefile` on the command line.
  - In live mode you will be running Wireshark (tshark) and capturing packets. These will be used to make your own dataset depending on the options you pick.
- The results are stored in a CSV file, `dataset.csv`. You can override this with the `--outfile` command line option.
- You can set the capture time on a live network adapter with the `--howlong <time>` option. The default is stored in `set.py:how_long`. The time is in seconds.
- You can analyze an existing .pcap/.pcapng capture file and make a dataset in CSV format. Specify the path to the input pcap/pcapng capture file with `--sourcefile <path>`. The default is stored in `input_file_path` in `set.py`.
- You can define a time window for each aggregation record. Specify the time with the `-wt <size>` command line option. The default is stored in `settings.py`. The time is in milliseconds.
- You can define a packet window, the max number of packets, for each aggregation record. Use the `-wp <count>` command line option.
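The options listed above map naturally onto `argparse`. The following is a sketch of how such a parser could be declared, not a copy of `main.py`. The defaults for `-wt` and `-wp` are set to `None` here so that the "at least one window parameter" rule can be checked explicitly, which is an assumption about how the real program enforces it.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(
        description="Create time/count window statistics for pcap/pcapng stream or file"
    )
    parser.add_argument("-s", "--sourcefile", help="pcap input file instead of live stream")
    parser.add_argument("-i", "--interface", default="eth0", help="interface for live capture")
    parser.add_argument("-l", "--howlong", type=int, default=120, help="seconds to run live mode")
    parser.add_argument("-o", "--outfile", default="dataset.csv", help="output file name")
    parser.add_argument("-wt", "--windowtime", type=int, default=None, help="time window in msec")
    parser.add_argument("-wp", "--windowpackets", type=int, default=None, help="max packets in window")
    parser.add_argument("-t", "--tshark", default="tshark", help="tshark command")
    return parser


def parse(argv):
    """Parse argv and insist on at least one windowing parameter."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.windowtime is None and args.windowpackets is None:
        parser.error("specify -wt, -wp, or both")  # exits with status 2
    return args
```

For example, `parse(["--sourcefile", "smtp-ssl.pcapng", "-wt", "5000"])` returns a namespace with `windowtime` set to 5000 and `windowpackets` left as `None`.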
Linux users can set the execute bit on main.py and run main.py directly without the `python3` part.
chmod +x main.py
Description | Command |
---|---|
Use 5000 msec window reading from Razi...pcap file and write output to dataset.csv | python3 main.py --sourcefile Razi_15012021.pcap -wt 5000 |
Use 5000 msec window reading from smtp-ssl.pcapng file from https://wiki.wireshark.org/SampleCaptures and write output to dataset.csv | python3 main.py --sourcefile smtp-ssl.pcapng -wt 5000 |
Use 100 packet window reading from smtp-ssl.pcapng file from https://wiki.wireshark.org/SampleCaptures and write output to dataset.csv | python3 main.py --sourcefile smtp-ssl.pcapng -wp 100 |
Try this
sudo tshark -i eth0 -a duration:120 -w /tmp/foo.pcap -F pcap
You will end up with one zombie python3 process if you press `ctrl-c` in the command line you ran this under.
Run some version of this:
pkill -f tshark
pkill -f python3
This program makes use of 5 cores: 4 for Python and one for tshark.
It maxes out the cores so hyperthreaded cores will not count towards performance.
These tests were run on two different machines
- 16 core xeon v2 2.0/2.5 Ghz from SSD.
- 8 core Ryzen 5800X 3.8Ghz from NVMe
Sample | sample file size | real time | analyzed packets | time windows | sample period | python | CPU |
---|---|---|---|---|---|---|---|
Crylock | 143,446,091 B | real 1:43 user 1:40 sys 0:15 | n/a | n/a | 10.04 | tshark (only) | 16C Xeon E5 2640 V2 2.2Ghz SATA/SSD |
Crylock | 143,446,091 B | real 1:47 user 7:14 sys 2:36 | 128778 @ 1259/sec | 122 | 10:04 | cpython | 16C Xeon E5 2640 V2 2.2Ghz SATA/SSD |
Crylock | 143,446,091 B | real 1:15 user 5:17 sys 2:07 | 128778 @ 1578/sec | 122 | 10:04 | cpython | 20C Xeon E5 2680 V2 2.8Ghz SATA/SSD |
Crylock | 143,446,091 B | real 3:07 user 11:39 sys 1:21 | 128778 @ 754/sec | 122 | 10:04 | pypy 3.6 | 16C Xeon E5 2640 V2 2.2Ghz SATA/SSD |
Crylock | 143,446,091 B | real 2:18 user 08:29 sys 0:50 | 128778 @ 1035/sec | 122 | 10:04 | pypy 3.7 | 20C Xeon E5 2680 V2 2.8Ghz SATA/SSD |
Crylock | 143,446,091 B | real 0:21 user 1:38 sys 0:27 | 128778 @ 6150/sec | 122 | 10:14 | cpython | 8C Ryzen 5800X NVME |
Razi | 767,491,552 B | | 573523 @ 1106/sec | 112 | 09:21 | cpython | 16C Xeon E5 2640 V2 2.2Ghz SATA/SSD |
Razi | 767,491,552 B | real 6:07 user 25:31 sys 11:17 | 573523 @ 1562/sec | 112 | 09:21 | cpython | 20C Xeon E5 2680 V2 2.8Ghz SATA/SSD |
Razi | 767,491,552 B | real 1:37 user 7:28 sys 2:17 | 573523 @ 5874/sec | 112 | 09:21 | cpython | 8C Ryzen 5800X NVME |
Razi | 767,491,522 B | total 2:49 user 10:30 sys 1:40 | 573523 @ 3418/sec | 112 | 09:21 | cpython | 8C Macbook M1 |
This benchmark was for the 2-queue, 3-Python-process version. It was a test to see how much the queues cost versus the uplift of having extra processes. For this test we removed the queue between detectors and services.
Sample | sample file size | real time | analyzed packets | time windows | sample period | python |
---|---|---|---|---|---|---|
Crylock | 143,446,091 b | real:1:47 user:6:07 sys:1:22 | 128778 @ 1201/sec | 122 | 10:04 | cpython |
Maze | 1,045,083,415 b | real:11:21 user:38:38 sys:8:33 | 770,987 @ 1131/sec | 94 | 7:59 | cpython |
- Analysis times are linear with the number of packets processed
- Tested with ransomware samples from unavarra.es some of which may have originated on other sites.
- Running the 5 process (4 queue) version on quad core machines degrades performance by 10%. This is because we are CPU bound and have more processes than cores.
- Crylock and Razi retrieved from http://dataset.tlm.unavarra.es/ransomware/
- The Macbook M1 was limited by `tshark` performance as observed using `top`. The `tshark` process was at 100% while all the others were at 50%.
These examples all use the same sample data set, smtp-ssl.pcapng, available on the Wireshark site.
Purely time based window
~/Network-intrusion-dataset-creator$ python3 main.py --sourcefile smtp-ssl.pcapng -wt 5000
1 packetCount: 21 startTime: 11:31:42.005000 endTime: 11:31:42.450000
2 packetCount: 0 startTime: 11:31:47.005000 endTime: 11:31:47.005000
3 packetCount: 0 startTime: 11:31:52.005000 endTime: 11:31:52.005000
4 packetCount: 4 startTime: 11:31:57.005000 endTime: 11:31:58.335000
5 packetCount: 0 startTime: 11:32:02.005000 endTime: 11:32:02.005000
6 packetCount: 0 startTime: 11:32:07.005000 endTime: 11:32:07.005000
7 packetCount: 0 startTime: 11:32:12.005000 endTime: 11:32:12.005000
8 packetCount: 0 startTime: 11:32:17.005000 endTime: 11:32:17.005000
9 packetCount: 0 startTime: 11:32:22.005000 endTime: 11:32:22.005000
10 packetCount: 4 startTime: 11:32:27.005000 endTime: 11:32:29.517000
11 packetCount: 0 startTime: 11:32:32.005000 endTime: 11:32:32.005000
12 packetCount: 9 startTime: 11:32:37.005000 endTime: 11:32:41.025000
Purely time based window. The larger window (10000 msec instead of 5000 msec) means fewer windows.
~/Network-intrusion-dataset-creator$ python3 main.py --sourcefile smtp-ssl.pcapng -wt 10000
1 packetCount: 21 startTime: 11:31:42.005000 endTime: 11:31:42.450000
2 packetCount: 4 startTime: 11:31:52.005000 endTime: 11:31:58.335000
3 packetCount: 0 startTime: 11:32:02.005000 endTime: 11:32:02.005000
4 packetCount: 0 startTime: 11:32:12.005000 endTime: 11:32:12.005000
5 packetCount: 4 startTime: 11:32:22.005000 endTime: 11:32:29.517000
6 packetCount: 9 startTime: 11:32:32.005000 endTime: 11:32:41.025000
Maximum of 4 packets or 10 seconds, whichever comes first. The small packet max means more windows. There is one window in the middle that timed out before filling.
~/Network-intrusion-dataset-creator$ python3 main.py --sourcefile smtp-ssl.pcapng -wp 4 -wt 10000
1 packetCount: 4 startTime: 11:31:42.005000 endTime: 11:31:42.089000
2 packetCount: 4 startTime: 11:31:42.089000 endTime: 11:31:42.132000
3 packetCount: 4 startTime: 11:31:42.132000 endTime: 11:31:42.212000
4 packetCount: 4 startTime: 11:31:42.212000 endTime: 11:31:42.309000
5 packetCount: 4 startTime: 11:31:42.309000 endTime: 11:31:42.450000
6 packetCount: 1 startTime: 11:31:42.450000 endTime: 11:31:42.450000
7 packetCount: 4 startTime: 11:31:52.450000 endTime: 11:31:58.335000
8 packetCount: 4 startTime: 11:32:29.474000 endTime: 11:32:29.517000
9 packetCount: 4 startTime: 11:32:40.938000 endTime: 11:32:41.025000
10 packetCount: 4 startTime: 11:32:41.025000 endTime: 11:32:41.025000
11 packetCount: 1 startTime: 11:32:41.025000 endTime: 11:32:41.025000
Maximum of 20 packets or 10 seconds, whichever comes first. The large packet window size with the small data set results in several empty windows in the middle.
~/Network-intrusion-dataset-creator$ python3 main.py --sourcefile smtp-ssl.pcapng -wp 20 -wt 10000
1 packetCount: 20 startTime: 11:31:42.005000 endTime: 11:31:42.450000
2 packetCount: 1 startTime: 11:31:42.450000 endTime: 11:31:42.450000
3 packetCount: 4 startTime: 11:31:52.450000 endTime: 11:31:58.335000
4 packetCount: 0 startTime: 11:32:02.450000 endTime: 11:32:02.450000
5 packetCount: 0 startTime: 11:32:12.450000 endTime: 11:32:12.450000
6 packetCount: 4 startTime: 11:32:22.450000 endTime: 11:32:29.517000
7 packetCount: 9 startTime: 11:32:32.450000 endTime: 11:32:41.025000
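The runs above combine a time limit and a packet limit. A minimal sketch of that kind of tumbling logic is shown below. It is not the project's `counts.py` and is not guaranteed to reproduce the listings exactly. The function name, the `(start, end, count)` tuple layout, and the choice to start the next window at the closing packet's time after a count-based close are all assumptions.

```python
def tumble(times, width=None, max_packets=None):
    """Group sorted packet times into (start, end, count) tumbling windows."""
    windows = []
    if not times:
        return windows
    start = end = times[0]
    count = 0
    for t in times:
        # Close every window whose time span has expired, including empty ones.
        while width is not None and t >= start + width:
            windows.append((start, end, count))
            start += width
            end, count = start, 0
        count += 1
        end = t
        # Close on packet count; the next window starts at this packet's time.
        if max_packets is not None and count >= max_packets:
            windows.append((start, end, count))
            start = end = t
            count = 0
    if count:
        windows.append((start, end, count))
    return windows
```

For example, `tumble([0, 12], width=5)` yields a one-packet window, an empty window, and a final one-packet window, which mirrors the empty rows seen in the purely time based runs.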
The source tree is formatted with black using the Visual Studio Code extension.