Pacsketch is a software tool that uses probabilistic sketch data-structures to determine whether packet traces contain anomalous behavior. The available data structures are MinHash and HyperLogLog. The key idea is that if we build a sketch over the incoming network data (packet header/connection data), we can compare it against existing reference sketches to determine which traffic pattern it most resembles, and thereby identify anomalous sets of data.
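To make that idea concrete, here is a minimal Python sketch of the comparison using a bottom-k MinHash (illustrative only; pacsketch itself is the C++ tool described below, and the function names here are hypothetical).

# Minimal illustration of the core idea (not pacsketch's implementation): build a
# bottom-k MinHash sketch over a window of records and label the window according to
# whichever reference sketch (normal or attack) it is more similar to.
import hashlib

def bottom_k_sketch(records, k=100):
    """Keep the k smallest hash values seen across the distinct records."""
    hashes = {int(hashlib.md5(r.encode()).hexdigest(), 16) for r in records}
    return sorted(hashes)[:k]

def jaccard_estimate(sketch_a, sketch_b, k=100):
    """Estimate J(A,B) from two bottom-k sketches."""
    union = sorted(set(sketch_a) | set(sketch_b))[:k]   # bottom-k of the union
    shared = len(set(union) & set(sketch_a) & set(sketch_b))
    return shared / len(union)

def classify_window(window, normal_ref, attack_ref, k=100):
    w = bottom_k_sketch(window, k)
    j_normal = jaccard_estimate(w, normal_ref, k)
    j_attack = jaccard_estimate(w, attack_ref, k)
    return "attack" if j_attack > j_normal else "normal"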
This project was completed for my course project in EN.601.714 - Advanced Computer Networks. The final paper as well as the presentations are saved in the deliverables/ folder.
To build pacsketch, clone the repository and use CMake:
git clone --recursive https://github.com/oma219/pacsketch.git
cd pacsketch
mkdir build && cd build
cmake ..
make install
The submodule can be updated using the following command:
git submodule update --init --recursive
Pacsketch can be used through one of its sub-commands: build, dist, and simulate.
build - takes in an input dataset, builds either a HyperLogLog or MinHash sketch, and outputs the estimated cardinality
dist - takes in two input datasets, builds the sketches, and outputs the Jaccard similarity between the two sketches
simulate - takes in a training and test set, simulates windows of records, and computes the Jaccard similarity with respect to reference sketches
The build and dist sub-commands can be used with either FASTA files or the networking dataset (NSL-KDD) as input. The FASTA input can be generated using the utility programs shown below; it was used as test input during development. The simulate sub-command only accepts the networking dataset (NSL-KDD) as input.
The build sub-command takes in a single dataset, builds the requested sketch, and can output the cardinality of the sketch. The example below runs the command on a networking dataset (NSL-KDD) to build a MinHash sketch, and the resulting cardinality is output.
# Command run ...
./pacsketch build -i normal1.csv -M -k 100 -c
# Output ...
Estimated_Cardinality: 6700
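For intuition on how a sketch can estimate cardinality, here is a minimal HyperLogLog estimator in Python. This is a sketch of the standard raw estimator only, not pacsketch's implementation, and it omits the small/large-range corrections a production implementation would use.

# Minimal HyperLogLog cardinality estimator (illustrative only).
import hashlib

P = 12                                   # 2^12 = 4096 registers
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)         # bias-correction constant for large M

def hll_estimate(items):
    registers = [0] * M
    for item in items:
        h = int(hashlib.sha1(item.encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - P)                        # first P bits pick a register
        rest = h & ((1 << (64 - P)) - 1)
        rank = (64 - P) - rest.bit_length() + 1    # position of the leftmost 1-bit
        registers[idx] = max(registers[idx], rank)
    return ALPHA * M * M / sum(2.0 ** -r for r in registers)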
As mentioned above, the dist sub-command takes in two datasets and computes the Jaccard similarity as well as the individual cardinality of each dataset. The command below shows an example using two subsets of the NSL-KDD dataset. It builds a MinHash sketch (k=100) for each file and computes the Jaccard similarity. In addition, it outputs the cardinalities of the individual sketches and the union sketch.
# Command run ...
./pacsketch dist -i dataset1.csv -i dataset2.csv -M -k 100
# Output ...
Estimated values based on MinHash sketches ...
|SET(A)| |SET(B)| |SET(AUB)| J(A,B)
6700 7000 N/A 0.2739
This sub-command simulates windows of connection records with a certain percentage of anomalous records, and then compares each window with the reference sketches for normal and anomalous records. For more information on how to run this sub-command, see the Analysis Scripts section below.
generate_fasta
This utility program generates a FASTA file, the standard file format for DNA sequences. It was used to generate basic test data for testing the implementation of the data structures. The usage is shown below, where it generates one sequence with a length of 30,000 bp. It also outputs the cardinality of the sequence in terms of 31-mers.
./generate_fasta -k 31 -n 1 -l 30000
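For reference, the "cardinality in terms of 31-mers" can be reproduced with a few lines of Python; this only illustrates the definition assumed here (the number of distinct k-mers), not the utility's actual code.

# Number of distinct k-mers (length-k substrings) in a sequence (illustrative only).
def kmer_cardinality(seq, k=31):
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})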
generate_pair
This utility program generates a pair of FASTA files and outputs the exact cardinality of each file for a given k-mer length, as well as the exact Jaccard similarity between the two FASTA files. The usage is shown below, where this command generates two FASTA files, each 1,000,000 bp long, along with the exact cardinality of the multi-set of 31-mers in each file. These cardinalities, along with the exact Jaccard similarity, are stored in a file with the suffix "stats.txt". All of these files are written using the output file prefix provided with the -o option.
./generate_pair -k 31 -l 1000000 -o /Users/output_dir/prefix
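Similarly, the exact Jaccard similarity reported by generate_pair is simply the set Jaccard over each file's k-mers; a minimal illustration (not the utility's code) is shown below.

# Exact Jaccard between the 31-mer sets of two sequences (illustrative only).
def kmer_set(seq, k=31):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def exact_jaccard(seq_a, seq_b, k=31):
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    return len(a & b) / len(a | b)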
analyze_dataset.py
This utility program both analyzes the KDD-Cup/NSL-KDD dataset and preprocesses it to convert all the real-valued features into discrete features. This is a necessary step prior to building or comparing sketches that involve this network dataset.
This command will generate a series of output files using the provided output prefix (it is a good idea to make a dedicated folder). The most important outputs are the *converted_*_dataset.csv files, which contain the NSL-KDD dataset with the features converted to discrete values.
python3 analyze_dataset.py -i KDDTrain+.txt --kdd -o /path/to/updated_dataset/nsl_kdd \
-t KDDTest+.txt
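To give a sense of what this conversion does, the sketch below shows one common way to discretize a real-valued feature by binning it. The exact binning scheme used by analyze_dataset.py may differ, and the input file name and bin edges here are hypothetical.

# Hypothetical illustration of discretizing a real-valued NSL-KDD feature into bins
# so that records can be hashed into a sketch (analyze_dataset.py may differ).
import pandas as pd

df = pd.read_csv("kdd_subset.csv")                 # hypothetical input file
df["src_bytes_bin"] = pd.cut(df["src_bytes"],      # bucket a continuous feature
                             bins=[-1, 0, 100, 1000, 10000, float("inf")],
                             labels=False)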
Experiment 1: Test the Cardinality and Jaccard Estimates of Each Data-Structure
This experiment generates numerous pairs of datasets with known Jaccard similarity, and then computes the cardinality and Jaccard estimates for each pair. The experiment can be run using the command shown below; the only parameter that needs to be provided is the output directory for all the datasets and results. After running this command, the jaccard_exp_analysis.R script can be used to generate the plots for this experiment.
bash jaccard_exp.sh /path/to/output_dir
# Run jaccard_exp_analysis.R in the exp/ folder to analyze results
Experiment 2: Comparing NSL-KDD features across normal and attack records
This experiment focused on better understanding the features available in the NSL-KDD dataset and visually comparing their distributions to ensure they differ (since we want to use them for classification). The first step is to parse the dataset using the analyze_dataset.py script in the util/ folder. This script generates various files, including the class breakdown, the mean/standard deviation of each feature across normal and attack records, and a modified CSV file used in the next step.
The command below runs the Python script; the kdd_feature_analysis.R script then takes the resulting CSV file and generates a couple of violin plots comparing features across the normal and attack connections.
python3 analyze_dataset.py -i /path/to/KDDTrain+.txt --kdd \
-o /path/to/output/nsl_kdd -t /path/to/KDDTest+.txt
# Run kdd_feature_analysis.R in the exp/ folder to analyze results
Experiment 3: Comparing Jaccard Distributions of Test Windows Against Reference Sketches
In this experiment, my goal was to better understand how the Jaccard distributions with respect to the normal and attack references change as the percentage of anomalous records in a window is varied. To do this, I added a sub-command called simulate to the main pacsketch tool that simulates windows of connection records with certain percentages of anomalous records and then computes the Jaccard similarity with respect to the two reference sketches.
An example command is shown below that generates 1000 test windows, each containing 10,000 records, where 40% of each window is composed of attack records. The output data is automatically written to stdout.
Each of the input *.csv files is produced using the analyze_dataset.py utility program. This is important because it discretizes the input features prior to inserting them into a sketch.
./pacsketch simulate -i normal_dataset.csv -i attack_dataset.csv -M -k 100 \
-n 10000 -w 1000 -a 0.4 > output_data.csv
# Run simulation_analysis.R in exp/ folder to analyze results
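Conceptually, each simulated window is just a random mixture of normal and attack records at the requested ratio, whose sketch is then compared against the two references. A rough Python sketch of that loop is shown below; it is illustrative only (not pacsketch's C++ code) and reuses the hypothetical bottom_k_sketch() and jaccard_estimate() helpers from the first example.

# Rough sketch of the simulation loop (illustrative; parameters mirror -n/-w/-a/-k).
import random

def simulate_windows(normal_records, attack_records, normal_ref, attack_ref,
                     num_windows=1000, window_size=10000, attack_ratio=0.4, k=100):
    for _ in range(num_windows):
        n_attack = int(window_size * attack_ratio)
        window = (random.choices(attack_records, k=n_attack) +
                  random.choices(normal_records, k=window_size - n_attack))
        sketch = bottom_k_sketch(window, k)
        # Yield the Jaccard estimates against the normal and attack references.
        yield (jaccard_estimate(sketch, normal_ref, k),
               jaccard_estimate(sketch, attack_ref, k))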
Experiment 4: Running Pacsketch on Test Data, and Comparing with Sampling
In this experiment, the goal is to compare Pacsketch to a sampling-based approach, since sampling is a widely used technique in anomaly detection. Pacsketch produces two main measurable outputs: the classification of test windows, and an estimate of the percentage of anomalous activity in each window. Since comparing on the former would require building a classification model, I decided to compare the two approaches in terms of their estimates of the percentage of anomalous activity, assuming the sampler is an oracle that knows exactly whether each sampled record is normal or not.
The commands to run this experiment are shown below. They use the simulate sub-command of Pacsketch and provide two test files from the NSL-KDD test dataset. The sub-command simulates windows of data from those files at variable ratios of normal and attack records, and then uses both the Pacsketch and sampling methods to estimate the percentage of records that are anomalous.
./pacsketch simulate -i train_normal_dataset.csv -i train_attack_dataset.csv \
-M -k 200 -n 9000 -w 1000 -t test_normal_dataset.csv \
-t test_attack_dataset.csv > estimated_attack_ratios.csv \
2> confusion_matrix_data.txt
# Run test_simulation_analysis.R in exp/ folder to analyze results
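For intuition, the sampling baseline can be thought of as an oracle that labels a random subset of each window and reports the labeled attack fraction as its estimate. A minimal illustration is below; the function name and sample size are hypothetical, not the code used in the experiment.

# Hypothetical oracle-sampling baseline: sample part of the window, ask the oracle
# for each record's true label, and report the attack fraction of the sample.
import random

def sampled_attack_ratio(window_labels, sample_size=100):
    """window_labels: list of True (attack) / False (normal), one per record."""
    sample = random.sample(window_labels, min(sample_size, len(window_labels)))
    return sum(sample) / len(sample)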
Experiment 5: Comparing Feature Distribution Across Different Attack Types in NSL-KDD
This experiment is aimed at understanding whether there is a visual difference in the features across the security attacks in the NSL-KDD dataset. Its purpose is to determine whether it would be possible to perform multi-class classification of an anomaly, instead of just flagging it as an anomaly without giving a specific type. In this experiment, I focused on three attack types (neptune, ipsweep, and smurf) since they are three of the most common attack types in the dataset.
The command below extracts the records for those three attack types and converts them into a CSV file that can be used by the following R script to analyze the feature distributions.
python3 analyze_dataset.py --kdd -i KDDTrain+.txt -o /path/to/output_dir/nsl_kdd
# Run attack_feature_analysis.R in exp/ to analyze the results