/Sketches

his Python script implements a Count-Min Sketch data structure for estimating flow frequencies, using hash functions to increment and query counter values for each flow.

Primary LanguagePython

Sketches

This is a Python script that implements a Count-Min Sketch data structure for estimating flow frequencies. The class SketchStatistics takes in the width, depth, number of flows, and number of increments as parameters. The sketch uses hash functions to increment and query counter values for each flow. The incFlow method increments the counters for the given flow by hashing and updating the appropriate counter values. The queryFlow method queries the minimum counter value for the given flow by hashing and finding the minimum value among the appropriate counters.

The runSimulationAndCalculateNormalizedRMSE method is used to simulate the Count-Min Sketch and calculate the Normalized RMSE over a set of flows. It increments the counters randomly and updates the real values. It computes the error between the estimated and real frequencies for the flow and appends it to the error_value list. It then computes the Root Mean Square Error (RMSE) and Normalized RMSE over all flows using the error_value list. Finally, it writes the results to a file and returns them as a dictionary.

The main function initializes the Count-Min Sketch with default values for width, depth, and number of flows. It then runs a simulation for different numbers of increments, ranging from 100 to 10000 in increments of 500. For each simulation, it creates a new instance of SketchStatistics and calls runSimulationAndCalculateNormalizedRMSE. It also calls plot_counters to generate a plot of the counter values.

Overall, this script provides a way to evaluate the accuracy of the Count-Min Sketch and compare different increments. The Normalized RMSE is a useful metric for measuring the accuracy of the Count-Min Sketch in estimating the frequencies. This implementation uses hash functions to increment and query counter values for each flow, making it a memory-efficient way to estimate flow frequencies in large datasets.