Distribution parameters
Closed this issue · 6 comments
Hello!
I have generated some trace files with Gadget and plotted the distributions. I have played around a bit with the parameters to understand how the distribution changes. I did not fully understand what the parameters a and b (i.e. the range) did to the distributions as well as how the constant affected the distribution. It would be much appreciated if I could get clarification about these parameters and how they affect the distribution.
Best regards,
Fredrik
Hi,
Could you please let me know which experiment/distribution you are referring to? Thank you
I'm referring to the constant distribution and the uniform distribution in the "key popularity distribution for the first event generator" configuration. Below are four plots of the key popularity distribution: two for traces with a constant distribution (constant parameter 1 and 10, respectively) and two for traces with a uniform distribution (parameters (a=1, b=2) and (a=1, b=5), respectively). It would be much appreciated if you could explain why the distributions look the way they do and how the parameters affect them.
Thank you. I am not sure what the x-axis and y-axis are in your figures.
The event key popularity is a mechanism for choosing the keys for generated events. Each event is assigned to one or more windows in a windowed operator. Each window has a unique key that is derived from the window timestamp and event keys.
I think when event keys follow a uniform distribution, the keyspace of the Gadget-generated trace is larger than in the constant distribution case.
PS: The event keys play an important role in load balancing or sharding. If event keys follow a uniform distribution, it is easier to divide the load among different operators than when all events have the same key (constant distribution).
Thank you for taking the time to answer my questions!
My bad for not writing out what the x-axis and y-axis represent. The x-axis is each key in the trace (in these plots 10000 operations are generated) and the y-axis is the number of times each key is accessed, regardless of the operation type. The data is also sorted by the number of accesses to give an overview of the popularity distribution. Comparing the x-axis of the constant distribution plots with the x-axis of the uniform distribution plots, I agree that the constant distribution seems to have a smaller keyspace. The keyspace size also increases when the range of the uniform distribution increases. By running the Python script again and printing the keyspace size, I can confirm that constant=1 has a keyspace size of 108 keys, constant=2 has a size of 107 keys, uniform a=1 b=2 has a size of 212 keys and uniform a=1 b=5 has a size of 505 keys.
However, I don't think I fully understand what a constant key distribution is. Would you mind explaining what that means and how the constant affects the distribution? And also how the range (the "a" and "b" parameters) in the uniform distribution affects the distribution?
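For reference, the counting behind these plots can be sketched roughly as follows. This is not the script that produced the plots; the trace format (a flat list of accessed keys) and the function name `key_popularity` are assumptions made for illustration.

```python
from collections import Counter

def key_popularity(trace_keys):
    """Return (key, access_count) pairs sorted from most to least accessed."""
    # Every access counts once, regardless of the operation type.
    return Counter(trace_keys).most_common()

# Hypothetical trace: one entry per operation, holding the accessed key.
trace_keys = ["k1", "k2", "k1", "k3", "k1", "k2"]
popularity = key_popularity(trace_keys)

print(popularity)       # [('k1', 3), ('k2', 2), ('k3', 1)]
print(len(popularity))  # keyspace size: 3
```

The length of the result is the keyspace size, and the sorted counts are what the y-axis shows.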
Sure. I hope you find the following helpful.
What is the event key:
Each event ingested by a stream processing system (SPS) is typically associated with a key. We offer multiple key popularity distributions, including constant and uniform distributions.
Constant means all events ingested by the SPS have the same key (which might be rare in practice).
Uniform(a, b) means event keys are drawn uniformly from the range a to b.
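A minimal sketch of the two distributions, for illustration only. It is not Gadget's implementation: the function `sample_event_keys`, its parameters, and the assumption that keys are integers with both endpoints a and b included are all hypothetical.

```python
import random
from collections import Counter

def sample_event_keys(distribution, n_events, rng, constant=1, a=1, b=2):
    """Return n_events event keys drawn from a constant or uniform distribution."""
    if distribution == "constant":
        # Every event carries the same key.
        return [constant] * n_events
    if distribution == "uniform":
        # Assumption: integer keys, with both a and b included.
        return [rng.randint(a, b) for _ in range(n_events)]
    raise ValueError(f"unknown distribution: {distribution}")

rng = random.Random(0)
constant_keys = sample_event_keys("constant", 10_000, rng, constant=10)
uniform_keys = sample_event_keys("uniform", 10_000, rng, a=1, b=5)

print(Counter(constant_keys))                 # one key receives every event
print(sorted(Counter(uniform_keys).items()))  # five keys with similar counts
```

Under these assumptions, changing the constant only changes which single key is used, while widening the range (a, b) increases the number of distinct event keys.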
What is the trace generated by Gadget?
Depending on the type of operator (e.g., tumbling window), Gadget mimics the internals of an SPS and generates the operator's state store workload. The figures that you plotted are state store workloads generated by Gadget. If it is a window operator, Gadget, like an SPS, fragments the input event stream into smaller batches called windows. Gadget assigns each window a unique key (these are the keys that you plotted). The relationship between event keys and state store keys depends on the operator type. We discussed some aspects of this relationship in the first part of the Gadget paper.
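To make the relationship between event keys and state store keys concrete, here is a rough sketch for a tumbling window. It is an assumption-laden illustration, not Gadget's actual key derivation: the function `tumbling_window_state_keys`, the (window start, event key) key format, and the window size are all hypothetical.

```python
import random

def tumbling_window_state_keys(events, window_size):
    """Map (timestamp, event_key) pairs to hypothetical state store keys."""
    state_keys = []
    for timestamp, event_key in events:
        # Each event falls into exactly one tumbling window.
        window_start = timestamp - (timestamp % window_size)
        # Assumed key format: window start combined with the event key.
        state_keys.append((window_start, event_key))
    return state_keys

rng = random.Random(0)
timestamps = range(10_000)
constant_events = [(t, 1) for t in timestamps]
uniform_events = [(t, rng.randint(1, 5)) for t in timestamps]

constant_keyspace = set(tumbling_window_state_keys(constant_events, window_size=100))
uniform_keyspace = set(tumbling_window_state_keys(uniform_events, window_size=100))

print(len(constant_keyspace))  # 100: one state key per window
print(len(uniform_keyspace))   # at most 500: one per (window, event key) pair
```

In this sketch the keyspace size is roughly the number of windows times the number of distinct event keys, which would be consistent with the reported sizes growing from about 100 (constant) to about 200 (a=1, b=2) and about 500 (a=1, b=5).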
Thank you very much for the clarification!