Assignment-Solution

Looking at the capacity usage patterns, either by inspecting the data or by plotting a bar graph, we can see that usage reaches its peak (the global maximum) during the evening hours, far above the total provisioned load of 4000. Usage is also well above 4000 around midnight. This suggests that users run most of their computations during these hours. In contrast, the load drops to its lowest point (the global minimum) just before midnight, and it is also low a few hours before midnight. The load remains quite low early in the morning and in the hours just after noon, which suggests that users run few computations during these hours.
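
As a rough illustration of how these peak and off-peak hours could be picked out, the sketch below averages usage per hour and flags the hours whose mean exceeds the provisioned load of 4000. The function name `hourly_mean` and the sample values are made up for illustration; they are not the assignment data.

```python
from collections import defaultdict

PROVISIONED = 4000  # total provisioned load, as stated in the text


def hourly_mean(samples):
    """Average usage per hour of day.

    `samples` is a list of (second_of_day, usage) pairs.
    """
    buckets = defaultdict(list)
    for second, usage in samples:
        buckets[second // 3600].append(usage)
    return {hour: sum(v) / len(v) for hour, v in sorted(buckets.items())}


# Made-up samples (NOT the assignment data): (second of day, usage)
samples = [(88, 5500), (98, 6500), (30000, 1200), (68400, 13760), (82800, 300)]

means = hourly_mean(samples)
over = [hour for hour, mean in means.items() if mean > PROVISIONED]
print(means)
print("hours above provisioned load:", over)
```

The same per-hour means could be passed to any plotting library to produce the bar graph mentioned above.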

Speaking of the missing timestamps, three situations can be inferred:

  1. In the first case, we assume the capacity usage at every missing timestamp was exactly 4000, so the data was not recorded because the total provisioned load was fully used. The given table would then contain only the outlier data points where the capacity usage was not equal to 4000. This would also mean that, in a day of 86400 seconds, there were only 22 outliers, and at the remaining 86378 timestamps the usage matched the provisioned load exactly (i.e. capacity usage = 4000). Usage would then be 4000 about 99.97% of the time, which makes 4000 the optimal choice and leaves nothing further to optimise.
  2. In the second case, we assume the capacity usage at the missing timestamps was 0, which would call for optimisation. However, it is implausible for the usage to be 0 at 86378 timestamps of the day, since that would mean the provisioned capacity was almost entirely wasted and would defeat the whole goal.
  3. Finally, a brief look at the simulated data shows that between consecutive timestamps (one second apart) the change in capacity usage is quite small compared with the change between timestamps that are minutes apart. E.g. the computation capacity used from 12:01:28 AM to 12:01:38 AM stays within 5500 to 6500, whereas one could expect the capacity used to decline between 12:01:38 AM and 12:03:24 AM (which may be a local minimum). We could therefore expect a continuous trend across the missing timestamps, treating some of the given timestamps as local maxima and local minima. This is analogous to the number of users watching a live stream: between two timestamps only a few seconds apart, the number of users changes gradually rather than jumping sharply.
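
Under the third scenario, one simple way to fill in the missing seconds is linear interpolation between neighbouring recorded timestamps. The sketch below is only an illustration of that idea: the helper `interpolate_usage` is hypothetical, linear interpolation is an assumed model for the "continuous trend", and the usage values are made up (only the three clock times come from the example above).

```python
def interpolate_usage(samples):
    """Linearly fill every second between recorded samples.

    `samples` is a list of (second_of_day, usage) pairs, sorted by time
    with distinct timestamps. Returns a dict mapping each second from the
    first to the last sample to an estimated usage.
    """
    filled = {}
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        for t in range(t0, t1):
            frac = (t - t0) / (t1 - t0)
            filled[t] = u0 + frac * (u1 - u0)
    last_t, last_u = samples[-1]
    filled[last_t] = last_u
    return filled


# 12:01:28 AM, 12:01:38 AM and 12:03:24 AM as seconds of the day.
# The usage values are made up for illustration.
samples = [(88, 5500), (98, 6500), (204, 4000)]

filled = interpolate_usage(samples)
print(len(filled), "seconds estimated")
print("estimate at 12:01:33 AM:", filled[93])
```

Any other smooth interpolation could be swapped in; the point is only that nearby seconds receive nearby values.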

To optimise the above problem, we make a few assumptions. First, we assume that the capacity usage trend in the given data closely resembles the trend in the complete data. Optimising on the given data should then give much the same result as optimising on the complete data, and the optimum value of capacity purchased should be more or less the same in both cases. Second, for simplicity, we take the price of 1 unit of capacity per second to be 1 rupee. We then apply a brute-force search by plotting a Capacity versus Cost graph for all capacities from 1 to 13760, which shows the points where the cost is minimum. The entire analysis is performed in Python and is shown in the Jupyter Notebook above. Instead of brute force, we could also use an optimisation algorithm such as Gradient Descent, which could simplify the search. In addition, we can use the method of Load Shifting, wherein we purchase a higher computation capacity only for those timestamps where the load is high (during the evening) and purchase the usual amount of capacity when the load is low. This would also ensure that we don't pay that extra amount during peak load hours.
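
The brute-force search described above can be sketched as follows. This is not the notebook's code. The cost model is an assumption made for illustration: the purchased capacity is paid for at every sample at 1 rupee per unit, and any usage above the purchased capacity is charged at a hypothetical `overage_price`. The usage values are made up as well.

```python
def total_cost(capacity, usage, unit_price=1.0, overage_price=3.0):
    """Cost of buying `capacity` for every sample plus an overage charge.

    ASSUMPTION: usage above the purchased capacity is billed at
    `overage_price` per unit; this rate is hypothetical.
    """
    base = capacity * unit_price * len(usage)
    overage = sum(max(0, u - capacity) for u in usage) * overage_price
    return base + overage


def best_capacity(usage, max_capacity, **prices):
    """Try every capacity from 1 to max_capacity and keep the cheapest."""
    costs = {c: total_cost(c, usage, **prices) for c in range(1, max_capacity + 1)}
    best = min(costs, key=costs.get)
    return best, costs[best]


# Made-up usage samples (NOT the assignment data)
usage = [1000, 2000, 3000, 6000, 9000]

best, cost = best_capacity(usage, max_capacity=max(usage))
print("cheapest capacity:", best, "at cost", cost)
```

Plotting the `costs` dictionary would give the Capacity versus Cost graph. With a different overage rate, or a different cost model altogether, the minimum would move accordingly.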