3scale/apisonator

Improve performance of stats deletion background jobs

davidor opened this issue · 5 comments

We needed to temporarily disable stats deletion background jobs because they are inefficient and take too much time to complete.

Here are some numbers that can help us find a more efficient solution:

  • Different time periods that we can find in stats keys for a whole year: 1 (year) + 12 (months) + 52 (weeks) + 365 (days) + 365*24 (hours) = 9190.
  • Total number of stats keys in a year: 9190 * n_services * n_apps * n_metrics + 9190 * n_services * n_metrics. The first part corresponds to application stats and the second one to service stats (see the sketch after this list).
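
As a sanity check, here is a minimal Ruby sketch of the arithmetic above (the per-period counts are the approximations listed, not exact calendar values):

# Approximate number of period values per year for each granularity,
# as listed above (weeks and hours are rough, not exact calendar counts).
PERIODS_PER_YEAR = 1 + 12 + 52 + 365 + 365 * 24 # => 9190

# Rough total number of stats keys generated in a year:
# application-level keys plus service-level keys.
def yearly_stats_keys(n_services, n_apps, n_metrics)
  PERIODS_PER_YEAR * n_services * n_apps * n_metrics +
    PERIODS_PER_YEAR * n_services * n_metrics
end

yearly_stats_keys(1, 10, 10) # => 1_010_900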

I also measured the runtime of different operations that we need to perform in this kind of background job. Bear in mind that these are just approximations; there are many factors that can alter these numbers (Redis latency, CPU, etc.):

  • Number of keys that can be generated in a second: 27k.
  • How long it takes to delete keys (calling redis.del(keys)): deleting 10k keys in batches of 50 takes 0.15s.
  • Jobs can be enqueued at 750 jobs/s.

According to the numbers above:

  • A job that deletes all the stats generated for a whole year for a specific {service, app, metric} combination would take 9190/27000 = 340 ms to generate the keys plus another ~150ms to delete them. That's about half a second in total. Deleting the stats at the service level for a given metric would take the same.
  • A job that does the same for a month instead of a year would take around 40ms, which is close to what other kinds of jobs take (see the sketch after this list).
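
A minimal sketch of how those per-job estimates follow from the measured rates (all the constants are the approximations above):

# Measured approximations from the list above.
KEYS_GENERATED_PER_SEC   = 27_000
DELETE_SECS_PER_10K_KEYS = 0.15

# Rough estimate of how long a deletion job would take for n_keys.
def estimated_job_seconds(n_keys)
  generation = n_keys / KEYS_GENERATED_PER_SEC.to_f
  deletion   = (n_keys / 10_000.0) * DELETE_SECS_PER_10K_KEYS
  generation + deletion
end

estimated_job_seconds(9190)      # => ~0.48 (whole year of one {service, app, metric})
estimated_job_seconds(9190 / 12) # => ~0.04 (one month)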

I chose to give the numbers for partitioning by year and by month, but we could choose other granularities.

Regarding response codes, I think they can be treated as 8 extra metrics (202, 403, 404, 500, 504, 2xx, 4xx, 5xx).

A possible implementation would be as follows: the job that generates partitions takes into account all the factors (apps, metrics, period of time) and generates small jobs, each of which takes a reasonable time to generate the subset of keys it was assigned and delete them.
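
A rough sketch of that idea follows. The class names and the spec layout are hypothetical, not the actual Apisonator code; it just shows a partition-generating job enqueuing one small job per {app, metric, month}:

# Hypothetical sketch: split a deletion spec into one small job per
# {application, metric, month}; each small job only has to generate and
# delete the keys it was assigned.
class StatsDeletionPartitioner
  def self.enqueue_partitions(spec)
    months(spec[:from], spec[:to]).each do |month_from, month_to|
      spec[:applications].each do |app_id|
        spec[:metrics].each do |metric_id|
          Resque.enqueue(DeleteStatsPartitionJob,
                         spec[:service_id], app_id, metric_id,
                         month_from, month_to)
        end
      end
    end
  end

  # Splits [from, to) into month-sized chunks.
  def self.months(from, to)
    chunks = []
    t = Time.at(from)
    while t.to_i < to
      next_month = Time.new(t.month == 12 ? t.year + 1 : t.year,
                            t.month == 12 ? 1 : t.month + 1, 1)
      chunks << [t.to_i, [next_month.to_i, to].min]
      t = next_month
    end
    chunks
  end
end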

As an example, we have seen a job that generates around 10M keys. In order to delete all those keys with the approach of splitting the work by 1 service, 1 app, 1 metric and 1 year as described above, we'd need to generate around 10M/9190 = 1088 jobs. Enqueuing those jobs would take 1088/750 = 1.45s. If we partitioned by month instead of by year, we'd need 10M/(9190/12) = 13056 jobs, which would take 17.4s to enqueue.

This approach would be much more efficient than the current one. However, the time to enqueue all the smaller jobs could be an issue.

The enqueue time could be reduced if we enqueued jobs using pipelines, which, as far as I know, is not something that the Resque client we are using supports, but it should be doable. Alternatively, we could make those calls in parallel.
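
A rough sketch of the pipelined option, assuming direct access to the Redis connection Resque uses and relying on Resque's storage layout (a set of queue names plus one list of JSON payloads per queue), with the default "resque" namespace:

require 'json'
require 'redis'

# Enqueue many jobs in a single pipeline instead of one round trip per
# Resque.enqueue call. Assumes a raw (non-namespaced) Redis connection and
# Resque's default "resque" key namespace.
def enqueue_pipelined(redis, queue, job_class, args_list)
  redis.pipelined do |pipeline|
    pipeline.sadd('resque:queues', queue.to_s)
    args_list.each do |args|
      payload = JSON.generate('class' => job_class.to_s, 'args' => args)
      pipeline.rpush("resque:queue:#{queue}", payload)
    end
  end
end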

We could also try to find different ways to generate keys more efficiently. We could analyze the code that does that with stackprof or similar and see if we can optimize something.
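
For example, a minimal stackprof run over the key generation code could look like this (generate_stats_keys is just a placeholder for the real code path):

require 'stackprof'

# Placeholder standing in for the real key generation code path.
generate_stats_keys = -> { (1..9190).map { |i| "stats/fake_key:#{i}" } }

# Profile CPU time and dump the result; inspect it later with
# `stackprof stats_keys.dump`.
StackProf.run(mode: :cpu, out: 'stats_keys.dump', raw: true) do
  100.times { generate_stats_keys.call }
end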

@eguzki and I discussed an alternative, recursive approach. Once a job is executed, it would delete some keys and create a new job that has fewer applications, fewer metrics, a shorter period of time, or some combination of those. In the end, we'd have a job with no apps, no metrics, and an interval of 0s; then we'd know that we've finished deleting all the keys of the original job. This approach removes the cost of enqueuing a large number of jobs from a single job, because each one just enqueues another one. However, I see two problems with it. The first one is that deleting all the keys could take a long time, since it serializes the jobs. The second is that it might not be so easy to reduce the job into a smaller one. For example, if we wanted to send one application less to the next job, we'd need to delete everything for that app first.
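
A sketch of the simplest variant of that recursive idea, shrinking only the time interval (class and helper names are hypothetical):

# Hypothetical sketch of the recursive approach: each run deletes one month
# of keys for every app/metric in the spec and re-enqueues the same spec
# with `from` moved forward, until the interval is empty.
class RecursiveStatsDeleteJob
  def self.perform(spec)
    from = spec['from']
    to   = spec['to']
    return if from >= to # interval of 0s: the original deletion is finished

    month_end = [next_month_start(from), to].min
    delete_keys_for(spec.merge('from' => from, 'to' => month_end))

    # Enqueue the remaining work as a new, smaller job.
    Resque.enqueue(RecursiveStatsDeleteJob, spec.merge('from' => month_end))
  end

  def self.next_month_start(ts)
    t = Time.at(ts)
    Time.new(t.month == 12 ? t.year + 1 : t.year,
             t.month == 12 ? 1 : t.month + 1, 1).to_i
  end

  # Hypothetical helper: generates and deletes the keys covered by the
  # given sub-spec (not implemented here).
  def self.delete_keys_for(_sub_spec); end
end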

Let me know what you think @unleashed , @eguzki .

Hi,

Partitioning stats deletion jobs is the same problem as partitioning the stats key generation task. Partitioning stats key generation has proven to be an interesting problem.

The current implementation generates non-overlapping partitions with a constrained number of keys per partition. This limit on the number of keys in each partition keeps the Redis load generated by a single worker under control. However, the current implementation of stats key set partitioning pays a toll: it has to compute all the keys in advance. That has proven unacceptable for a worker job because the time needed to compute all of them is too high. We have measured that computing 12M stats keys takes ~400s, which is unacceptable by all means for a background job, since it delays all the remaining jobs in the queue.

So, the interesting problem is generating partitions in a short time while avoiding enqueuing too many jobs, since that also takes time, as @davidor pointed out.

My contribution in this post is to point out that generating non-overlapping jobs is not trivial, which makes the problem of generating partitions even more interesting. Overlapping jobs not only increase Redis load unnecessarily, they also generate more keys than required, and generating keys takes time, as we pointed out before.

An example of overlapping partitions: let's say we have the following stats deletion job spec:

{
  service_id: 2555417774060,
  applications: [1, 2, 3],
  metrics: [1, 2, 3],
  from: Time.new(2002, 9, 1).to_i,
  to: Time.new(2003, 1, 1).to_i
}

And we decide to partition by 1 month because we have computed that single-month partitions contain an acceptable number of keys. So we will have partitions like

{
  from: Time.new(2002, 9, 1).to_i,
  to: Time.new(2002, 10, 1).to_i
}

or

{
  from: Time.new(2002, 10, 1).to_i,
  to: Time.new(2002, 11, 1).to_i
}

The problem is that the two partitions above fall within the same year, so for the year granularity they will generate the same keys. That is, the two partitions generate overlapping keys, in this example for the year period granularity.
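
A minimal illustration of that overlap, using a simplified key format (not the exact one Apisonator uses):

# Simplified year-granularity key, for illustration only.
def year_key(service_id, app_id, metric_id, ts)
  year_start = Time.at(ts).strftime('%Y0101') # period value = start of the year
  "stats/service:#{service_id}/app:#{app_id}/metric:#{metric_id}/year:#{year_start}"
end

sept = Time.new(2002, 9, 1).to_i
octo = Time.new(2002, 10, 1).to_i

# Both monthly partitions produce the same year-granularity key:
year_key(2555417774060, 1, 1, sept) == year_key(2555417774060, 1, 1, octo) # => true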

Another example of overlapping partitions is partitioning by application. Let's say one job has the application array [1, 2] and we decide to split it into two smaller jobs by application: one for app 1, the other for app 2. Both jobs will generate the same service usage keys (stats/service:service_id/metric/metric_id/period).
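
And similarly for the application split, since the service-level usage key does not depend on the application at all (again, a simplified key format):

# Simplified service-level usage key, for illustration only; note there is
# no application id in it.
def service_usage_key(service_id, metric_id, period_value)
  "stats/service:#{service_id}/metric:#{metric_id}/day:#{period_value}"
end

service_usage_key(2555417774060, 1, '20020901') # generated by the job for app 1
service_usage_key(2555417774060, 1, '20020901') # generated again by the job for app 2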

The recursive approach also has the problem of non-overlapping partitioning, besides those mentioned by David. Let me add that with that approach there is no parallelization: all jobs are serialized in time. We do not have hard time constraints as long as the system keeps running the important jobs concurrently, but it is something to take into account.

More food for thought

For context to anyone reading: the model in Apisonator does not keep references to several entities, and when we keep them there is no referential integrity, so we end up needing to brute force our way out of the problem, because the usual case is that anyone calling into Apisonator fully specifies a key or a small set of keys to act on.

Now I think there is no fundamental problem in the generation of keys itself. All needed parameters are known in advance and the process is fully deterministic.

I want to mention one option we should take into account for this and other future tasks: periodic janitorial tasks, possibly helped by small contributions at key places when doing online work. That would enable us to slowly scan for and build a set of data (think database indexes) that helps us avoid the cost of generating keys for a specific period and combination, at a minimal cost in space (it would create the missing references).
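
A very rough sketch of what those online contributions could look like (the key names and helpers here are hypothetical): every time a stats key is written, it is also recorded in an index set, so a later deletion job can read the keys instead of generating them.

require 'redis'

# Hypothetical index of stats keys per service, maintained during online work.
def record_stats_key(redis, service_id, stats_key)
  redis.sadd("stats_key_index/service:#{service_id}", stats_key)
end

# A deletion job could then fetch the keys directly instead of generating them.
def indexed_stats_keys(redis, service_id)
  redis.smembers("stats_key_index/service:#{service_id}")
end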

That said, the main complexity I see is building facilities to help spread the work over time so we avoid grabbing the CPU for too long. The other problems are avoiding duplicate work and optimising the key generation itself.

Can someone (@davidor, @eguzki) provide a minimal baseline benchmark under bench for the current key generation process (the one you both refer to in your comments) so we can check for progression on that problem?
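
Something as small as this could work as a starting point (StatsKeyGenerator.keys_for is a hypothetical entry point; the real key generation code may have a different name and signature):

require 'benchmark/ips'

# Hypothetical deletion spec, sized so a run takes a few seconds at most.
spec = {
  service_id: 2555417774060,
  applications: (1..10).to_a,
  metrics: (1..10).to_a,
  from: Time.new(2002, 1, 1).to_i,
  to: Time.new(2003, 1, 1).to_i
}

Benchmark.ips do |x|
  x.report('stats key generation') { StatsKeyGenerator.keys_for(spec) }
end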

This no longer applies with the new implementation.

@davidor it might be good to elaborate a bit on why.

@unleashed There are some notes here