- Assume a system in which many jobs are run, each job has
{priority, start time, end time, memory requested, and other proprietary fields}
. - A job need memory & compute resources to be available in the system in order to run.
- In case there are none available, Job will fail.
Given log of all job executions from the past year, including size of memory used.
Study the concurrency of Jobs, in order to optimize pre-allocated resources / budget
Implement the logical component for calculating concurrency of Jobs, and resources used. And offer ways to deduce how many resources should be pre-allocated in order to sustain proper functioning of the system.
Note: aiming for 100% availability may be possible yet costly, so please plan for certain ratio of failure ~10%.
- The log was generated with a "100% availability" system (no failed jobs)
- Log is sorted
The optimal value(s) will be found by iterating on possible optimal value(s).
Each iteration will check a different optimal value(s) and figure out if the acceptable failure ratio constraint was met.
Basically, we'll re-run the log with different resource constraints each time.
Before running the "simulations", we need to determine:
- What is the "10%" failure rate constraint (in concrete numbers)
- How: Find the total number of executed Jobs (log's number of rows)
- The maximum resource(s) usage level - will later be used as baseline (starting point)
- How: Check current system configuration/setup
- Or: run an iteration as describe below just to capture this data and then
SELECT MAX(memory_usage) from <table_name>
- Or: run an iteration as describe below just to capture this data and then
- How: Check current system configuration/setup
To run the "simulations" we need to track the resource(s) usage in every point in time.
Let's assume a 1-minute granularity is enough (the implementation below will work with a more granular timeframe).
We'll use MySQL in this example but any other storage type (e.g. memory) may work as well.
Create a table, in which each row represents a minute in time, with the current level of resource(s) usage:
start_time
, Datetimememory_usage
, Int (initialized to 0)xxx_usage
, Int (initialized to 0)
Basically, we'll parse the log line by line, accumulate/sum the resource usage level in the proper column and count failed jobs, given a specific resource limitation.
Pseudo code:
- Define a
memoryStep
- Used to calculatemaxMemoryAvailable
in each iteration - Initialize
maxMemoryAvailable
to the value found in Step 1 (maximum used) minusmemoryStep
- Initialize
optimalMemoryLevel
tomaxMemoryAvailable
Iteration:
- Parse the log and for each line (executed Job):
- Check if there are enough resources for it to run
- If it cannot run
- Increment a
failedJobsCounter
- If
failedJobsCounter
is greater than the allowed ratio- This iteration fails =>
optimalMemoryLevel
is the optimal value - Break
- This iteration fails =>
- If
- Increment a
- Else
- Add it to the DB (similar to Step 1) and move on to the next line
- Continue (to the next job/line)
- If it cannot run
- Check if there are enough resources for it to run
If an iteration (a full log parse) completed successfully:
- Set
optimalMemoryLevel
tomaxMemoryAvailable
- Set
maxMemoryAvailable
tomaxMemoryAvailable
-memoryStep
- Truncate the table
- Iterate again
Else
- Done
Done: optimalMemoryLevel
is the optimal value
- Use binary search instead of the algo described in Step 3 - O(log(n))
- Use MySQL query "simulations" instead of reading the log line by line:
- Populate the table with no resource limitation at all (all jobs get stored in the table)
- Populate another table with every executed job
start_time
end_time
- Iterate on possible optimal values and run an SQL query to find how many jobs will fail for a defined
maxMemoryAvailable
:- Find the timeframes in which the resource usage exceeds
maxMemoryAvailable
. - Count the number of jobs that ran in these timeframes:
- A job ran in a given timeframe if its
start_time
orend_time
are in the timeframe
- A job ran in a given timeframe if its
- Find the timeframes in which the resource usage exceeds