/logs-drainer

Solution Explanation

Given that the number of sources could potentially be very large, we want to optimize how we access each of them. A simple solution is to use a hash table. In JS, one of the data structures backed by a hash table is a Set, which allows O(1) average-time access to its elements.
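
As a minimal, illustrative sketch (the activeSources, logSources, and someSource names here are hypothetical, not taken from the actual code), a Set can track which sources still have logs to drain with constant-time membership checks:

```js
// Hypothetical sketch: track the sources that still have logs left.
const activeSources = new Set(logSources); // built once, O(n)

// has() and delete() are O(1) on average thanks to the underlying hash table.
if (activeSources.has(someSource)) {
  activeSources.delete(someSource); // drop a source once it's exhausted
}
```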

As mentioned in the task, the total number of logs could be enormous, meaning they can't all be loaded into memory at once. Therefore, we'll store only one log from each source at any given time. It's important to note that this solution still requires the server to hold one element per source simultaneously. If that isn't possible, for example if there are billions of sources and downloading even one log from each would exceed the memory available to a Node.js process, we would need a more complex solution involving parallel programming (SharedArrayBuffer and related abstractions), or even running multiple processes and orchestrating them with built-in modules such as cluster or child_process.

For now we assume that we're able to store one log from each source. To decide efficiently which log should be printed next, we'll use a priority queue implemented with a min-heap, which lets us dequeue the next value and enqueue a new one in O(log n) time each. The idea is to keep the oldest unprinted log from each source in this heap. After printing a value, we enqueue the next log from the same source. This ensures that at any given time we hold either zero (if the source is empty) or one log per source. We find the right source by storing a reference to it in the wrapper around each log entry in the heap. This requires some additional memory, but it's manageable since we store a reference, not a copy of the whole object.
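
The sketch below illustrates this idea. It is a simplified, hypothetical version rather than the repository's exact code: MinHeap is a small heap written inline, and source.pop() returning a log with a date (or false when the source is empty), together with printer.print()/printer.done(), are assumptions about the challenge's interfaces.

```js
// Minimal binary min-heap keyed by a comparator (hypothetical sketch).
class MinHeap {
  constructor(compare) {
    this.compare = compare;
    this.items = [];
  }
  size() {
    return this.items.length;
  }
  push(item) {
    this.items.push(item);
    let i = this.items.length - 1;
    // Sift the new item up until the heap property holds.
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (this.compare(this.items[i], this.items[parent]) >= 0) break;
      [this.items[i], this.items[parent]] = [this.items[parent], this.items[i]];
      i = parent;
    }
  }
  pop() {
    const top = this.items[0];
    const last = this.items.pop();
    if (this.items.length > 0) {
      this.items[0] = last;
      // Sift the moved item down to restore the heap property.
      let i = 0;
      for (;;) {
        const left = 2 * i + 1;
        const right = 2 * i + 2;
        let smallest = i;
        if (left < this.items.length && this.compare(this.items[left], this.items[smallest]) < 0) smallest = left;
        if (right < this.items.length && this.compare(this.items[right], this.items[smallest]) < 0) smallest = right;
        if (smallest === i) break;
        [this.items[i], this.items[smallest]] = [this.items[smallest], this.items[i]];
        i = smallest;
      }
    }
    return top;
  }
}

// Drain loop: the heap always holds at most one (oldest unprinted) log per
// source, wrapped together with a reference to that source.
function syncSortedMerge(logSources, printer) {
  const heap = new MinHeap((a, b) => a.log.date - b.log.date);

  // Seed the heap with the oldest log from every non-empty source.
  for (const source of logSources) {
    const log = source.pop();
    if (log) heap.push({ log, source });
  }

  // Repeatedly print the globally oldest log, then refill from its source.
  while (heap.size() > 0) {
    const { log, source } = heap.pop();
    printer.print(log);
    const next = source.pop();
    if (next) heap.push({ log: next, source });
  }
  printer.done();
}
```

Seeding the heap costs O(N log N) for N sources, and each of the L prints costs O(log N) for the pop plus O(log N) for the refill push.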

In the second task, where a source provides only an asynchronous way of reading logs, we can improve the solution by reading the initial logs more efficiently. We use Promise.all for parallel execution of the input operations. However, if we have millions of sources, working with all of them in parallel might cause resource starvation, depending on the type of source. To avoid this, we implement batch reading, where only a certain number of operations run in parallel, and as soon as one batch completes, the next one starts. This is achieved by combining a for loop with Promise.all(). In my solution the batch size was set to 5000, but this can be adjusted based on the type of logs. I assume the logs are homogeneous, so most read operations take a similar amount of time. If that isn't the case, we can improve the solution by starting a new read operation immediately after any operation finishes, rather than waiting for the entire batch to complete.
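
A rough sketch of that batching, reusing the same { log, source } wrappers and heap as above and assuming a hypothetical popAsync() that resolves to the next log or false:

```js
// Hedged sketch: seed the heap by reading initial logs in batches, so at
// most BATCH_SIZE popAsync() calls are in flight at any one time.
const BATCH_SIZE = 5000;

async function seedHeapInBatches(logSources, heap) {
  for (let i = 0; i < logSources.length; i += BATCH_SIZE) {
    const batch = logSources.slice(i, i + BATCH_SIZE);
    // Wait for the whole batch before starting the next one.
    const logs = await Promise.all(batch.map((source) => source.popAsync()));
    logs.forEach((log, j) => {
      if (log) heap.push({ log, source: batch[j] });
    });
  }
}
```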

For example, if 4999 operations take 0.1 seconds each and one takes 0.5 seconds, the next batch has to wait the full 0.5 seconds before it can start, which is inefficient. To implement this, we could use a package like p-limit or write our own concurrency limiter.
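
For illustration, a p-limit-based variant might look like the sketch below: pLimit(concurrency) returns a wrapper that queues the passed functions so that at most `concurrency` of them run at once, while the popAsync() and heap details remain the same assumptions as before.

```js
// Hedged sketch: cap concurrency with p-limit instead of fixed batches, so a
// new read starts as soon as any in-flight read finishes.
import pLimit from 'p-limit';

async function seedHeapWithLimit(logSources, heap, concurrency = 5000) {
  const limit = pLimit(concurrency);
  await Promise.all(
    logSources.map((source) =>
      limit(async () => {
        const log = await source.popAsync();
        if (log) heap.push({ log, source });
      })
    )
  );
}
```

With this approach one slow source no longer stalls an entire batch, since the limiter refills the pool of in-flight reads continuously.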

Unfortunately, asynchronous operations usually involve accessing external sources to retrieve data, which can take a long time, and the task requires knowing the next element of every source in order to choose the correct one to print. Combined with the memory constraints, this leaves little room for optimization. In the current implementation, after each print we must wait until the new asynchronous read completes before we can correctly decide which element to print next. There are some possible ways to slightly improve this, though they make certain assumptions about the data we're working with, so they are less general:

Using Prefetching (see the sketch after this list):

  • Prefetch Logs: After pushing a log to the heap, immediately start fetching the next log from that source asynchronously and store the promise.
  • Handle Prefetched Logs: When we dequeue a log from the heap, immediately check whether the next log (from the prefetch promise) is available. If the promise has already resolved, push the new log into the heap right away.
  • Handle Awaited Logs: If the prefetch promise hasn't resolved yet, we can push the current log and continue processing other logs until the promise resolves.
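
A simplified sketch of this prefetching (it always awaits the prefetch promise instead of skipping ahead to other sources, and relies on the same assumed popAsync(), printer, and { log, source } wrapper shapes as the earlier sketches):

```js
// Hedged sketch: every heap entry carries a promise for the *next* log from
// its source, started as soon as the entry is pushed.
async function drainWithPrefetch(logSources, heap, printer) {
  // Seed: fetch the first log and immediately start prefetching the second.
  // (This seeding step could itself be batched or limited as shown earlier.)
  await Promise.all(
    logSources.map(async (source) => {
      const log = await source.popAsync();
      if (log) heap.push({ log, source, nextPromise: source.popAsync() });
    })
  );

  while (heap.size() > 0) {
    const { log, source, nextPromise } = heap.pop();
    printer.print(log);

    // By the time we pop this entry, its prefetch has often already resolved,
    // so this await is usually cheap.
    const next = await nextPromise;
    if (next) heap.push({ log: next, source, nextPromise: source.popAsync() });
  }
  printer.done();
}
```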

If we do some simple math, we can come up with the following results:

  • Async Fetch Time (T_fetch): The average time it takes to fetch the next log from a source.
  • Synchronous Print Time (T_print): The time it takes to print a log (which we'll assume is very small compared to T_fetch).
  • Number of Sources (N): The number of sources we're dealing with.
  • Number of Logs (L): The total number of logs we need to print.

Without Prefetching:

  • Every time we print a log, we wait for the next log to be fetched asynchronously.
  • Total time = L * T_fetch

With Prefetching:

  • Ideally the next log is already fetched by the time we need to print it, so we don’t wait as often.
  • The time saved per log depends on how often the prefetching completes before we need the log.

Best Case:

  • If prefetching always completes before we need the log, the only waiting time is for the initial fetches from all sources.
  • Total time = N * T_fetch + (L - N) * T_print (approximately N * T_fetch)

Worst Case:

  • If prefetching rarely helps (e.g., because fetching is slow or logs are tightly packed in time), the savings are minimal.
  • Total time ≈ L * T_fetch

Average Case:

  • Prefetching generally reduces the number of times we have to wait for T_fetch, especially if T_print is small.
  • If we assume prefetching successfully overlaps with fetching about 50% of the time, the average total time might be closer to:
    • Total time ≈ L * (0.5 * T_fetch + 0.5 * T_print)
    • Since T_print is very small, this simplifies to:
    • Total time ≈ L * 0.5 * T_fetch

Estimated Savings:

  • Without Prefetching: L * T_fetch
  • With Prefetching: L * 0.5 * T_fetch
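
To make this concrete with purely illustrative numbers: with N = 10,000 sources, L = 1,000,000 logs, T_fetch = 10 ms, and a negligible T_print, the no-prefetching estimate is 1,000,000 * 10 ms ≈ 10,000 s, the best case with prefetching is roughly 10,000 * 10 ms ≈ 100 s, and the 50% average case comes to about 5,000 s.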

However, this approach requires at least twice as much memory to store the promise and then the value for each prefetched item. Therefore, we need to choose wisely between these solutions based on our specific situation.