apache/accumulo

Consider/experiment with avoiding scan server refs by delaying GC

Opened this issue · 1 comments

Related to #4529, which proposed moving scan server file refs to their own table.

An alternative to consider would be to avoid the need to create scan server refs in the first place. This could be done by not deleting GC candidates immediately, instead delaying deletion long enough for existing scan servers to finish using them (in other words, delaying at least as long as it takes for the scan server file ref cache to expire).
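As a rough illustration of the idea, here is a minimal sketch in Java. All names (the class, the TTL constant, the candidate map) are hypothetical, not Accumulo APIs; it only shows the core invariant that the delete delay must be at least the scan server ref cache expiration:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: hold GC candidates for at least the scan server file ref cache
// TTL before deleting, so scan servers doing short scans never need to
// persist refs. All names here are hypothetical.
class DelayedGcSketch {
  // Hypothetical scan server file-ref cache expiration.
  static final Duration SCAN_SERVER_CACHE_TTL = Duration.ofMinutes(5);
  // Delay must be at least the cache TTL, plus a safety margin.
  static final Duration DELETE_DELAY = SCAN_SERVER_CACHE_TTL.plus(Duration.ofMinutes(1));

  // Candidate file -> time it became a GC candidate.
  final Map<String, Instant> candidates = new ConcurrentHashMap<>();

  void addCandidate(String file, Instant when) {
    candidates.putIfAbsent(file, when);
  }

  // Files whose delay has elapsed are eligible for actual deletion.
  List<String> eligibleForDelete(Instant now) {
    List<String> eligible = new ArrayList<>();
    candidates.forEach((file, when) -> {
      if (Duration.between(when, now).compareTo(DELETE_DELAY) >= 0) {
        eligible.add(file);
      }
    });
    return eligible;
  }
}
```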

One advantage of this would be reducing use of the metadata table and removing the need to constantly update the file refs based on the current scan servers.

Even with delayed GC, some refs may still be needed for long-running scans that are not yet complete.

So, some challenges include:

  • tracking how long files need to wait before they are deleted (if the GC keeps track in memory, it could use a lot of memory; if it's persisted, then we need a way to represent the recorded time, and global time tracking in a cluster can be tricky)
  • determining when a scan server needs to store a ref for a long-running scan
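One way to sidestep the wall-clock concerns in the first challenge (this is just an illustration, not something proposed above) would be to persist a monotonically increasing "GC generation" counter with each candidate instead of a timestamp, since a single GC process can bump the counter without worrying about clock skew across the cluster. A hypothetical sketch:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: tag each candidate with the GC generation at which it was
// recorded, rather than a wall-clock time. Because only the one GC
// process bumps the generation, cluster clock skew does not matter.
// All names are hypothetical.
class GenerationDelayedGc {
  long currentGeneration = 0;        // bumped once per GC cycle
  final long delayGenerations;       // cycles to wait before delete
  final Map<String, Long> candidateGen = new HashMap<>();

  GenerationDelayedGc(long delayGenerations) {
    this.delayGenerations = delayGenerations;
  }

  void startCycle() {
    currentGeneration++;
  }

  void addCandidate(String file) {
    candidateGen.putIfAbsent(file, currentGeneration);
  }

  // Eligible once enough GC cycles have passed since it became a candidate.
  boolean eligible(String file) {
    Long gen = candidateGen.get(file);
    return gen != null && currentGeneration - gen >= delayGenerations;
  }
}
```

The trade-off is that a generation only approximates elapsed time, so GC cycle length would need a known lower bound for scan servers to reason about the delay.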

I'm not certain this is a better idea than #4528, but there may be advantages that are worth pursuing instead of that (or maybe in addition to that).

Chatted w/ @cshannon about this. One challenge we identified is large compaction operations. For example, if a large number of external compaction processes are temporarily stood up and then a large compaction operation is initiated, it may start to generate large numbers of files in a short time that should be deleted. In the worst case, if the GC delays deleting files, the compaction operation could fill up DFS. This implies that the delay may need to adjust depending on DFS free space and the number of files whose deletion is delayed. Dynamically adjusting the delay makes it harder for the scan servers to reason about it.
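One possible shape for such an adjustment (purely illustrative; the thresholds and names are invented) is to keep the full delay while DFS free space is healthy and shrink it linearly toward zero as free space approaches a critical level:

```java
import java.time.Duration;

// Sketch: scale the delete delay down as DFS free space shrinks, so a
// burst of compaction output cannot fill the filesystem while the GC
// waits. Thresholds are illustrative, not from Accumulo.
class AdaptiveDelay {
  static final Duration MAX_DELAY = Duration.ofMinutes(10);
  static final double LOW_WATER = 0.20;  // below 20% free, start shrinking
  static final double CRITICAL = 0.05;   // below 5% free, delete immediately

  // freeFraction is free DFS space as a fraction of capacity, in [0, 1].
  static Duration delayFor(double freeFraction) {
    if (freeFraction >= LOW_WATER) {
      return MAX_DELAY;
    }
    if (freeFraction <= CRITICAL) {
      return Duration.ZERO;
    }
    // Linear interpolation between the two thresholds.
    double scale = (freeFraction - CRITICAL) / (LOW_WATER - CRITICAL);
    return Duration.ofMillis((long) (MAX_DELAY.toMillis() * scale));
  }
}
```

Note this illustrates the downside mentioned above: a scan server can no longer assume a fixed minimum delay once free space dips below the low-water mark.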

Another thing discussed was race conditions. This change would conceptually create another set of files for the GC to track.

  • File references (existing set)
  • GC candidates (existing set)
  • Delayed delete files (new set)
  • Deleting files (existing set)

When the GC moves a file from the new delayed_delete set to the deleting_files set, it must be done in a way that considers what the scan servers are using and avoids race conditions. We talked through a few possible approaches, but each had race conditions, so we still need to figure out a solution for that.
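To make the discussion concrete, the four sets can be viewed as states in a file's GC lifecycle. The sketch below (hypothetical names, not Accumulo code) shows the new transition and, in the comment, exactly where the race lives: scan server usage can change between the check and the move.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

// Sketch of the file lifecycle with the new delayed-delete state inserted
// between candidate and deleting. All names are hypothetical.
class GcLifecycle {
  enum State { FILE_REF, CANDIDATE, DELAYED_DELETE, DELETING }

  static final Map<State, Set<State>> ALLOWED = new EnumMap<>(Map.of(
      State.FILE_REF, EnumSet.of(State.CANDIDATE),
      State.CANDIDATE, EnumSet.of(State.DELAYED_DELETE),
      State.DELAYED_DELETE, EnumSet.of(State.DELETING),
      State.DELETING, EnumSet.noneOf(State.class)));

  // Returns the next state, or stays put if a scan server still uses the
  // file. NOTE: without extra coordination, inUse can change right after
  // this check -- that is the unresolved race described above.
  static State tryAdvance(State current, String file, Predicate<String> inUse) {
    if (current == State.DELAYED_DELETE && inUse.test(file)) {
      return current; // keep delaying while a scan server uses the file
    }
    Set<State> next = ALLOWED.get(current);
    return next.isEmpty() ? current : next.iterator().next();
  }
}
```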

This offers a potential speed-up for scan servers by removing the need to write scan server refs to the metadata table. The scan server will still have to read tablet files from the metadata table, which is an extra cost relative to tablet servers. That suggests another potential way to lower scan time: some way to preload tablets on select scan servers, which may be something to explore in tandem.