Remove document cache after a period of inactivity
Describe the problem to be solved
Currently, in a multi-worker architecture, the /persist folder fills up with every document that has been loaded. This cache is only cleared when the doc worker restarts.
Describe the solution you would like
EDIT 2025-04-24: see this comment for the solution we would now prefer.
It would be neat if, once the document has been closed and after a further period of inactivity, the local copy in /persist were deleted.
If there is activity in the meantime (i.e. after the document has been closed), the document is loaded again and the timer for wiping the local cache is cancelled.
The "Housekeeper" might be a natural place for this code to live:
grist-core/app/gen-server/lib/Housekeeper.ts (lines 29 to 41 in ec0b885)
It may be simpler than other housekeeping tasks, since it is scoped to an individual doc worker.
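For illustration, here is a minimal sketch of what such a per-worker housekeeping task could look like. The names PERSIST_DIR and MAX_IDLE_MS, and the isDocOpen callback, are assumptions for this sketch, not actual grist-core identifiers.

```ts
import * as fs from 'fs/promises';
import * as path from 'path';

// Assumptions for this sketch (not real grist-core names/values).
const PERSIST_DIR = '/persist';
const MAX_IDLE_MS = 24 * 60 * 60 * 1000;  // e.g. one day of inactivity

// Deletes cached documents that are not open and have not been touched
// for MAX_IDLE_MS. Meant to be run periodically by a Housekeeper-style
// interval on each doc worker.
async function cleanStalePersistedDocs(isDocOpen: (docId: string) => boolean): Promise<void> {
  const now = Date.now();
  for (const name of await fs.readdir(PERSIST_DIR)) {
    if (!name.endsWith('.grist')) { continue; }
    const docId = path.basename(name, '.grist');
    if (isDocOpen(docId)) { continue; }     // never remove an open document
    const fullPath = path.join(PERSIST_DIR, name);
    const stat = await fs.stat(fullPath);
    if (now - stat.mtimeMs > MAX_IDLE_MS) {
      await fs.unlink(fullPath);            // drop the local copy; S3 still has it
    }
  }
}
```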
I thought of this place as well, but I would rather add a timer in ActiveDoc, for these reasons:
- there already exists an inactivity timer for closing the document; I thought of adding a second one that would start once the document has been closed for inactivity;
- instead of having each doc worker inspect its documents and wipe the unused cache, the document would remove its own cache, just as it closes itself.
I hope that makes sense; otherwise, I would be glad to hear your feedback :).
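To make the two-timer idea concrete, here is a rough, hypothetical sketch (all names are illustrative, not grist-core's actual ActiveDoc API). Note the caveat raised in the next comment: the real ActiveDoc is unregistered on shutdown, so this would need a new "closed but cached" state.

```ts
// Illustrative constants, not real grist-core values.
const CLOSE_IDLE_MS = 30 * 1000;        // existing idea: close after inactivity
const WIPE_IDLE_MS = 60 * 60 * 1000;    // new: wipe the local cache some time later

class SketchDoc {
  private _closeTimer?: NodeJS.Timeout;
  private _wipeTimer?: NodeJS.Timeout;

  // Any activity cancels both timers and reschedules the close.
  public onActivity(): void {
    if (this._closeTimer) { clearTimeout(this._closeTimer); }
    if (this._wipeTimer) { clearTimeout(this._wipeTimer); this._wipeTimer = undefined; }
    this._closeTimer = setTimeout(() => this._close(), CLOSE_IDLE_MS);
  }

  private _close(): void {
    // ... flush to remote storage, shut down the sandbox to reclaim RAM ...
    // New second timer: only wipe the /persist copy after further inactivity.
    this._wipeTimer = setTimeout(() => this._wipeLocalCache(), WIPE_IDLE_MS);
  }

  private _wipeLocalCache(): void {
    // ... delete this document's file under /persist ...
  }
}
```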
I would rather add a timer in ActiveDoc
Hmm, you'll need to work carefully. When a document is unused for some time, its ActiveDoc will currently shut down to recover RAM, which means that the ActiveDoc no longer exists. That would be a bit early to wipe the file from disk, in my opinion, since it would be common for a document to be visited, say, once a day. You do want to reclaim RAM in that scenario, but whether you want to reclaim disk space is arguable, given that recovery from S3 can take several seconds (depending).
It is possible to add a new state to ActiveDoc, but it will need a good mental model to avoid breaking interactions with DocManager and other classes. Definitely doable but be prepared for adventures.
The bulk of the RAM is held by the sandbox associated with the ActiveDoc, so holding ActiveDocs around after their sandboxes shut down is certainly not ruled out; it's just a significant change.
Oh, I see: a shutdown unregisters the ActiveDoc from the DocManager, which means, if I am correct, that reopening the document after that would create a new instance of ActiveDoc.
I thought the ActiveDoc instance would be reused.
Now using Housekeeper makes total sense to me. Thank you @paulfitz!
We have discussed internally at GristGouv how to invalidate the doc worker cache, and we would like to propose an alternative approach: delete the cache immediately after the document has been closed.
Implications
A document remains open:
- while there is an active WebSocket connection for this document;
- for 30 seconds in production (5 seconds otherwise) after the last action on this document through the API, or after the last WebSocket connection has been closed.
Once the document is closed and has been pushed to remote storage, the doc worker would delete its local copy and release the document's assignment to itself.
It also means that once the doc worker has stopped working on a document, the S3 storage becomes the only source of truth. When a user wants to reopen it, it should take at worst a few seconds to download the document onto a doc worker (so probably not a big drawback).
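Here is a minimal sketch of the proposed flow, with hypothetical helper names (the real equivalents would live around HostedStorageManager and the doc worker assignment logic):

```ts
// Hypothetical helpers; not grist-core's actual API.
declare function pushToRemoteStorage(docId: string): Promise<void>;
declare function deleteLocalCopy(docId: string): Promise<void>;     // remove the /persist copy
declare function releaseAssignment(docId: string): Promise<void>;   // unpin the doc from this worker

// Runs once the document is closed (no WebSocket left, idle timeout elapsed).
async function onDocumentClosed(docId: string): Promise<void> {
  await pushToRemoteStorage(docId);   // S3 becomes the only source of truth
  await deleteLocalCopy(docId);       // no lingering disk usage
  await releaseAssignment(docId);     // any worker may claim it on the next open
}
```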
Motivations
By removing the assignment of the document to the doc worker right after the document is closed, we see the following benefits:
- it's much simpler to implement, as we don't have to manage a timer (especially a timer that would trigger another timer), and it's easier to review and test;
- by releasing the assignment, we reduce the risk of overloading other doc workers when a doc worker crashes (what we call the domino effect at GristGouv);
- we don't have to create a new API for deleting the cache: a call to /force-reload does the trick and keeps its meaning (it forces a reload, notably by fetching the document from S3), so we would not need the endpoint proposed in this PR;
- a manual call to /force-reload would also reassign the document to another worker, which can help when a doc worker is overloaded;
- it also reduces lingering disk usage on the doc worker, as closed documents are removed;
- it also allows simplifying the document-opening logic (HostedStorageManager._claimDocument) by removing the check for a local copy; see the sketch after this list. That check is relatively expensive, by the way: it currently makes a backup of the found copy to compute a hash, in order to detect a mismatch with the S3 version and avoid an S3 download. It's possible that doing all of that isn't cheaper on average than unconditionally downloading from S3.
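To illustrate that simplification, a sketch with hypothetical helper names (the real logic lives in HostedStorageManager._claimDocument):

```ts
// Hypothetical helpers for illustration only.
declare function localCopyExists(docId: string): Promise<boolean>;
declare function localHashMatchesS3(docId: string): Promise<boolean>; // backup + hash: relatively expensive
declare function downloadFromS3(docId: string): Promise<void>;

// Today (roughly): try to reuse the local copy, fall back to S3 on mismatch.
async function claimDocumentCurrent(docId: string): Promise<void> {
  if (await localCopyExists(docId) && await localHashMatchesS3(docId)) { return; }
  await downloadFromS3(docId);
}

// With delete-on-close, no local copy can exist for a closed document,
// so the claim reduces to an unconditional download.
async function claimDocumentProposed(docId: string): Promise<void> {
  await downloadFromS3(docId);
}
```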
What do you think? :)
@paulfitz @dsagal @georgegevoian any point of view about this strategy?
With GRIST_EXPERIMENTAL_WORKER_ASSIGNMENT set, it would now be advantageous for existing assignments to be re-assigned periodically. This sounds like a simple way to achieve that, so that's a plus.
The extra latency when opening a document at rest is unfortunate. A worst-case scenario here would be any sort of automated API activity that occurs at intervals just over the 30-second timeout. Adding a few seconds to each request (in the extreme case) wouldn't be great.
It might make sense to make the timeouts more conservative if caching is removed entirely (say, on the order of minutes instead of seconds, to avoid outlier cases like semi-frequent API requests).
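For instance (illustrative only; GRIST_DOC_CLOSE_TIMEOUT_MS is a hypothetical setting, not an existing one):

```ts
// Sketch: make the close timeout configurable and default it to minutes
// rather than seconds when the local cache is dropped on close.
const DOC_CLOSE_TIMEOUT_MS =
  parseInt(process.env.GRIST_DOC_CLOSE_TIMEOUT_MS || '', 10) ||
  (process.env.NODE_ENV === 'production'
    ? 5 * 60 * 1000   // e.g. five minutes in production
    : 5 * 1000);      // keep tests and dev fast
```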