fluencelabs/nox

Warm services watchdog

Closed this issue · 1 comment

folex commented

Motivation

The number of services a node can host can be scaled by dynamically unloading (pausing) services according to some criteria and then loading them again (waking them up) on demand.

Implementation

Pausing is basically removing the service without removing its disk state.

Unpausing starts the service the same way a new service is started.

  • Paused service queue: paused services must retain calls sent to them before they're woken
  • get_interface should work but not wake up the service
  • get service state: paused / awake / non-existent
  • some services must be marked as unpausable (e.g., builtins, services that are used by scheduled scripts?)
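
The bullets above can be sketched roughly as follows. This is a minimal illustration and not nox's actual code: `Registry`, `Entry`, `PauseError`, `enqueue_if_paused` and the `can_pause` flag are hypothetical names, calls are modelled as plain strings, and it assumes that queued calls are handed back in arrival order when the service is woken.

```rust
use std::collections::{HashMap, VecDeque};

/// The three states named in the issue.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServiceState {
    Paused,
    Awake,
    NonExistent,
}

/// Hypothetical error type for a refused pause request.
#[derive(Debug, PartialEq, Eq)]
pub enum PauseError {
    NoSuchService,
    NotPausable,
}

/// Hypothetical per-service record; a real node tracks much more.
struct Entry {
    paused: bool,
    can_pause: bool,           // false for e.g. builtins
    pending: VecDeque<String>, // calls received while paused
}

#[derive(Default)]
pub struct Registry {
    entries: HashMap<String, Entry>,
}

impl Registry {
    /// Register a new, awake service.
    pub fn add(&mut self, id: &str, can_pause: bool) {
        let entry = Entry { paused: false, can_pause, pending: VecDeque::new() };
        self.entries.insert(id.to_string(), entry);
    }

    /// Report the state without changing it.
    pub fn state(&self, id: &str) -> ServiceState {
        match self.entries.get(id) {
            None => ServiceState::NonExistent,
            Some(e) if e.paused => ServiceState::Paused,
            Some(_) => ServiceState::Awake,
        }
    }

    /// Pause a service unless it is missing or marked as not pausable.
    pub fn pause(&mut self, id: &str) -> Result<(), PauseError> {
        let e = self.entries.get_mut(id).ok_or(PauseError::NoSuchService)?;
        if !e.can_pause {
            return Err(PauseError::NotPausable);
        }
        e.paused = true;
        Ok(())
    }

    /// Queue a call if the service is paused; returns true when it was queued.
    pub fn enqueue_if_paused(&mut self, id: &str, call: &str) -> bool {
        match self.entries.get_mut(id) {
            Some(e) if e.paused => {
                e.pending.push_back(call.to_string());
                true
            }
            _ => false,
        }
    }

    /// Wake a service and return the calls retained while it was paused.
    pub fn wake(&mut self, id: &str) -> Vec<String> {
        match self.entries.get_mut(id) {
            Some(e) => {
                e.paused = false;
                e.pending.drain(..).collect()
            }
            None => Vec::new(),
        }
    }
}
```

Keeping the queue next to the pause flag means a wake-up can drain it in one step; whether a call to a paused service should itself trigger the wake-up is left open here.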

Notes

  • Services must already be ready for a node reboot, so they must also be ready to be paused and unpaused.

TODOs

  • Explain in documentation that services should be ready to be restarted
  • We need to measure how long it takes to unpause/start the service to understand if it's usable
    • Most likely we'll need to cache WASM code compiled by Cranelift
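
For the measurement TODO, a small timing helper along these lines could be a starting point. It is only a sketch: `measure_start` is a hypothetical name, and the closure stands in for whatever actually starts or unpauses a service, which is not shown here.

```rust
use std::time::{Duration, Instant};

/// Runs `start` `runs` times and returns (min, mean, max) wall-clock
/// durations, or `None` if `runs` is zero.
pub fn measure_start<F: FnMut()>(
    runs: u32,
    mut start: F,
) -> Option<(Duration, Duration, Duration)> {
    if runs == 0 {
        return None;
    }
    let mut min = Duration::MAX;
    let mut max = Duration::ZERO;
    let mut total = Duration::ZERO;
    for _ in 0..runs {
        let begin = Instant::now();
        start(); // placeholder for the real start/unpause routine
        let elapsed = begin.elapsed();
        min = min.min(elapsed);
        max = max.max(elapsed);
        total += elapsed;
    }
    Some((min, total / runs, max))
}
```

Comparing the first run against later runs would give a rough idea of how much a compiled-code cache might save, assuming the first run is the one that pays the compilation cost.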

It sounds like you are describing a system for dynamically pausing and unpausing services in order to improve the scalability of services-per-node. In this system, services can be paused by removing them without removing their disk state, and then started again on demand. The paused service queue would retain calls sent to the service before it is unpaused, and it would be possible to get the state of a service (paused, awake, or non-existent). Some services may be marked as unpausable, such as built-in services or services used by scheduled scripts.

To implement this system, you may want to consider the following steps:

  1. Implement a way to pause and unpause services by removing and starting them without removing their disk state.
  2. Implement a paused service queue to retain calls made to paused services.
  3. Implement a way to get the interface of a service without waking it up.
  4. Implement a way to get the state of a service (paused, awake, or non-existent).
  5. Mark some services as unpausable if necessary.
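
One possible way to approach step 3 is to store the interface when the service is created and answer `get_interface` from that copy. The sketch below assumes this caching approach; `Services`, `Interface`, `create` and `is_awake` are hypothetical names, and an interface is reduced to a list of function names.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a service's interface description.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Interface {
    pub functions: Vec<String>,
}

#[derive(Default)]
pub struct Services {
    // Interfaces are stored at creation time and survive pausing.
    interfaces: HashMap<String, Interface>,
    // IDs of services that are currently loaded.
    awake: HashSet<String>,
}

impl Services {
    /// Create an awake service and remember its interface.
    pub fn create(&mut self, id: &str, functions: &[&str]) {
        let iface = Interface {
            functions: functions.iter().map(|f| f.to_string()).collect(),
        };
        self.interfaces.insert(id.to_string(), iface);
        self.awake.insert(id.to_string());
    }

    /// Unload the service but keep its stored interface.
    pub fn pause(&mut self, id: &str) {
        self.awake.remove(id);
    }

    pub fn is_awake(&self, id: &str) -> bool {
        self.awake.contains(id)
    }

    /// Takes `&self`, so it cannot change which services are awake.
    pub fn get_interface(&self, id: &str) -> Option<&Interface> {
        self.interfaces.get(id)
    }
}
```

Because `get_interface` only borrows the registry immutably, the compiler rules out an accidental wake-up on that path.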

Services should be ready to be restarted after a node reboot, so they should be designed to handle being paused and unpaused. Measuring how long it takes to unpause and start a service will show whether this approach is viable. Caching compiled WASM code may also be needed to improve performance.

In the documentation for your system, you should explain that services should be ready to be restarted in order to be compatible with the pausing and unpausing system. This will help users understand the requirements for using this feature.

use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct Service {
    // Other fields for the service go here...

    // Flag to indicate whether the service is paused
    paused: bool,
}

struct PausedServiceQueue {
    // Map of service ID to the calls made to the service while it was paused
    queue: HashMap<String, Vec<String>>,
}

impl PausedServiceQueue {
    fn new() -> Self {
        PausedServiceQueue {
            queue: HashMap::new(),
        }
    }

    fn add_call(&mut self, service_id: String, call: String) {
        self.queue.entry(service_id).or_default().push(call);
    }

    fn remove_calls(&mut self, service_id: &str) -> Vec<String> {
        self.queue.remove(service_id).unwrap_or_default()
    }
}

struct ServiceManager {
    services: HashMap<String, Arc<Mutex<Service>>>,
    paused_services: Arc<Mutex<PausedServiceQueue>>,
}

impl ServiceManager {
    fn new() -> Self {
        ServiceManager {
            services: HashMap::new(),
            paused_services: Arc::new(Mutex::new(PausedServiceQueue::new())),
        }
    }

    fn add_service(&mut self, service_id: String, service: Service) {
        self.services.insert(service_id, Arc::new(Mutex::new(service)));
    }

    fn pause_service(&mut self, service_id: &str) {
        // Mark the service as paused; unknown IDs are ignored instead of panicking.
        if let Some(service) = self.services.get(service_id) {
            service.lock().unwrap().paused = true;
        }
    }

    fn unpause_service(&mut self, service_id: &str) {
        let service = self.services.get_mut(service_id