replicate/replicate-python

Feature request: custom autoscaling parameters

Varun2101 opened this issue · 1 comment

Hi, I'm working on deploying a private language model to production through Replicate. Requests come in sporadically, so provisioning always-on servers is not feasible for me, but I would like requests to be handled at my max concurrency for increased speed. Currently each instance takes 2-2.5 minutes to cold start and terminates after 1 minute of idle time, which can lead to frustrating delays that are longer than necessary.
Would it be possible to add either of these features?

  1. API to force boot n instances together: this would reduce the spread of boot times and give more control to start the boot process early, before requests actually need to be processed.
  2. Custom idle time limits: this needs to be at least as long as the boot time. I wouldn't mind having to pay for some extra uptime if it meant I don't have stop-start behaviour in the middle of a chunk being processed.

Currently I'm attempting a workaround for option 1 by burst-pinging the model early with the default input n times, but the short idle time means there's still a good chance that the instances get terminated before I send any actual requests. Let me know if you have a better solution.
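For reference, the burst-ping workaround could look roughly like the sketch below. It is only an illustration under assumptions: `burst_ping` and `keep_warm` are hypothetical helper names, `send` is any callable that issues one request (the commented usage assumes `replicate.run` with a placeholder model reference), and whether n concurrent requests actually boot n instances depends on Replicate's autoscaler. Repeating the burst at an interval shorter than the roughly 1 minute idle limit is one way to keep instances from terminating, at the cost of paying for the extra pings.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def burst_ping(send: Callable[[], Any], n: int) -> List[Any]:
    """Call `send` n times concurrently; return results in submission order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(send) for _ in range(n)]
        return [f.result() for f in futures]


def keep_warm(
    send: Callable[[], Any],
    n: int,
    interval_s: float,
    stop: threading.Event,
) -> int:
    """Repeat the burst every `interval_s` seconds until `stop` is set.

    Returns the number of bursts that were sent.
    """
    rounds = 0
    while not stop.is_set():
        burst_ping(send, n)
        rounds += 1
        # Sleeps for interval_s, but wakes immediately if stop is set.
        stop.wait(interval_s)
    return rounds


# Hypothetical usage (assumes `pip install replicate`, REPLICATE_API_TOKEN set,
# and a placeholder model reference and input):
#   import replicate
#   send = lambda: replicate.run("owner/model:version", input={"prompt": "ping"})
#   stop = threading.Event()
#   threading.Thread(target=keep_warm, args=(send, 4, 45.0, stop), daemon=True).start()
#   ... send real requests, then call stop.set() ...
```

Passing `send` in as a callable keeps the timing logic separate from the API call, so the same loop works whether the ping goes to a model version or to a deployment.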
Thanks!

Hi, @Varun2101. Thanks for sharing this feedback.

To your second point, you can get more control over the behavior of a model on Replicate by creating a deployment. I don't believe we provide a way to configure the timing for autoscaling a deployment, but that's something we've discussed.
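As a rough sketch of what sending requests through a deployment might look like with the Python client: the helper name `create_deployment_predictions` and the deployment name in the commented usage are placeholders, and the example assumes the client's `deployments.get(...)` and `deployment.predictions.create(input=...)` calls. It does not change autoscaling timing; it only routes predictions to the deployment.

```python
from typing import Any, Dict, List


def create_deployment_predictions(
    deployment: Any, inputs: List[Dict[str, Any]]
) -> List[Any]:
    """Create one prediction per input against an already-fetched deployment."""
    return [deployment.predictions.create(input=item) for item in inputs]


# Hypothetical usage (assumes `pip install replicate`, REPLICATE_API_TOKEN set,
# and a deployment that already exists under the placeholder name):
#   import replicate
#   deployment = replicate.deployments.get("your-username/your-deployment")
#   predictions = create_deployment_predictions(deployment, [{"prompt": "hello"}])
#   for prediction in predictions:
#       prediction.wait()
```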