Autoscaling limits


With Sesterce Cloud's AI inference feature, you can define autoscaling limits to manage how your model scales while keeping your expenses under control.

To define your autoscaling limits, set the following two elements (see the sketch after this list):

  1. The minimum and maximum number of pods to be mobilized

  2. The triggers that automatically scale your resources
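As an illustration, these two elements can be thought of as a single configuration object. This is a minimal sketch with hypothetical field names, not the actual Sesterce Cloud API schema; refer to the API Reference for the real payload format.

```python
# Hypothetical autoscaling configuration: field names are illustrative
# and do not reflect the actual Sesterce Cloud API schema.
autoscaling = {
    "min_pods": 1,   # at least one pod stays up to serve requests
    "max_pods": 5,   # hard ceiling, which also caps your spending
    "triggers": [
        # scale up when HTTP request load reaches 80% of capacity
        {"type": "http_requests", "threshold_percent": 80},
    ],
}
```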

How to define autoscaling limits?

From the advanced configuration settings, define the minimum and maximum number of pods to be mobilized to run inference on your model.

If you want to deploy your model in several regions, you can choose to set the same limits in each of them, or set different limits for each region.

This is particularly useful if you expect a peak of users on your endpoint in a specific region.
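For example, per-region limits could be described as follows. The region names and field names here are purely illustrative, assuming two regions where one expects heavier traffic:

```python
# Illustrative per-region limits: a higher ceiling in the region where
# a peak of users is expected. Names are hypothetical, not API values.
regional_limits = {
    "eu-west": {"min_pods": 1, "max_pods": 3},
    "us-east": {"min_pods": 2, "max_pods": 10},  # expected peak region
}
```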

What is the Cooldown Period?

The cooldown period is the interval (in seconds) between trigger executions. This setting helps avoid frequent and unnecessary scaling adjustments. You can select a value between 1 and 3600 seconds.
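To make the behavior concrete, here is a minimal sketch of how a cooldown gate works. This is illustrative logic only, not Sesterce Cloud's implementation:

```python
import time

COOLDOWN_SECONDS = 300  # any value between 1 and 3600 is accepted

_last_scaling_action = 0.0

def cooldown_elapsed() -> bool:
    """Allow a trigger to execute only if the cooldown period has passed."""
    global _last_scaling_action
    now = time.monotonic()
    if now - _last_scaling_action >= COOLDOWN_SECONDS:
        _last_scaling_action = now
        return True
    return False  # still cooling down: skip this scaling adjustment
```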

How to set autoscaling triggers?

From the dedicated form, choose among the 3 available trigger types:

  • GPU memory utilization

  • RAM usage

  • HTTP requests

For each of these trigger types, you can set a threshold value as a percentage. When this value is reached, the number of pods increases within your limits, then decreases back to the initial level when the value drops, ensuring stable operation and high performance.

You can configure up to 3 thresholds, one for each of the 3 trigger types.

The minimum threshold is 1% of the resource capacity and the maximum is 100%. Only the HTTP requests trigger can scale pods to and from 0.
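The scaling behavior described above can be sketched as a toy decision function. This is a simplified model, assuming one pod is added or removed per trigger execution, and is not how Sesterce Cloud computes pod counts internally:

```python
def desired_pods(current: int, usage_percent: float, threshold_percent: float,
                 min_pods: int, max_pods: int, is_http_trigger: bool) -> int:
    """Toy model of threshold-based scaling within the configured limits."""
    # Only the HTTP requests trigger may scale all the way down to 0 pods.
    floor = 0 if is_http_trigger else max(min_pods, 1)
    if usage_percent >= threshold_percent:
        return min(current + 1, max_pods)  # scale up, never past the maximum
    return max(current - 1, floor)         # scale back down toward the floor

# Example: at 85% GPU memory utilization with an 80% threshold,
# a deployment of 2 pods (limits 1..5) grows to 3 pods.
print(desired_pods(2, 85.0, 80.0, 1, 5, is_http_trigger=False))  # -> 3
```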
