Autoscaling limits
With Sesterce Cloud's AI inference feature, you can define autoscaling limits to manage the scale of your model while keeping your expenses under control.
To define your autoscaling limits, you set two main elements:
The minimum and maximum number of pods to be mobilized
The triggers that automatically scale your resources
From the dedicated view, define the minimum and maximum number of pods to be mobilized to run inference for your model.
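As an illustration, here is a minimal Python sketch of what such a configuration could look like as a data structure; the field names (min_pods, max_pods, triggers) are illustrative assumptions, not Sesterce Cloud's actual API schema:

```python
# Hypothetical autoscaling configuration expressed as a plain Python
# dictionary. Field names are illustrative assumptions, not Sesterce
# Cloud's actual schema.
autoscaling_limits = {
    "min_pods": 1,   # smallest number of pods kept running
    "max_pods": 8,   # hard ceiling on scale-out
    "triggers": [
        # scale out when GPU memory utilization crosses 80%
        {"type": "gpu_memory_utilization", "threshold_percent": 80},
    ],
}
print(autoscaling_limits)
```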
If you want to deploy your model in several regions, you can either set the same limits in each of them or set distinct limits per region.
This is particularly useful if you expect a peak of users on your endpoint in a specific region.
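For example, per-region limits could be modeled as a shared default plus an override for the region where a peak is expected; the region and field names below are hypothetical, for illustration only:

```python
# Hypothetical per-region limits: a shared default plus an override for
# a region where a usage peak is expected. Region and field names are
# illustrative assumptions.
regional_limits = {
    "default": {"min_pods": 1, "max_pods": 4},
    "overrides": {
        # expected traffic peak in this region, so allow more pods
        "eu-west": {"min_pods": 2, "max_pods": 12},
    },
}

def limits_for(region: str) -> dict:
    """Return the limits for a region, falling back to the default."""
    return regional_limits["overrides"].get(region, regional_limits["default"])

print(limits_for("eu-west"))   # {'min_pods': 2, 'max_pods': 12}
print(limits_for("us-east"))   # {'min_pods': 1, 'max_pods': 4}
```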
The cooldown period is the interval (in seconds) between trigger executions. This setting helps avoid frequent and unnecessary scaling adjustments. You can select a value between 1 and 3600 seconds.
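As a sketch, the allowed range could be enforced like this in Python; the helper is hypothetical, not part of Sesterce Cloud:

```python
def validate_cooldown(seconds: int) -> int:
    """Reject cooldown values outside the allowed 1-3600 second range."""
    if not 1 <= seconds <= 3600:
        raise ValueError("cooldown must be between 1 and 3600 seconds")
    return seconds

cooldown = validate_cooldown(300)  # wait 5 minutes between trigger executions
```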
From the dedicated form (see below), choose among the three available trigger types:
GPU memory utilization
Memory utilization
HTTP requests
For each of these trigger types, you can set a threshold value as a percentage of the resource capacity, from a minimum of 1% to a maximum of 100%. When the threshold is reached, the number of pods increases within your limits, then decreases back to the initial level when the metric drops, ensuring stable operation and high performance.
You can configure up to three thresholds, one for each of the three trigger types. Only the HTTP requests trigger can scale pods to and from 0.
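To make these constraints concrete, here is a hypothetical Python sketch of a trigger configuration with one threshold per type; the field names (type, threshold_percent, scale_to_zero) are assumptions, and the checks mirror the limits documented above:

```python
# Hypothetical trigger configuration with one threshold per trigger type.
# Thresholds are percentages between 1 and 100; per the docs above, only
# the HTTP requests trigger may scale the deployment down to zero pods.
triggers = [
    {"type": "gpu_memory_utilization", "threshold_percent": 85},
    {"type": "memory_utilization", "threshold_percent": 75},
    {"type": "http_requests", "threshold_percent": 60, "scale_to_zero": True},
]

for trigger in triggers:
    # enforce the documented 1-100% threshold range
    assert 1 <= trigger["threshold_percent"] <= 100, "threshold must be 1-100%"
    # scale-to-zero is only valid for the HTTP requests trigger
    assert not trigger.get("scale_to_zero") or trigger["type"] == "http_requests"
```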