Autoscaling limits
With Sesterce Cloud's AI inference feature, you can define autoscaling limits to manage the scale of your model while keeping your expenses under control.
To define your autoscaling limits, you set two main elements:
The minimum and maximum number of pods to be mobilized
The triggers that automatically scale your resources
From the dedicated view, define the minimum and maximum number of pods to be mobilized to run inference for your model.
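As an illustration, here is a minimal Python sketch of what such a configuration could look like as a data structure; the field names (min_pods, max_pods, triggers) are illustrative assumptions, not Sesterce Cloud's actual API schema:

```python
# Hypothetical autoscaling configuration expressed as a plain Python
# dictionary. Field names are illustrative assumptions, not Sesterce
# Cloud's actual schema.
autoscaling_limits = {
    "min_pods": 1,   # smallest number of pods kept running
    "max_pods": 8,   # hard ceiling on scale-out
    "triggers": [
        # scale out when GPU memory utilization crosses 80%
        {"type": "gpu_memory_utilization", "threshold_percent": 80},
    ],
}
print(autoscaling_limits)
```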
If you want to deploy your model in several regions, you can either set the same limits in each of them or set distinct limits per region.
This is particularly useful if you expect a peak of users on your endpoint in a specific region.
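For example, per-region limits could be modeled as a shared default plus an override for the region where a peak is expected; the region and field names below are hypothetical, for illustration only:

```python
# Hypothetical per-region limits: a shared default plus an override for
# a region where a usage peak is expected. Region and field names are
# illustrative assumptions.
regional_limits = {
    "default": {"min_pods": 1, "max_pods": 4},
    "overrides": {
        # expected traffic peak in this region, so allow more pods
        "eu-west": {"min_pods": 2, "max_pods": 12},
    },
}

def limits_for(region: str) -> dict:
    """Return the limits for a region, falling back to the default."""
    return regional_limits["overrides"].get(region, regional_limits["default"])

print(limits_for("eu-west"))   # {'min_pods': 2, 'max_pods': 12}
print(limits_for("us-east"))   # {'min_pods': 1, 'max_pods': 4}
```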
The cooldown period is the interval (in seconds) between trigger executions. This setting helps avoid frequent and unnecessary scaling adjustments. You can select a value between 1 and 3600 seconds.
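As a sketch, the allowed range could be enforced like this in Python; the helper is hypothetical, not part of Sesterce Cloud:

```python
def validate_cooldown(seconds: int) -> int:
    """Reject cooldown values outside the allowed 1-3600 second range."""
    if not 1 <= seconds <= 3600:
        raise ValueError("cooldown must be between 1 and 3600 seconds")
    return seconds

cooldown = validate_cooldown(300)  # wait 5 minutes between trigger executions
```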
From the dedicated form (see below), choose among the three available trigger types:
GPU memory utilization
Memory utilization
HTTP requests
For each of these trigger types, you can set a threshold value as a percentage of the resource capacity, from a minimum of 1% to a maximum of 100%. When the threshold is reached, the number of pods increases within your limits, then decreases back to the initial level when the metric drops, ensuring stable operation and high performance.
You can configure up to three thresholds, one for each of the three trigger types. Only the HTTP requests trigger can scale pods to and from 0.
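To make these constraints concrete, here is a hypothetical Python sketch of a trigger configuration with one threshold per type; the field names (type, threshold_percent, scale_to_zero) are assumptions, and the checks mirror the limits documented above:

```python
# Hypothetical trigger configuration with one threshold per trigger type.
# Thresholds are percentages between 1 and 100; per the docs above, only
# the HTTP requests trigger may scale the deployment down to zero pods.
triggers = [
    {"type": "gpu_memory_utilization", "threshold_percent": 85},
    {"type": "memory_utilization", "threshold_percent": 75},
    {"type": "http_requests", "threshold_percent": 60, "scale_to_zero": True},
]

for trigger in triggers:
    # enforce the documented 1-100% threshold range
    assert 1 <= trigger["threshold_percent"] <= 100, "threshold must be 1-100%"
    # scale-to-zero is only valid for the HTTP requests trigger
    assert not trigger.get("scale_to_zero") or trigger["type"] == "http_requests"
```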