# Autoscaling limits

With Sesterce Cloud's AI inference feature, you can define autoscaling limits to manage the scale of your model while keeping your expenses under control.

To define your autoscaling limits, set the following two main elements:

1. The minimum and maximum number of pods to run
2. The triggers that scale your resources automatically

### How to define autoscaling limits?

From the advanced configuration settings, define the minimum and maximum number of pods used to run inference on your model.

If you deploy your model in several regions, you can either apply the same limits to all of them or set specific limits for each region.

{% hint style="info" %}
This is particularly useful if you expect to have a peak of users on your endpoint in a specific region.
{% endhint %}

<figure><img src="/files/KMJyyfkD67PNUolO7xVR" alt=""><figcaption></figcaption></figure>
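To make the shared-versus-per-region choice concrete, here is a minimal Python sketch of how such limits could be modelled on the client side. It is not the Sesterce API: `PodLimits`, `limits_for`, and the region name are hypothetical names chosen for illustration, and the validation rules are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodLimits:
    """Hypothetical container for the min/max pod counts of one region."""
    min_pods: int
    max_pods: int

    def __post_init__(self):
        # Assumed sanity checks; the platform's real rules may differ.
        if self.min_pods < 0:
            raise ValueError("min_pods must not be negative")
        if self.max_pods < max(self.min_pods, 1):
            raise ValueError("max_pods must be at least 1 and not below min_pods")


def limits_for(region, default, overrides=None):
    """Return the region-specific limits if any, else the shared default."""
    return (overrides or {}).get(region, default)


# Shared limits for every region, plus a higher ceiling for one busy region.
default = PodLimits(min_pods=1, max_pods=4)
overrides = {"region-a": PodLimits(min_pods=2, max_pods=10)}  # made-up region name
```

Calling `limits_for("region-a", default, overrides)` returns the override, while any other region falls back to `default`.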

### What is the Cooldown Period?

The cooldown period is the interval (in seconds) between trigger executions. This setting helps avoid frequent and unnecessary scaling adjustments. You can select a value between 1 and 3600 seconds.
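The effect of a cooldown can be sketched as a simple time gate. The Python below is an illustration only, assuming a single gate shared by all scaling actions; the `Cooldown` class and its `allow` method are hypothetical and do not describe how Sesterce implements the setting. Only the 1 to 3600 second range comes from this page.

```python
MIN_COOLDOWN_S = 1
MAX_COOLDOWN_S = 3600


class Cooldown:
    """Illustrative gate that allows one scaling action per cooldown interval."""

    def __init__(self, seconds):
        if not MIN_COOLDOWN_S <= seconds <= MAX_COOLDOWN_S:
            raise ValueError("cooldown must be between 1 and 3600 seconds")
        self.seconds = seconds
        self._last_action = None  # timestamp of the last allowed action

    def allow(self, now):
        """Return True and record the action if the cooldown has elapsed."""
        if self._last_action is not None and now - self._last_action < self.seconds:
            return False
        self._last_action = now
        return True
```

With `Cooldown(60)`, an action at `t = 0` is allowed, a second one at `t = 30` is rejected, and the next one is accepted again from `t = 60`.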

### How to set autoscaling triggers?

From the dedicated form (see below), choose among the available trigger types:

* GPU memory utilization
* Memory utilization
* RAM usage
* HTTP requests

<figure><img src="/files/IuolWUkIDUqx376vRBLt" alt=""><figcaption></figcaption></figure>

For each trigger type, you can set a threshold value as a percentage. When this value is reached, the number of pods increases within the limits you defined; when the value drops again, the number of pods decreases back to its initial level. This keeps operation stable and performance high.

{% hint style="info" %}
You can configure up to 3 thresholds, one per trigger type.
{% endhint %}

The minimum setting is **1% of the resource capacity**, and **only the HTTP requests trigger can scale pods to and from 0.** The maximum setting is 100% of the resource capacity.
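The threshold rules above can be sketched as a small decision function. This is a simplified illustration, not Sesterce's actual scaling algorithm: the function name, the trigger identifiers, and the one-pod-per-evaluation step are all assumptions. What it takes from this page is the 1 to 100 percent threshold range, the clamping to the configured pod limits, and the rule that only the HTTP requests trigger may scale to 0.

```python
HTTP_TRIGGER = "http_requests"  # hypothetical identifier for the HTTP requests trigger


def desired_pods(current, utilization_pct, threshold_pct,
                 min_pods, max_pods, trigger="gpu_memory"):
    """Return the next pod count for one evaluation (illustrative rule only)."""
    if not 1 <= threshold_pct <= 100:
        raise ValueError("threshold must be between 1 and 100 percent")
    if min_pods == 0 and trigger != HTTP_TRIGGER:
        raise ValueError("only the HTTP requests trigger can scale to 0")
    # Assumed step size: add one pod at or above the threshold, remove one below it.
    if utilization_pct >= threshold_pct:
        target = current + 1
    else:
        target = current - 1
    # Never leave the configured autoscaling limits.
    return max(min_pods, min(max_pods, target))
```

For example, with limits of 1 to 4 pods and an 80% threshold, 2 pods at 85% utilization become 3, while 4 pods at 95% stay at 4 because the maximum is already reached.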

