# Expose an AI model from Hugging Face using vLLM

## Choose your VM offer

The first step is to choose the VM offer tailored to your needs on [**Sesterce Cloud**](https://cloud.sesterce.com/compute). The choice of the instance will depend on several factors, such as the size of your model and the number of end-users.

<figure><img src="/files/VMmUl4lWDRG7drh9g3jD" alt=""><figcaption></figcaption></figure>

Consider the following examples, depending on your use case:

**Small Scale (1-10 simultaneous users)**

* 7B params: L4 (24GB VRAM)
* 13B params: L40S (48GB VRAM)
* Example: Mistral 7B with 8 users in 4K context = 16GB + (8 × 2GB) = 32GB VRAM
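
The arithmetic in the Mistral 7B example can be reproduced for other configurations. The function below is only a rough sketch: it assumes the 16GB term is the fixed footprint of the model and the 2GB term is the budget per simultaneous user, and the name `estimate_vram_gb` is hypothetical.

```shell
# Rough VRAM sizing: fixed cost for the model plus a per-user budget.
# Inputs are whole gigabytes (shell arithmetic is integer-only).
estimate_vram_gb() {
  local model_gb="$1"      # VRAM taken by the model itself (assumed meaning of the 16GB term)
  local users="$2"         # simultaneous users
  local per_user_gb="$3"   # VRAM budget per user at the chosen context length
  echo $(( model_gb + users * per_user_gb ))
}

# Mistral 7B example from the list above: 16GB + (8 x 2GB)
estimate_vram_gb 16 8 2   # prints 32
```

Compare the result with the VRAM of the GPU you plan to rent before choosing an offer.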

**Medium Scale (10-50 users)**

* 7B params: 2-4x L40S in parallel
* 13B params: 4x L40S or H100
* Recommended configuration: Load balancing between multiple GPUs

**Large Scale (50+ users)**

* 7B params: 8x L40S or 2x H100
* 13B+ params: H100 multi-GPU
* Use quantization techniques (INT8/INT4)

## Set up your instance

Well done, you can now configure your VM! Follow [**the steps described here**](/compute-instances/configure-your-compute-instance.md). In the "Images" section, choose the vLLM option. The instance usually takes around 5 minutes to launch.

<figure><img src="/files/eLNEQiYjkLPaQSj6ghle" alt=""><figcaption></figcaption></figure>

## Connect to your instance

Once your VM instance is launched, you'll get an SSH command to connect to it, such as `ssh sesterce@<IP_MACHINE>`.

{% hint style="info" %}
**Pay attention:** a docker pull is still running when you first connect! Wait until it has finished before running the following command.
{% endhint %}

Once the docker pull has finished, run the following command:

```
docker run -d -e HF_TOKEN=<HUGGING_FACE_TOKEN> --runtime nvidia --gpus all --net=host --ipc=host vllm/vllm-openai --model <MODEL_ID>
```

Replace `<HUGGING_FACE_TOKEN>` with your [Hugging Face Token](https://huggingface.co/settings/tokens) and `<MODEL_ID>` with the ID of the model you want to serve.
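
If you prefer not to paste the token into the command itself, the same `docker run` call can be wrapped in a small function. This is a minimal sketch, not part of the vLLM image: the function name `run_vllm` and the `DRY_RUN` switch are hypothetical, and it relies on Docker forwarding a host variable when `-e HF_TOKEN` is given without a value.

```shell
# Start the vLLM container for a given model, reading the token from the environment.
run_vllm() {
  local model_id="${1:?usage: run_vllm <MODEL_ID>}"
  : "${HF_TOKEN:?export HF_TOKEN with your Hugging Face token first}"
  export HF_TOKEN   # make sure the docker CLI can see the variable
  # With DRY_RUN=1 the command is printed instead of executed (hypothetical helper switch).
  # "-e HF_TOKEN" without a value forwards the host variable into the container.
  ${DRY_RUN:+echo} docker run -d -e HF_TOKEN --runtime nvidia --gpus all \
    --net=host --ipc=host vllm/vllm-openai --model "$model_id"
}
```

For example, after `export HF_TOKEN=<HUGGING_FACE_TOKEN>`, running `DRY_RUN=1 run_vllm mistralai/Mistral-7B-v0.1` prints the command, and `run_vllm mistralai/Mistral-7B-v0.1` starts the container. The model ID here is only an example.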

<figure><img src="/files/g9rPpFtk3HrUk49f5mM6" alt=""><figcaption><p>Create token from Hugging Face</p></figcaption></figure>

<figure><img src="/files/Kq8vlo1DeeSwWahumuB8" alt=""><figcaption><p>Get your Model ID from Hugging Face</p></figcaption></figure>

## Use your model

Well done! Once the container is running, you can query the model with the following command (the container listens on port 8000). Make sure you replace `<IP_MACHINE>` and `<MODEL_ID>` in the example with your own values.

```
curl -X POST http://<IP_MACHINE>:8000/v1/completions -H "Content-Type: application/json" -d '{"model": "<MODEL_ID>","prompt": "Hello world","max_tokens": 50,"temperature": 0}'
```
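
To send several prompts without retyping the JSON body, the request above can be wrapped in two small functions. This is a sketch under simple assumptions: the names `completion_payload` and `ask_model` are hypothetical, and the prompt is inserted as-is, so it must not contain double quotes or backslashes.

```shell
# Build the JSON body used by the curl example above.
# Caveat: no JSON escaping is done on the prompt.
completion_payload() {
  local model_id="$1" prompt="$2" max_tokens="${3:-50}"
  printf '{"model": "%s","prompt": "%s","max_tokens": %d,"temperature": 0}' \
    "$model_id" "$prompt" "$max_tokens"
}

# Send one prompt to the server on <host>, port 8000 as in the example above.
ask_model() {
  local host="$1" model_id="$2" prompt="$3"
  curl -s -X POST "http://${host}:8000/v1/completions" \
    -H "Content-Type: application/json" \
    -d "$(completion_payload "$model_id" "$prompt")"
}
```

You would call it as `ask_model <IP_MACHINE> <MODEL_ID> "Hello world"`. The response follows the OpenAI completions format, with the generated text in `choices[0].text`.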

Well done! The model is now accessible from the following endpoint: `http://<IP_MACHINE>:8000/v1/models` :rocket:

<figure><img src="/files/WHEwAJAhc9qjypKGFcDA" alt=""><figcaption></figcaption></figure>
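
Because the model can take a while to load, it may help to poll the `/v1/models` endpoint until it answers before sending requests. The loop below is a sketch only: the function name `wait_for_vllm` and the default of 30 attempts spaced 10 seconds apart are assumptions, not values from Sesterce or vLLM.

```shell
# Poll <base_url>/v1/models until it answers with a successful HTTP status, or give up.
wait_for_vllm() {
  local base_url="$1" attempts="${2:-30}" delay="${3:-10}"
  local i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl fail on HTTP errors; --max-time bounds each attempt to 5 seconds.
    if curl -sf --max-time 5 -o /dev/null "${base_url}/v1/models"; then
      echo "server ready after ${i} attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "server not ready after ${attempts} attempt(s)" >&2
  return 1
}
```

A typical call would be `wait_for_vllm "http://<IP_MACHINE>:8000"`, followed by your requests once it returns successfully.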

## Common errors

### Instance RAM insufficient

Make sure you choose an instance with sufficient RAM to run your model.

### Container not running yet

Wait until the container is running before accessing the model. You can use the following command to check the container status:

```
docker ps
```

* If the list is empty, the launch failed.
* Otherwise, you should see the container name, which you can use to display its logs with the following command:

```
docker logs <CONTAINER_NAME>
```

### Port used or unavailable

You can check the port status with the following command:

```
ss -tuln | grep <PORT>
```
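
A plain `grep <PORT>` also matches longer port numbers that contain the same digits (for example, `8000` matches `18000`). The helper below is a sketch of a stricter check; the name `port_is_listening` is hypothetical, and it assumes the usual `ss -tuln` layout where the local address ends with `:<PORT>` followed by whitespace.

```shell
# Succeeds if the `ss -tuln` output read from stdin contains a socket bound to exactly <port>.
port_is_listening() {
  local port="$1"
  # Match ":<port>" followed by whitespace, so 8000 does not match 18000 or 80000.
  grep -Eq ":${port}[[:space:]]"
}
```

For example, `ss -tuln | port_is_listening 8000 && echo "port 8000 is in use"` prints the message only when something is already bound to port 8000.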

