Expose AI model from Hugging Face using vLLM
You want to expose an API endpoint to the world so that your end users can access an AI model from Hugging Face? This tutorial is for you!
The first step is to choose the VM offer tailored to your needs. The choice of instance will depend on several factors, such as the size of your model and the number of end users.
You can consider the following examples according to your use case:
Small Scale (1-10 simultaneous users)
7B params: L4 (24GB VRAM)
13B params: L40S (48GB VRAM)
Example: Mistral 7B with 8 users in 4K context = 16GB + (8 × 2GB) = 32GB VRAM (see the sketch after this list)
Medium Scale (10-50 users)
7B params: 2-4x L40S in parallel
13B params: 4x L40S or H100
Recommended configuration: Load balancing between multiple GPUs
Large Scale (50+ users)
7B params: 8x L40S or 2x H100
13B+ params: H100 multi-GPU
Use quantization techniques (INT8/INT4)
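The arithmetic behind the Mistral 7B example above can be generalized into a rough estimator. This is only a sketch: the 16 GB weight figure assumes fp16, and the 2 GB-per-user KV-cache figure is an approximation that varies with context length and quantization.

```bash
# Rough VRAM estimate: model weights + per-user KV cache (approximate figures)
WEIGHTS_GB=16      # 7B parameters in fp16 ≈ 16 GB
USERS=8            # simultaneous users
KV_PER_USER_GB=2   # ≈ 2 GB per user at 4K context (rule of thumb)
echo "Estimated VRAM: $(( WEIGHTS_GB + USERS * KV_PER_USER_GB )) GB"
```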
When your VM instance is launched, you'll be able to get the SSH command to connect to it, like ssh sesterce@<IP_MACHINE>.
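For example, assuming the official vllm/vllm-openai image from Docker Hub (your provider's vLLM image may already be pulled for you):

```bash
# Connect to the instance (use the IP shown in your console)
ssh sesterce@<IP_MACHINE>

# Pull the vLLM server image -- assuming the official image from Docker Hub
docker pull vllm/vllm-openai:latest
```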
When the docker pull is finished, use the following command:
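The exact command may differ from what your provider preconfigures; here is a minimal sketch based on vLLM's documented Docker usage, where <MODEL_ID> and <YOUR_HF_TOKEN> are placeholders (the token is only needed for gated models):

```bash
# Start the vLLM OpenAI-compatible server on port 8000
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HUGGING_FACE_HUB_TOKEN=<YOUR_HF_TOKEN>" \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model <MODEL_ID>
```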
Well done! Once the container is running, you'll be able to use the model by typing the following command (the container runs on port 8000). Make sure you replace the variable in the example with your Model ID.
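A sketch using the OpenAI-compatible completions API that vLLM exposes:

```bash
# Query the model locally on port 8000; replace <MODEL_ID> with your Model ID
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "<MODEL_ID>",
        "prompt": "San Francisco is a",
        "max_tokens": 64
      }'
```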
Make sure you choose an instance with sufficient RAM to run your model.
Wait until the container is running before accessing the model. You can use the following command to get the container status:
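For example:

```bash
# List running containers; the vLLM container should appear here
docker ps
```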
If the list is empty, it means the launch failed.
Otherwise, you should see the container name, which you can use in the following command:
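For example, to follow the container's output while the model loads (docker logs is an assumption about the intended command; any command that takes a container name works the same way):

```bash
# Follow the container logs until the model has finished loading
docker logs -f <CONTAINER_NAME>
```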
You can check the port status with the following command:
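For example, using ss (an assumption; netstat -tlnp works similarly on older systems):

```bash
# Check that the server is listening on port 8000
sudo ss -tlnp | grep 8000
```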
Well done, you are now able to configure your VM! In the "Images" section, choose the vLLM option. The instance launch usually takes around 5 minutes.
You can now fill in the required fields and your Model ID (for example, mistralai/Mistral-7B-Instruct-v0.2).
Well done! The model is accessible from the following endpoint:
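Assuming the standard vLLM OpenAI-compatible server on port 8000, the endpoint would be http://<IP_MACHINE>:8000/v1, for example:

```bash
# List the models served by the endpoint
curl http://<IP_MACHINE>:8000/v1/models

# Send a chat completion request (for chat/instruct models)
curl http://<IP_MACHINE>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<MODEL_ID>", "messages": [{"role": "user", "content": "Hello!"}]}'
```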