Which compute instance for AI models training and inference?

Choosing the right GPU instance is crucial to the success of your AI projects. An optimal configuration not only improves performance, but also keeps costs under control. This guide will help you navigate through the various options available to find the ideal solution for your needs.

Why is the compute instance choice so important?

Your choice of GPU infrastructure has a direct impact on:

  • Performance: model training speed and inference latency

  • Cost: optimizing your budget by avoiding over-sizing

  • Scalability: ability to scale according to your needs

  • Reliability: stability of your workloads in production

Which compute instance to choose for model training?

1. Large Language Models (LLMs) training

Fine-tuning Large Language Models represents one of the most resource-intensive tasks in modern AI development.

The hardware requirements vary significantly based on model size, from smaller 7B parameter models to massive 70B+ architectures. This section will help you select the optimal configuration for your fine-tuning project, ensuring efficient resource utilization while maintaining performance.

| Model Size | Server type | VRAM | Recommended Offers |
| --- | --- | --- | --- |
| Small (less than 7B parameters) | VM | From 22 GB (for 1B parameters models) to 140 GB (to fine-tune models such as DeepSeek-R1 7B) | 1 to 3B models: the most cost-effective 👉 1xA100 80G, the most efficient 👉 1xH100 80G. 7B models: 👉 1xH200 141 GB |
| Medium (12B-32B) | VM or Bare-Metal | From 200 to 500 GB (to fine-tune models such as DeepSeek-R1 32B) | 12 to 14B models: 👉 2xH200 141 GB. 27 to 32B models: 👉 8xA100 80G Bare Metal or 4xH200 141 GB |
| Large (70B and more) | Bare-Metal | More than 1000GB | 👉 8xH200 141 GB Bare-Metal, 👉 8xB200 192 GB Bare-Metal |
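
As a rough cross-check of the VRAM column above, you can estimate full fine-tuning memory from a bytes-per-parameter rule of thumb. The sketch below assumes mixed-precision training with Adam (weights, gradients, optimizer states and some activation headroom, roughly 20 bytes per parameter, which roughly matches the figures in the table); techniques such as LoRA, QLoRA or gradient checkpointing lower the requirement substantially.

```python
# Rough VRAM estimate for full fine-tuning. The ~20 bytes/parameter factor is an
# assumption (fp16/bf16 weights + gradients + Adam states + activation headroom);
# parameter-efficient methods can reduce it by an order of magnitude.

def estimate_finetune_vram_gb(params_billion: float, bytes_per_param: float = 20.0) -> float:
    """Order-of-magnitude VRAM need in GB for full fine-tuning."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

for size in (1, 7, 70):
    print(f"{size}B parameters -> ~{estimate_finetune_vram_gb(size):.0f} GB")
# 1B -> ~20 GB, 7B -> ~140 GB, 70B -> ~1400 GB
```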

2. Computer Vision models training

Whether you're developing object detection systems, processing medical imagery, or creating next-generation AI art, selecting the right GPU infrastructure is crucial for your success.

The computational requirements for vision tasks vary significantly based on complexity and scale. Classification tasks might require modest GPU power, while advanced generative models demand substantial computational resources. This section outlines three primary categories of computer vision workloads - classification, segmentation, and generation - each with its unique hardware requirements and optimal configurations.

| Use Case | Models example | Resource intensity | Batch |
| --- | --- | --- | --- |
| Classification (images, videos): object detection, face recognition | ResNet, YOLO | ⭐️⭐️⭐️ | 32-64 |
| Segmentation (pixel-level): medical imaging, satellite analysis | U-Net, DeepLab | ⭐️⭐️⭐️⭐️ | 16-32 |
| Generative models (stable diffusion): image generation, AI art | Stable Diffusion | ⭐️⭐️⭐️⭐️⭐️ | 8-16 |
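
To decide which row of the tables below applies to you, a quick throughput calculation is often enough. The sketch below is illustrative: the images-per-second figure and the example workload are assumptions, and should be replaced with a short benchmark of your own model and data pipeline.

```python
# Back-of-the-envelope training-time estimate to help pick a tier from the tables below.
# The per-GPU throughput is an assumed figure; measure it on a small run before sizing.

def estimate_training_hours(num_images: int, epochs: int,
                            images_per_sec_per_gpu: float, num_gpus: int) -> float:
    total_images = num_images * epochs
    return total_images / (images_per_sec_per_gpu * num_gpus) / 3600

# Example: 2M images, 50 epochs, ~300 img/s per GPU (hypothetical ResNet-50-class figure)
hours = estimate_training_hours(2_000_000, 50, 300, 4)
print(f"~{hours:.0f} h (~{hours / 24:.1f} days) on 4 GPUs")
```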

Classification Models

| Data Volume | Training Time | Recommended Instance | Type | Comment |
| --- | --- | --- | --- | --- |
| <100GB | <24h | 1xL40S 48 GB | VM | Perfect for development and testing |
| 100-500GB | 1-3 days | 1xH100 80GB | VM | Parallel processing beneficial |
| 500GB-1TB | 3-7 days | 2xL40S 48 GB | VM/BM | Higher throughput needed |
| >1TB | >1 week | 8xH100 80 GB | BM | Bare Metal for optimal performance |

Segmentation Models

| Data Volume | Training Time | Recommended Instance | Type | Comment |
| --- | --- | --- | --- | --- |
| < 200GB | < 48h | 2x H200 141 GB | VM | High-res image processing |
| 200GB-1TB | 3-5 days | 4x A100 80GB | VM | Multiple batch processing |
| 1TB-5TB | 1-2 weeks | 8x H100 80 GB | BM | Heavy data augmentation |
| > 5TB | > 2 weeks | 8x H200 141 GB | BM | Maximum processing power |

Generative Models

| Data Volume | Training Time | Recommended Instance | Type | Notes |
| --- | --- | --- | --- | --- |
| < 500GB | < 3 days | 4x H100 80 GB | VM | Model fine-tuning |
| 500GB-2TB | 3-7 days | 8x H100 80 GB | BM | Full model training |
| 2TB-10TB | 1-3 weeks | 8x H200 141 GB | BM | Large scale training |
| > 10TB | > 3 weeks | 16x H100 80GB | BM | Distributed training |

3. Audio/Speech models training

Whether you're developing voice recognition systems, building text-to-speech applications, or exploring the cutting edge of AI music generation, selecting the right GPU infrastructure is crucial for successful model training.

This section outlines three primary categories of audio ML workloads - speech recognition, text-to-speech synthesis, and audio generation - each requiring specific hardware configurations to achieve optimal performance.

| Use Case | Input Data Type | Dataset Size | Model Examples | Resource Intensity |
| --- | --- | --- | --- | --- |
| Speech Recognition | Audio files (.wav, .mp3), Labeled transcripts | 100GB-1TB | Whisper, DeepSpeech, Wav2Vec | ⭐⭐⭐ |
| Text-to-Speech | Text corpus, Audio pairs | 50-500GB | Tacotron, FastSpeech, VALL-E | ⭐⭐⭐⭐ |
| Audio Generation | Audio samples, MIDI files | 1-2TB | MusicLM, AudioLDM, Stable Audio | ⭐⭐⭐⭐⭐ |
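
The dataset-size brackets above can be translated into hours of audio to get a feel for scale. The sketch below assumes uncompressed 16 kHz, 16-bit mono WAV, a common format for speech datasets; compressed formats such as MP3 are several times smaller per hour.

```python
# How many hours of raw audio fit in the dataset-size brackets above.
# Assumption: uncompressed 16 kHz, 16-bit mono WAV.

def wav_gb_per_hour(sample_rate_hz: int = 16_000, bytes_per_sample: int = 2,
                    channels: int = 1) -> float:
    return sample_rate_hz * bytes_per_sample * channels * 3600 / 1e9

gb_per_hour = wav_gb_per_hour()               # ~0.115 GB per hour of audio
print(f"{gb_per_hour:.3f} GB per hour")
print(f"100 GB ≈ {100 / gb_per_hour:,.0f} hours of speech")
print(f"1 TB  ≈ {1000 / gb_per_hour:,.0f} hours of speech")
```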

Speech Recognition Models

| Model Size (param) | VRAM/RAM Needed | Recommended Instance | Type | Comment |
| --- | --- | --- | --- | --- |
| Small (< 100M params) | 24GB/64GB | 1x A6000 48GB | VM | Development/testing |
| Medium (100M-500M) | 48GB/128GB | 2x L40S 48GB | VM | Production training |
| Large (500M-1B) | 96GB/256GB | 1x H100 80GB or 4x L40S 48GB | VM/BM | Large scale training |
| Very Large (>1B) | 160GB/384GB | 8x H100 80GB | BM | Enterprise scale |

Text-to-Speech Models

| Model Size | VRAM/RAM Needed | Recommended Instance | Type | Notes |
| --- | --- | --- | --- | --- |
| Small (< 200M params) | 80GB/192GB | 2x H100 80GB | VM | Basic TTS |
| Medium (200M-500M) | 160GB/384GB | 4x H100 80GB | VM | Multi-speaker |
| Large (500M-1B) | 320GB/768GB | 8x H100 80GB | BM | High-quality TTS |
| Very Large (>1B) | 640GB/1TB | 8x H200 141GB | BM | Enterprise TTS |

Which compute instance to choose for model inference?

1. Large Language Models (LLMs) inference

Whether you're serving chatbots, content generation, or text analysis applications, choosing the right infrastructure is crucial for balancing performance, cost, and user experience.

The requirements for LLM inference vary significantly based on several key factors: model size (from 7B to 70B+ parameters), user load (from individual testing to thousands of concurrent users), and latency requirements (from real-time chat applications to batch processing). Each of these factors directly impacts your choice of infrastructure, from single GPU instances to distributed multi-GPU deployments.
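
A quick way to reason about these requirements is to add the model weights to the KV cache that grows with context length and concurrent sequences. The sketch below uses illustrative Llama-style architecture values (layer count, grouped-query KV heads, head dimension) that are assumptions rather than Sesterce specifications, and it ignores optimizations such as quantization or paged attention.

```python
# Minimal sizing sketch for LLM inference memory: weights + KV cache.
# Architecture defaults are illustrative Llama-style values (assumptions).

def llm_inference_vram_gb(n_params_billion: float,
                          weight_bytes: float = 2.0,      # fp16/bf16 weights
                          n_layers: int = 32,
                          n_kv_heads: int = 8,            # GQA-style model
                          head_dim: int = 128,
                          kv_bytes: float = 2.0,
                          context_len: int = 4096,
                          concurrent_seqs: int = 10) -> float:
    weights = n_params_billion * 1e9 * weight_bytes
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes   # K and V
    kv_cache = kv_per_token * context_len * concurrent_seqs
    return (weights + kv_cache) / 1e9

print(f"7B, 10 users, 4k context:  ~{llm_inference_vram_gb(7):.0f} GB")
print(f"70B, 50 users, 4k context: ~{llm_inference_vram_gb(70, n_layers=80, concurrent_seqs=50):.0f} GB")
```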

LLM Inference Sizing - Small Scale (1-50 concurrent users)

| Model Size | Concurrent Users | VRAM/RAM needed | Latency Target | Recommended Instance | Type | Estimated RPS* | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | 1-10 | 16GB/32GB | < 100ms | 1x RTX4090 | VM | 15-20 | Development/testing |
| 7B | 11-25 | 24GB/64GB | < 100ms | 1x L40S | VM | 30-40 | Small production |
| 7B | 26-50 | 48GB/128GB | < 100ms | 1x H100 | VM | 60-80 | Medium production |
| 13B | 1-10 | 24GB/64GB | < 150ms | 1x RTX4090 | VM | 10-15 | Development/testing |
| 13B | 11-25 | 48GB/128GB | < 150ms | 1x H100 | VM | 25-35 | Small production |
| 13B | 26-50 | 80GB/192GB | < 150ms | 2x H100 | VM/BM | 50-70 | Medium production |
| 70B | 1-10 | 80GB/192GB | < 200ms | 2x H100 | VM/BM | 5-8 | Small production |
| 70B | 11-25 | 160GB/384GB | < 200ms | 4x H100 | BM | 15-20 | Medium production |
| 70B | 26-50 | 320GB/768GB | < 200ms | 8x H100 | BM | 35-45 | Large production |
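
The Estimated RPS column can be sanity-checked with Little's law: concurrency ≈ throughput × average time per request. The request times below are assumptions for short chat-style completions, not measured values for any particular offer.

```python
# Little's law sanity check for the RPS columns: RPS ~= concurrent users / avg request time.

def estimated_rps(concurrent_users: int, avg_request_seconds: float) -> float:
    return concurrent_users / avg_request_seconds

print(f"{estimated_rps(10, 0.5):.0f} RPS for 10 users at ~0.5 s per request")
print(f"{estimated_rps(50, 0.7):.0f} RPS for 50 users at ~0.7 s per request")
```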

LLM Inference Sizing - Medium Scale (51-200 concurrent users)

| Model Size | Concurrent Users | VRAM/RAM | Latency Target | Recommended Instance | Type | Estimated RPS* | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | 51-100 | 48GB/128GB | < 100ms | 2x L40S | VM | 100-120 | Production |
| 7B | 101-150 | 80GB/192GB | < 100ms | 2x H100 | VM/BM | 150-180 | High-performance |
| 7B | 151-200 | 160GB/384GB | < 100ms | 4x H100 | BM | 200-240 | Enterprise scale |
| 13B | 51-100 | 160GB/384GB | < 150ms | 4x H100 | BM | 80-100 | Production |
| 13B | 101-150 | 240GB/512GB | < 150ms | 4x H200 | BM | 120-150 | High-performance |
| 13B | 151-200 | 320GB/768GB | < 150ms | 8x H100 | BM | 160-200 | Enterprise scale |
| 70B | 51-100 | 480GB/1TB | < 200ms | 8x H200 | BM | 60-80 | Production |
| 70B | 101-150 | 640GB/1.5TB | < 200ms | 16x H100 | BM | 90-120 | High-performance |
| 70B | 151-200 | 800GB/2TB | < 200ms | 16x H200 | BM | 140-180 | Enterprise scale |

LLM Inference Sizing - Large Scale (201-1000+ concurrent users)

| Model Size | Concurrent Users | VRAM/RAM | Latency Target | Recommended Instance | Type | Estimated RPS* | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | 201-500 | 320GB/768GB | < 100ms | 8x H100 | BM | 300-400 | Enterprise scale |
| 7B | 501-1000 | 640GB/1.5TB | < 100ms | 8x H200 | BM | 600-800 | High-scale production |
| 7B | 1000+ | 1.2TB/2.5TB | < 100ms | 16x H200 | BM | 1000+ | Distributed clusters |
| 13B | 201-500 | 480GB/1TB | < 150ms | 8x H200 | BM | 250-350 | Enterprise scale |
| 13B | 501-1000 | 800GB/2TB | < 150ms | 8x H200 | BM | 500-700 | High-scale production |
| 13B | 1000+ | 1.6TB/3TB | < 150ms | 16x H200 | BM | 800+ | Distributed clusters |
| 70B | 201-500 | 1.2TB/2.5TB | < 200ms | 16x H100 | BM | 200-300 | Enterprise scale |
| 70B | 501-1000 | 2TB/4TB | < 200ms | 24x H200 | BM | 400-600 | High-scale production |
| 70B | 1000+ | 3TB/6TB | < 200ms | 32x H200 | BM | 700+ | Distributed clusters |

2. Image Generation Inference Sizing

Deploying image generation models like Stable Diffusion for production introduces unique infrastructure challenges compared to traditional ML workloads.

The hardware requirements vary significantly based on three key factors: model complexity (from base models to SDXL with refiners), concurrent user load (affecting batch processing and queue management), and image generation parameters (resolution, steps, and additional features like ControlNet or inpainting). Each of these factors directly impacts your choice of infrastructure and can significantly affect both performance and operational costs.
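
For capacity planning, the images-per-minute figures in the tables below can be turned into a rough GPU count. Both numbers in the example are assumptions (per-user request rate and per-GPU throughput for an SDXL-class model); benchmark your actual pipeline, including resolution, step count and batching, before sizing.

```python
# Rough GPU-count estimate for image-generation serving, under assumed request
# rate and per-GPU throughput. Each request is taken to produce one image.
import math

def gpus_needed(concurrent_users: int, requests_per_user_per_min: float,
                images_per_min_per_gpu: float) -> int:
    demand = concurrent_users * requests_per_user_per_min   # images requested per minute
    return math.ceil(demand / images_per_min_per_gpu)

# Example: 100 users at ~2 images/min each, ~15 images/min per GPU (hypothetical SDXL figure)
print(gpus_needed(100, 2, 15), "GPUs")   # -> 14
```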

Small scale (1-50 concurrent users)

| Model Type | Concurrent Users | VRAM/RAM | Latency Target* | Recommended Instance | Type | Images/Minute** | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SD XL Base | 1-10 | 16GB/32GB | < 3s | 1x RTX4090 | VM | 15-20 | Development/testing |
| SD XL Base | 11-25 | 24GB/64GB | < 3s | 1x L40S | VM | 30-40 | Small production |
| SD XL Base | 26-50 | 48GB/128GB | < 3s | 1x H100 | VM | 60-80 | Medium production |
| SD XL + Refiner | 1-10 | 24GB/64GB | < 5s | 1x L40S | VM | 10-15 | Development/testing |
| SD XL + Refiner | 11-25 | 48GB/128GB | < 5s | 1x H100 | VM | 25-35 | Small production |
| SD XL + Refiner | 26-50 | 80GB/192GB | < 5s | 2x H100 | VM or BM | 50-70 | Medium production |

Medium scale (51-200 concurrent users)

| Model Type | Concurrent Users | VRAM/RAM | Latency Target* | Recommended Instance | Type | Images/Minute** | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SD XL Base | 51-100 | 160GB/384GB | < 3s | 4x H100 | VM or BM | 120-150 | Production |
| SD XL Base | 101-150 | 320GB/768GB | < 3s | 8x H100 | BM | 200-250 | High-performance |
| SD XL Base | 151-200 | 480GB/1TB | < 3s | 8x H200 | BM | 300-350 | Enterprise scale |
| SD XL + Refiner | 51-100 | 320GB/768GB | < 5s | 8x H100 | BM | 100-130 | Production |
| SD XL + Refiner | 101-150 | 480GB/1TB | < 5s | 8x H200 | BM | 180-220 | High-performance |
| SD XL + Refiner | 151-200 | 640GB/1.5TB | < 5s | 8x H200 | BM | 250-300 | Enterprise scale |

Large scale (201-1000+ concurrent users)

| Model Type | Concurrent Users | VRAM/RAM | Latency Target* | Recommended Instance | Type | Images/Minute** | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SD XL Base | 201-500 | 800GB/2TB | < 3s | 16x H100 | BM | 400-500 | Multi-cluster |
| SD XL Base | 501-1000 | 1.6TB/4TB | < 3s | 16x H200 | BM | 800-1000 | Distributed system |
| SD XL Base | 1000+ | 2.4TB/6TB | < 3s | 24x H200 | BM | 1500+ | Global distribution |
| SD XL + Refiner | 201-500 | 1.2TB/3TB | < 5s | 24x H100 | BM | 350-450 | Multi-cluster |
| SD XL + Refiner | 501-1000 | 2TB/5TB | < 5s | 32x H200 | BM | 700-900 | Distributed system |
| SD XL + Refiner | 1000+ | 3TB/8TB | < 5s | 40x H200 | BM | 1200+ | Global distribution |

