Which compute instance for AI models training and inference?

Choosing the right GPU instance is crucial to the success of your AI projects. An optimal configuration not only improves performance, but also keeps costs under control. This guide will help you navigate through the various options available to find the ideal solution for your needs.

Why is the compute instance choice so important?

Your choice of GPU infrastructure has a direct impact on:

  • Performance: model training speed and inference latency

  • Cost: optimizing your budget by avoiding over-sizing

  • Scalability: ability to scale according to your needs

  • Reliability: stability of your workloads in production

Which compute instance to choose for model training?

1. Large Language Models (LLMs) training

Fine-tuning Large Language Models represents one of the most resource-intensive tasks in modern AI development.

The hardware requirements vary significantly based on model size, from smaller 7B parameter models to massive 70B+ architectures. This section will help you select the optimal configuration for your fine-tuning project, ensuring efficient resource utilization while maintaining performance.

| Model Size | Server type | VRAM | Recommended Offers |
| --- | --- | --- | --- |
| Small (less than 7B parameters) | VM | From 22 GB (for 1B parameters models) to 140 GB (to fine-tune models such as DeepSeek-R1 7B) | 1 to 3B models: the most cost-effective 👉 1xA100 80G, the most efficient 👉 1xH100 80G. 7B models: 👉 1xH200 141 GB |
| Medium (12B-32B) | VM or Bare-Metal | From 200 to 500 GB (to fine-tune models such as DeepSeek-R1 32B) | 12 to 14B models: 👉 2xH200 141 GB. 27 to 32B models: 👉 8xA100 80G Bare Metal or 4xH200 141 GB |
| Large (70B and more) | Bare-Metal | More than 1000GB | 👉 8xH200 141 GB Bare-Metal, 👉 8xB200 192 GB Bare-Metal |
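
As a rough cross-check of the VRAM column above, you can estimate full fine-tuning memory from a bytes-per-parameter rule of thumb. The sketch below assumes mixed-precision training with Adam (weights, gradients, optimizer states and some activation headroom, roughly 20 bytes per parameter, which roughly matches the figures in the table); techniques such as LoRA, QLoRA or gradient checkpointing lower the requirement substantially.

```python
# Rough VRAM estimate for full fine-tuning. The ~20 bytes/parameter factor is an
# assumption (fp16/bf16 weights + gradients + Adam states + activation headroom);
# parameter-efficient methods can reduce it by an order of magnitude.

def estimate_finetune_vram_gb(params_billion: float, bytes_per_param: float = 20.0) -> float:
    """Order-of-magnitude VRAM need in GB for full fine-tuning."""
    return params_billion * bytes_per_param  # 1e9 params * bytes / 1e9 bytes-per-GB

for size in (1, 7, 70):
    print(f"{size}B parameters -> ~{estimate_finetune_vram_gb(size):.0f} GB")
# 1B -> ~20 GB, 7B -> ~140 GB, 70B -> ~1400 GB
```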

2. Computer Vision models training

Whether you're developing object detection systems, processing medical imagery, or creating next-generation AI art, selecting the right GPU infrastructure is crucial for your success.

The computational requirements for vision tasks vary significantly based on complexity and scale. Classification tasks might require modest GPU power, while advanced generative models demand substantial computational resources. This section outlines three primary categories of computer vision workloads - classification, segmentation, and generation - each with its unique hardware requirements and optimal configurations.

| Use Case | Models example | Resource intensity | Batch |
| --- | --- | --- | --- |
| Classification (images, videos): object detection, face recognition | ResNet, YOLO | ⭐️⭐️⭐️ | 32-64 |
| Segmentation (pixel-level): medical imaging, satellite analysis | U-Net, DeepLab | ⭐️⭐️⭐️⭐️ | 16-32 |
| Generative models (stable diffusion): image generation, AI art | Stable Diffusion | ⭐️⭐️⭐️⭐️⭐️ | 8-16 |
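
To decide which row of the tables below applies to you, a quick throughput calculation is often enough. The sketch below is illustrative: the images-per-second figure and the example workload are assumptions, and should be replaced with a short benchmark of your own model and data pipeline.

```python
# Back-of-the-envelope training-time estimate to help pick a tier from the tables below.
# The per-GPU throughput is an assumed figure; measure it on a small run before sizing.

def estimate_training_hours(num_images: int, epochs: int,
                            images_per_sec_per_gpu: float, num_gpus: int) -> float:
    total_images = num_images * epochs
    return total_images / (images_per_sec_per_gpu * num_gpus) / 3600

# Example: 2M images, 50 epochs, ~300 img/s per GPU (hypothetical ResNet-50-class figure)
hours = estimate_training_hours(2_000_000, 50, 300, 4)
print(f"~{hours:.0f} h (~{hours / 24:.1f} days) on 4 GPUs")
```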

Classification Models

| Data Volume | Training Time | Recommended Instance | Type | Comment |
| --- | --- | --- | --- | --- |
| <100GB | <24h | 1xL40S 48 GB | VM | Perfect for development and testing |
| 100-500GB | 1-3 days | 1xH100 80GB | VM | Parallel processing beneficial |
| 500GB-1TB | 3-7 days | 2xL40S 48 GB | VM/BM | Higher throughput needed |
| >1TB | >1 week | 8xH100 80 GB | BM | Bare Metal for optimal performance |

Segmentation Models

| Data Volume | Training Time | Recommended Instance | Type | Comment |
| --- | --- | --- | --- | --- |
| < 200GB | < 48h | 2x H200 141 GB | VM | High-res image processing |
| 200GB-1TB | 3-5 days | 4x A100 80GB | VM | Multiple batch processing |
| 1TB-5TB | 1-2 weeks | 8x H100 80 GB | BM | Heavy data augmentation |
| > 5TB | > 2 weeks | 8x H200 141 GB | BM | Maximum processing power |

Generative Models

| Data Volume | Training Time | Recommended Instance | Type | Notes |
| --- | --- | --- | --- | --- |
| < 500GB | < 3 days | 4x H100 80 GB | VM | Model fine-tuning |
| 500GB-2TB | 3-7 days | 8x H100 80 GB | BM | Full model training |
| 2TB-10TB | 1-3 weeks | 8x H200 141 GB | BM | Large scale training |
| > 10TB | > 3 weeks | 16x H100 80GB | BM | Distributed training |

3. Audio/Speech models training

Whether you're developing voice recognition systems, building text-to-speech applications, or exploring the cutting edge of AI music generation, selecting the right GPU infrastructure is crucial for successful model training.

This section outlines three primary categories of audio ML workloads - speech recognition, text-to-speech synthesis, and audio generation - each requiring specific hardware configurations to achieve optimal performance.

| Use Case | Input Data Type | Dataset Size | Model Examples | Resource Intensity |
| --- | --- | --- | --- | --- |
| Speech Recognition | Audio files (.wav, .mp3), Labeled transcripts | 100GB-1TB | Whisper, DeepSpeech, Wav2Vec | ⭐⭐⭐ |
| Text-to-Speech | Text corpus, Audio pairs | 50-500GB | Tacotron, FastSpeech, VALL-E | ⭐⭐⭐⭐ |
| Audio Generation | Audio samples, MIDI files | 1-2TB | MusicLM, AudioLDM, Stable Audio | ⭐⭐⭐⭐⭐ |
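
The dataset-size brackets above can be translated into hours of audio to get a feel for scale. The sketch below assumes uncompressed 16 kHz, 16-bit mono WAV, a common format for speech datasets; compressed formats such as MP3 are several times smaller per hour.

```python
# How many hours of raw audio fit in the dataset-size brackets above.
# Assumption: uncompressed 16 kHz, 16-bit mono WAV.

def wav_gb_per_hour(sample_rate_hz: int = 16_000, bytes_per_sample: int = 2,
                    channels: int = 1) -> float:
    return sample_rate_hz * bytes_per_sample * channels * 3600 / 1e9

gb_per_hour = wav_gb_per_hour()               # ~0.115 GB per hour of audio
print(f"{gb_per_hour:.3f} GB per hour")
print(f"100 GB ≈ {100 / gb_per_hour:,.0f} hours of speech")
print(f"1 TB  ≈ {1000 / gb_per_hour:,.0f} hours of speech")
```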

Speech Recognition Models

| Model Size (param) | VRAM/RAM Needed | Recommended Instance | Type | Comment |
| --- | --- | --- | --- | --- |
| Small (< 100M params) | 24GB/64GB | 1x A6000 48GB | VM | Development/testing |
| Medium (100M-500M) | 48GB/128GB | 2x L40S 48GB | VM | Production training |
| Large (500M-1B) | 96GB/256GB | 1x H100 80GB or 4x L40S 48GB | VM/BM | Large scale training |
| Very Large (>1B) | 160GB/384GB | 8x H100 80GB | BM | Enterprise scale |

Text-to-Speech Models

| Model Size | VRAM/RAM Needed | Recommended Instance | Type | Notes |
| --- | --- | --- | --- | --- |
| Small (< 200M params) | 80GB/192GB | 2x H100 80GB | VM | Basic TTS |
| Medium (200M-500M) | 160GB/384GB | 4x H100 80GB | VM | Multi-speaker |
| Large (500M-1B) | 320GB/768GB | 8x H100 80GB | BM | High-quality TTS |
| Very Large (>1B) | 640GB/1TB | 8x H200 141GB | BM | Enterprise TTS |

Which compute instance to choose for model inference?

1. Large Language Models (LLMs) inference

Whether you're serving chatbots, content generation, or text analysis applications, choosing the right infrastructure is crucial for balancing performance, cost, and user experience.

The requirements for LLM inference vary significantly based on several key factors: model size (from 7B to 70B+ parameters), user load (from individual testing to thousands of concurrent users), and latency requirements (from real-time chat applications to batch processing). Each of these factors directly impacts your choice of infrastructure, from single GPU instances to distributed multi-GPU deployments.
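
A quick way to reason about these requirements is to add the model weights to the KV cache that grows with context length and concurrent sequences. The sketch below uses illustrative Llama-style architecture values (layer count, grouped-query KV heads, head dimension) that are assumptions rather than Sesterce specifications, and it ignores optimizations such as quantization or paged attention.

```python
# Minimal sizing sketch for LLM inference memory: weights + KV cache.
# Architecture defaults are illustrative Llama-style values (assumptions).

def llm_inference_vram_gb(n_params_billion: float,
                          weight_bytes: float = 2.0,      # fp16/bf16 weights
                          n_layers: int = 32,
                          n_kv_heads: int = 8,            # GQA-style model
                          head_dim: int = 128,
                          kv_bytes: float = 2.0,
                          context_len: int = 4096,
                          concurrent_seqs: int = 10) -> float:
    weights = n_params_billion * 1e9 * weight_bytes
    kv_per_token = 2 * n_layers * n_kv_heads * head_dim * kv_bytes   # K and V
    kv_cache = kv_per_token * context_len * concurrent_seqs
    return (weights + kv_cache) / 1e9

print(f"7B, 10 users, 4k context:  ~{llm_inference_vram_gb(7):.0f} GB")
print(f"70B, 50 users, 4k context: ~{llm_inference_vram_gb(70, n_layers=80, concurrent_seqs=50):.0f} GB")
```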

LLM Inference Sizing - Small Scale (1-50 concurrent users)

| Model Size | Concurrent Users | VRAM/RAM needed | Latency Target | Recommended Instance | Type | Estimated RPS* | Comment |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | 1-10 | 16GB/32GB | < 100ms | 1x RTX4090 | VM | 15-20 | Development/testing |
| 7B | 11-25 | 24GB/64GB | < 100ms | 1x L40S | VM | 30-40 | Small production |
| 7B | 26-50 | 48GB/128GB | < 100ms | 1x H100 | VM | 60-80 | Medium production |
| 13B | 1-10 | 24GB/64GB | < 150ms | 1x RTX4090 | VM | 10-15 | Development/testing |
| 13B | 11-25 | 48GB/128GB | < 150ms | 1x H100 | VM | 25-35 | Small production |
| 13B | 26-50 | 80GB/192GB | < 150ms | 2x H100 | VM/BM | 50-70 | Medium production |
| 70B | 1-10 | 80GB/192GB | < 200ms | 2x H100 | VM/BM | 5-8 | Small production |
| 70B | 11-25 | 160GB/384GB | < 200ms | 4x H100 | BM | 15-20 | Medium production |
| 70B | 26-50 | 320GB/768GB | < 200ms | 8x H100 | BM | 35-45 | Large production |
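
The Estimated RPS column can be sanity-checked with Little's law: concurrency ≈ throughput × average time per request. The request times below are assumptions for short chat-style completions, not measured values for any particular offer.

```python
# Little's law sanity check for the RPS columns: RPS ~= concurrent users / avg request time.

def estimated_rps(concurrent_users: int, avg_request_seconds: float) -> float:
    return concurrent_users / avg_request_seconds

print(f"{estimated_rps(10, 0.5):.0f} RPS for 10 users at ~0.5 s per request")
print(f"{estimated_rps(50, 0.7):.0f} RPS for 50 users at ~0.7 s per request")
```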

LLM Inference Sizing - Medium Scale (51-200 concurrent users)

| Model Size | Concurrent Users | VRAM/RAM | Latency Target | Recommended Instance | Type | Estimated RPS* | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | 51-100 | 48GB/128GB | < 100ms | 2x L40S | VM | 100-120 | Production |
| 7B | 101-150 | 80GB/192GB | < 100ms | 2x H100 | VM/BM | 150-180 | High-performance |
| 7B | 151-200 | 160GB/384GB | < 100ms | 4x H100 | BM | 200-240 | Enterprise scale |
| 13B | 51-100 | 160GB/384GB | < 150ms | 4x H100 | BM | 80-100 | Production |
| 13B | 101-150 | 240GB/512GB | < 150ms | 4x H200 | BM | 120-150 | High-performance |
| 13B | 151-200 | 320GB/768GB | < 150ms | 8x H100 | BM | 160-200 | Enterprise scale |
| 70B | 51-100 | 480GB/1TB | < 200ms | 8x H200 | BM | 60-80 | Production |
| 70B | 101-150 | 640GB/1.5TB | < 200ms | 16x H100 | BM | 90-120 | High-performance |
| 70B | 151-200 | 800GB/2TB | < 200ms | 16x H200 | BM | 140-180 | Enterprise scale |

LLM Inference Sizing - Large Scale (201-1000+ concurrent users)

| Model Size | Concurrent Users | VRAM/RAM | Latency Target | Recommended Instance | Type | Estimated RPS* | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | 201-500 | 320GB/768GB | < 100ms | 8x H100 | BM | 300-400 | Enterprise scale |
| 7B | 501-1000 | 640GB/1.5TB | < 100ms | 8x H200 | BM | 600-800 | High-scale production |
| 7B | 1000+ | 1.2TB/2.5TB | < 100ms | 16x H200 | BM | 1000+ | Distributed clusters |
| 13B | 201-500 | 480GB/1TB | < 150ms | 8x H200 | BM | 250-350 | Enterprise scale |
| 13B | 501-1000 | 800GB/2TB | < 150ms | 8x H200 | BM | 500-700 | High-scale production |
| 13B | 1000+ | 1.6TB/3TB | < 150ms | 16x H200 | BM | 800+ | Distributed clusters |
| 70B | 201-500 | 1.2TB/2.5TB | < 200ms | 16x H100 | BM | 200-300 | Enterprise scale |
| 70B | 501-1000 | 2TB/4TB | < 200ms | 24x H200 | BM | 400-600 | High-scale production |
| 70B | 1000+ | 3TB/6TB | < 200ms | 32x H200 | BM | 700+ | Distributed clusters |

2. Image Generation Inference Sizing

Deploying image generation models like Stable Diffusion for production introduces unique infrastructure challenges compared to traditional ML workloads.

The hardware requirements vary significantly based on three key factors: model complexity (from base models to SDXL with refiners), concurrent user load (affecting batch processing and queue management), and image generation parameters (resolution, steps, and additional features like ControlNet or inpainting). Each of these factors directly impacts your choice of infrastructure and can significantly affect both performance and operational costs.
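
For capacity planning, the images-per-minute figures in the tables below can be turned into a rough GPU count. Both numbers in the example are assumptions (per-user request rate and per-GPU throughput for an SDXL-class model); benchmark your actual pipeline, including resolution, step count and batching, before sizing.

```python
# Rough GPU-count estimate for image-generation serving, under assumed request
# rate and per-GPU throughput. Each request is taken to produce one image.
import math

def gpus_needed(concurrent_users: int, requests_per_user_per_min: float,
                images_per_min_per_gpu: float) -> int:
    demand = concurrent_users * requests_per_user_per_min   # images requested per minute
    return math.ceil(demand / images_per_min_per_gpu)

# Example: 100 users at ~2 images/min each, ~15 images/min per GPU (hypothetical SDXL figure)
print(gpus_needed(100, 2, 15), "GPUs")   # -> 14
```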

Small scale (1-50 concurrent users)

| Model Type | Concurrent Users | VRAM/RAM | Latency Target* | Recommended Instance | Type | Images/Minute** | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SD XL Base | 1-10 | 16GB/32GB | < 3s | 1x RTX4090 | VM | 15-20 | Development/testing |
| SD XL Base | 11-25 | 24GB/64GB | < 3s | 1x L40S | VM | 30-40 | Small production |
| SD XL Base | 26-50 | 48GB/128GB | < 3s | 1x H100 | VM | 60-80 | Medium production |
| SD XL + Refiner | 1-10 | 24GB/64GB | < 5s | 1x L40S | VM | 10-15 | Development/testing |
| SD XL + Refiner | 11-25 | 48GB/128GB | < 5s | 1x H100 | VM | 25-35 | Small production |
| SD XL + Refiner | 26-50 | 80GB/192GB | < 5s | 2x H100 | VM or BM | 50-70 | Medium production |

Medium scale (51-200 concurrent users)

| Model Type | Concurrent Users | VRAM/RAM | Latency Target* | Recommended Instance | Type | Images/Minute** | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SD XL Base | 51-100 | 160GB/384GB | < 3s | 4x H100 | VM or BM | 120-150 | Production |
| SD XL Base | 101-150 | 320GB/768GB | < 3s | 8x H100 | BM | 200-250 | High-performance |
| SD XL Base | 151-200 | 480GB/1TB | < 3s | 8x H200 | BM | 300-350 | Enterprise scale |
| SD XL + Refiner | 51-100 | 320GB/768GB | < 5s | 8x H100 | BM | 100-130 | Production |
| SD XL + Refiner | 101-150 | 480GB/1TB | < 5s | 8x H200 | BM | 180-220 | High-performance |
| SD XL + Refiner | 151-200 | 640GB/1.5TB | < 5s | 8x H200 | BM | 250-300 | Enterprise scale |

Large scale (201-1000+ concurrent users)

| Model Type | Concurrent Users | VRAM/RAM | Latency Target* | Recommended Instance | Type | Images/Minute** | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SD XL Base | 201-500 | 800GB/2TB | < 3s | 16x H100 | BM | 400-500 | Multi-cluster |
| SD XL Base | 501-1000 | 1.6TB/4TB | < 3s | 16x H200 | BM | 800-1000 | Distributed system |
| SD XL Base | 1000+ | 2.4TB/6TB | < 3s | 24x H200 | BM | 1500+ | Global distribution |
| SD XL + Refiner | 201-500 | 1.2TB/3TB | < 5s | 24x H100 | BM | 350-450 | Multi-cluster |
| SD XL + Refiner | 501-1000 | 2TB/5TB | < 5s | 32x H200 | BM | 700-900 | Distributed system |
| SD XL + Refiner | 1000+ | 3TB/8TB | < 5s | 40x H200 | BM | 1200+ | Global distribution |

