Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/vllm-project/vllm/llms.txt

Use this file to discover all available pages before exploring further.

Overview

This guide covers best practices, architectural patterns, and operational considerations for running vLLM in production at scale.

Architecture patterns

Single-instance deployment

Simplest deployment for low-to-medium traffic:
┌─────────────┐
│   Client    │
└──────┬──────┘

       v
┌─────────────┐
│  vLLM Pod   │
│  (1x GPU)   │
└─────────────┘
Use when:
  • QPS < 10
  • Single model serving
  • Development/testing environments

Load-balanced deployment

Multiple replicas behind a load balancer:
┌─────────────┐
│   Client    │
└──────┬──────┘

       v
┌─────────────────┐
│ Load Balancer   │
│  (Nginx/K8s)    │
└────────┬────────┘

    ┌────┴────┬────────┬────────┐
    v         v        v        v
┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐
│ vLLM  │ │ vLLM  │ │ vLLM  │ │ vLLM  │
│ Pod 1 │ │ Pod 2 │ │ Pod 3 │ │ Pod N │
└───────┘ └───────┘ └───────┘ └───────┘
Use when:
  • QPS > 10
  • High availability required
  • Horizontal scaling needed

Multi-model deployment

Serve multiple models with routing:
┌─────────────┐
│   Client    │
└──────┬──────┘

       v
┌─────────────────┐
│  Model Router   │
└────────┬────────┘

    ┌────┴─────┬──────────┐
    v          v          v
┌────────┐ ┌────────┐ ┌────────┐
│ Model  │ │ Model  │ │ Model  │
│  7B    │ │  13B   │ │  70B   │
└────────┘ └────────┘ └────────┘
Use when:
  • Multiple models needed
  • Different performance tiers
  • Cost optimization

Load balancing

Nginx configuration

1

Create Nginx configuration

upstream vllm_backend {
    least_conn;  # Use least connections algorithm
    server vllm0:8000 max_fails=3 fail_timeout=30s;
    server vllm1:8000 max_fails=3 fail_timeout=30s;
    server vllm2:8000 max_fails=3 fail_timeout=30s;
    server vllm3:8000 max_fails=3 fail_timeout=30s;
    
    keepalive 32;  # Connection pooling
}

server {
    listen 80;
    
    location / {
        proxy_pass http://vllm_backend;
        proxy_http_version 1.1;
        proxy_set_header Connection "";
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        
        # Timeouts for long-running requests
        proxy_connect_timeout 300s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }
    
    location /health {
        proxy_pass http://vllm_backend/health;
        proxy_http_version 1.1;
    }
}
2

Deploy with Docker Compose

version: '3.8'

services:
  nginx:
    image: nginx:latest
    ports:
      - "8000:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - vllm0
      - vllm1
      - vllm2
      - vllm3
    networks:
      - vllm-network

  vllm0:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=0
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm1:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=1
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm2:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=2
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

  vllm3:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=3
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    shm_size: 10gb
    ipc: host
    command: --model meta-llama/Meta-Llama-3-8B-Instruct
    networks:
      - vllm-network

networks:
  vllm-network:
    driver: bridge

Kubernetes Service with session affinity

Enable prefix caching by routing requests to the same pod:
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
  - port: 80
    targetPort: 8000
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 3600  # 1 hour
Session affinity improves cache hit rates for prefix caching, reducing latency and cost.

Performance optimization

Model configuration

Optimal vLLM settings for production:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --disable-log-requests \
  --trust-remote-code
Key parameters:
ParameterRecommendedPurpose
gpu-memory-utilization0.85-0.90Leave headroom for fragmentation
max-model-lenModel-specificReduce for higher throughput
max-num-seqs128-256Balance latency vs throughput
enable-prefix-cachingtrueCache common prompts
enable-chunked-prefilltrueReduce TTFT for long prompts
disable-log-requeststrueReduce logging overhead

Quantization

Reduce memory usage and increase throughput:
vllm serve TheBloke/Llama-2-70B-AWQ \
  --quantization awq \
  --tensor-parallel-size 4
Quantization comparison:
MethodMemory SavingsQualitySpeed
FP16 (baseline)0%100%1.0x
FP850%98-99%1.5-2.0x
AWQ/GPTQ75%95-98%1.2-1.5x

Multi-GPU tensor parallelism

For large models, split across multiple GPUs:
# 70B model on 4x A100 GPUs
vllm serve meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-model-len 8192
Tensor parallelism requires high-bandwidth interconnects (NVLink, InfiniBand). Use on single-node multi-GPU systems.

Monitoring and observability

Prometheus metrics

vLLM exposes Prometheus metrics at /metrics:
apiVersion: v1
kind: Service
metadata:
  name: vllm-metrics
  labels:
    app: vllm
spec:
  ports:
  - name: metrics
    port: 8000
    targetPort: 8000
  selector:
    app: vllm
---
apiVersion: v1
kind: ServiceMonitor
metadata:
  name: vllm-monitor
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
  - port: metrics
    path: /metrics
Key metrics to monitor:
  • vllm:num_requests_running - Active requests
  • vllm:num_requests_waiting - Queued requests
  • vllm:gpu_cache_usage_perc - GPU memory utilization
  • vllm:avg_generation_throughput_toks_per_s - Throughput
  • vllm:time_to_first_token_seconds - TTFT latency
  • vllm:time_per_output_token_seconds - Generation latency

Grafana dashboard

Example Grafana queries:
# Request rate
rate(vllm:request_success_total[5m])

# Average TTFT
rate(vllm:time_to_first_token_seconds_sum[5m]) / rate(vllm:time_to_first_token_seconds_count[5m])

# P95 generation latency
histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))

# GPU utilization
vllm:gpu_cache_usage_perc

OpenTelemetry tracing

Enable distributed tracing:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --otlp-traces-endpoint http://jaeger:4318/v1/traces

Health checks and probes

Kubernetes probes

livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 300
  periodSeconds: 30
  timeoutSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

startupProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 60  # 10 minutes for large models
Set failureThreshold high enough for large models to load. A 70B model can take 5-10 minutes to initialize.

Autoscaling

Horizontal Pod Autoscaler (HPA)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_running
      target:
        type: AverageValue
        averageValue: "50"  # Scale when >50 concurrent requests per pod
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # Wait 5 min before scaling down
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0  # Scale up immediately
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

SkyPilot autoscaling

service:
  replica_policy:
    min_replicas: 2
    max_replicas: 10
    target_qps_per_replica: 5  # Scale when QPS > 5 per replica
    upscale_delay_seconds: 60
    downscale_delay_seconds: 300

Security best practices

API authentication

Use API keys for authentication:
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --api-key your-secret-key
Client usage:
import openai

client = openai.OpenAI(
    base_url="http://vllm:8000/v1",
    api_key="your-secret-key"
)

Network policies

Restrict pod-to-pod communication:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: vllm-network-policy
spec:
  podSelector:
    matchLabels:
      app: vllm
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          role: api-gateway
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - podSelector:
        matchLabels:
          role: model-storage

Secrets management

Use Kubernetes secrets or cloud secret managers:
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: hf_xxxxxxxxxxxxx
---
env:
- name: HF_TOKEN
  valueFrom:
    secretKeyRef:
      name: hf-token
      key: token

Disaster recovery

Model checkpointing

Store models in persistent storage:
volumes:
- name: model-cache
  persistentVolumeClaim:
    claimName: vllm-models
volumeMounts:
- name: model-cache
  mountPath: /root/.cache/huggingface

Multi-region deployment

Deploy across multiple regions for high availability:
┌──────────────┐     ┌──────────────┐
│  Region 1    │     │  Region 2    │
│  (Primary)   │     │  (Failover)  │
├──────────────┤     ├──────────────┤
│ vLLM Cluster │     │ vLLM Cluster │
│  (3 pods)    │     │  (3 pods)    │
└──────────────┘     └──────────────┘
        │                    │
        └────────┬───────────┘
                 v
         ┌──────────────┐
         │ Global Load  │
         │  Balancer    │
         └──────────────┘

Cost optimization

1

Right-size GPU allocation

Match GPU to model size:
  • 7B models: T4 (16GB) or L4 (24GB)
  • 13B models: L4 (24GB) or A10G (24GB)
  • 70B models: A100 40GB x2 or A100 80GB x1
2

Use quantization

Reduce GPU requirements with AWQ/GPTQ/FP8 quantization.
3

Enable autoscaling

Scale to zero during off-peak hours.
4

Batch requests

Use continuous batching to maximize throughput.
5

Enable prefix caching

Cache common system prompts to reduce compute.

Troubleshooting

High latency

Symptoms: Slow response times Solutions:
  1. Check GPU utilization with nvidia-smi
  2. Reduce max-model-len to free memory
  3. Enable chunked prefill
  4. Add more replicas
  5. Enable quantization

OOM errors

Symptoms: CUDA out of memory Solutions:
  1. Reduce gpu-memory-utilization to 0.85
  2. Reduce max-num-seqs
  3. Reduce max-model-len
  4. Enable quantization
  5. Use tensor parallelism

Request timeouts

Symptoms: 504 Gateway Timeout Solutions:
  1. Increase proxy timeouts in Nginx/K8s
  2. Increase readinessProbe timeout
  3. Check for deadlocked requests with metrics
  4. Review max-num-batched-tokens

Checklist

Before going to production:
  • Load testing completed (target QPS)
  • Monitoring and alerting configured
  • Health checks validated
  • Autoscaling tested
  • Disaster recovery plan documented
  • Security review completed
  • Cost analysis performed
  • SLO/SLA defined
  • Rollback procedure documented
  • On-call rotation established

Next steps