vLLM supports distributed inference across multiple GPUs and nodes to serve large models efficiently. This guide covers tensor parallelism (TP), pipeline parallelism (PP), and data parallelism (DP) strategies.
Parallelism strategies
| Strategy | Use case | Pros | Cons |
|---|---|---|---|
| Tensor Parallelism (TP) | Large models that don’t fit on a single GPU | Low latency, simple setup | Limited by inter-GPU bandwidth |
| Pipeline Parallelism (PP) | Very large models across nodes | Better multi-node scaling | Higher latency due to pipeline bubbles |
| Data Parallelism (DP) | High throughput serving | Linear throughput scaling | Multiplies memory requirements |
These strategies can be combined. For example: DP=4 × TP=2 uses 8 GPUs total with 4 data parallel replicas, each using 2 GPUs for tensor parallelism.
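The GPU count for a combined layout is the product of the parallel sizes. The shell sketch below works through that arithmetic. `total_gpus` is a hypothetical helper name, not part of vLLM, and treating pipeline parallelism as a third factor is an assumption that extends the DP × TP example above.

```shell
# total_gpus DP TP [PP]: hypothetical helper, not a vLLM command.
# Multiplies the parallel sizes to get the number of GPUs a layout needs.
total_gpus() {
  dp=$1
  tp=$2
  pp=${3:-1}   # pipeline parallel size defaults to 1 when omitted
  echo $(( dp * tp * pp ))
}

total_gpus 4 2      # DP=4 x TP=2, prints 8
total_gpus 1 4 4    # TP=4 x PP=4, prints 16
```

Running a check like this before launching can catch a layout that asks for more GPUs than the cluster has.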
Tensor parallelism
Tensor parallelism splits each layer’s weights across multiple GPUs, typically on the same node.
Single node TP
vllm serve meta-llama/Llama-70B-Instruct \
--tensor-parallel-size 4
This distributes the 70B model across 4 GPUs on a single node.
How it works:
- Each layer’s weight matrices are split across GPUs
- All GPUs process the same batch simultaneously
- GPUs communicate via NVLink/PCIe for synchronization
- Single endpoint serves all requests
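To get a feel for what splitting the weight matrices buys, the following sketch estimates weight memory per GPU. It assumes 16-bit weights (2 bytes per parameter) and a perfectly even split, and it ignores KV cache, activations, and runtime overhead. `weights_gb_per_gpu` is a hypothetical helper, not a vLLM tool.

```shell
# weights_gb_per_gpu PARAMS_BILLIONS BYTES_PER_PARAM TP_SIZE
# Hypothetical helper: approximate weight memory per GPU in GB (10^9 bytes),
# assuming weights are split evenly across the tensor-parallel group.
# Uses integer division, so the result is rounded down.
weights_gb_per_gpu() {
  params_b=$1
  bytes_per_param=$2
  tp=$3
  echo $(( params_b * bytes_per_param / tp ))
}

weights_gb_per_gpu 70 2 4   # 70B parameters, 2 bytes each, TP=4: prints 35
```

Under these assumptions a 70B model needs roughly 35 GB of weights per GPU at TP=4. Whatever memory remains on each GPU is what is available for the KV cache.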
Multi-node TP
For very large models requiring more than 8 GPUs:
vllm serve meta-llama/Llama-405B-Instruct \
--tensor-parallel-size 16 \
--pipeline-parallel-size 1
Multi-node TP requires high-bandwidth networking (InfiniBand recommended). For standard Ethernet, prefer pipeline parallelism instead.
Configuration tips
Memory optimization
vllm serve meta-llama/Llama-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--max-model-len 8192
With quantization
vllm serve meta-llama/Llama-70B-Instruct \
--tensor-parallel-size 2 \
--quantization awq \
--dtype half
Performance tuning
vllm serve meta-llama/Llama-70B-Instruct \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--max-num-batched-tokens 16384
Pipeline parallelism
Pipeline parallelism splits model layers sequentially across GPUs or nodes.
Basic PP setup
vllm serve meta-llama/Llama-70B-Instruct \
--pipeline-parallel-size 4 \
--tensor-parallel-size 1
How it works:
- Model layers are divided into 4 stages
- Each stage runs on a separate GPU
- Requests flow through stages sequentially
- Good for multi-node deployments with standard networking
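The division into stages can be illustrated with a small sketch. It assumes an 80-layer model (a hypothetical count chosen for illustration) and an even split. The partition vLLM actually chooses may differ, and `stage_ranges` is a hypothetical helper name.

```shell
# stage_ranges NUM_LAYERS NUM_STAGES
# Hypothetical helper: prints the layer range each pipeline stage would hold
# under an even split. Assumes NUM_LAYERS is divisible by NUM_STAGES.
stage_ranges() {
  layers=$1
  stages=$2
  per_stage=$(( layers / stages ))
  stage=0
  while [ "$stage" -lt "$stages" ]; do
    first=$(( stage * per_stage ))
    last=$(( first + per_stage - 1 ))
    echo "stage $stage: layers $first-$last"
    stage=$(( stage + 1 ))
  done
}

stage_ranges 80 4
```

With 80 layers and 4 stages, each stage holds 20 consecutive layers. A request's activations cross a GPU boundary only three times per forward pass, which is why this layout tolerates slower interconnects.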
Combined TP + PP
For maximum flexibility, combine both strategies:
vllm serve meta-llama/Llama-405B-Instruct \
--tensor-parallel-size 4 \
--pipeline-parallel-size 4
This uses 16 GPUs total:
- 4 pipeline stages
- Each stage uses 4 GPUs with tensor parallelism
With TP + PP, the total GPU count = tensor_parallel_size × pipeline_parallel_size
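One way to pick the two sizes, assumed here rather than prescribed by this guide, is to keep tensor parallelism inside a node and use one pipeline stage per node. The sketch below prints the matching flags. `layout_flags` is a hypothetical helper, and it only echoes text; it does not launch anything.

```shell
# layout_flags NUM_NODES GPUS_PER_NODE
# Hypothetical helper: prints the two flags for a layout with TP inside each
# node and one pipeline stage per node.
layout_flags() {
  nodes=$1
  gpus_per_node=$2
  echo "--tensor-parallel-size $gpus_per_node --pipeline-parallel-size $nodes"
}

layout_flags 4 8   # 4 nodes with 8 GPUs each, 32 GPUs in total
```

This keeps the bandwidth-heavy tensor-parallel traffic on intra-node links and sends only stage-to-stage activations over the network.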
Data parallelism
Data parallelism replicates the model across multiple GPUs/nodes to process different requests in parallel.
Internal load balancing
Single endpoint with automatic load balancing:
Single node DP
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 4
Creates 4 independent model replicas on 4 GPUs.
DP + TP combined
vllm serve meta-llama/Llama-70B-Instruct \
--data-parallel-size 4 \
--tensor-parallel-size 2
Uses 8 GPUs: 4 replicas, each using 2 GPUs with TP.
Multi-node DP
Run on head node (IP: 10.99.48.128):
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
Run on worker node:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
Creates 4 replicas across 2 nodes (2 per node).
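The start rank passed on each node follows from the node's position and the local replica count. A minimal sketch, assuming every node runs the same number of local replicas and nodes are numbered from 0 (`dp_start_rank` is a hypothetical helper, not a vLLM command):

```shell
# dp_start_rank NODE_INDEX LOCAL_SIZE
# Hypothetical helper: first global DP rank hosted on a node, assuming every
# node runs LOCAL_SIZE replicas and nodes are numbered from 0.
dp_start_rank() {
  node_index=$1
  local_size=$2
  echo $(( node_index * local_size ))
}

dp_start_rank 0 2   # head node, prints 0
dp_start_rank 1 2   # worker node, prints 2
```

The second result matches the `--data-parallel-start-rank 2` value used on the worker node above.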
External load balancing
For production deployments with external load balancers:
# Rank 0
CUDA_VISIBLE_DEVICES=0 vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 2 \
--data-parallel-rank 0 \
--port 8000
# Rank 1
CUDA_VISIBLE_DEVICES=1 vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 2 \
--data-parallel-rank 1 \
--port 8001
Each rank exposes its own HTTP endpoint. Use an external load balancer (nginx, Kubernetes Ingress, etc.) to distribute requests.
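As an illustration of the nginx option, the sketch below generates an `upstream` block with one entry per rank. It assumes ranks listen on consecutive ports on one host, as in the two-rank example above. The upstream name `vllm_dp` and the helper `nginx_upstream` are hypothetical.

```shell
# nginx_upstream HOST BASE_PORT NUM_RANKS
# Hypothetical helper: prints an nginx "upstream" block with one server entry
# per DP rank, assuming ranks listen on consecutive ports from BASE_PORT.
nginx_upstream() {
  host=$1
  base_port=$2
  ranks=$3
  echo "upstream vllm_dp {"
  rank=0
  while [ "$rank" -lt "$ranks" ]; do
    echo "    server $host:$(( base_port + rank ));"
    rank=$(( rank + 1 ))
  done
  echo "}"
}

nginx_upstream 127.0.0.1 8000 2
```

The generated block can be referenced from a `location` with `proxy_pass http://vllm_dp;`. nginx distributes requests round-robin across the listed servers by default.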
Multi-node external LB:
# Rank 0 (Node 0 with IP 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 2 \
--data-parallel-rank 0 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
# Rank 1 (Node 1)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 2 \
--data-parallel-rank 1 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
Hybrid load balancing
Combine internal and external load balancing:
# Node 0
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-hybrid-lb \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 0 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
# Node 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-hybrid-lb \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
Each node has its own API endpoint(s) that load-balance across local DP ranks only.
Ray Data backend
Use Ray for automatic resource management:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 4 \
--data-parallel-backend=ray
Benefits:
- Single launch command for multi-node deployments
- Automatic resource allocation
- No need to specify addresses/ports manually
- Built-in fault tolerance
Set VLLM_RAY_DP_PACK_STRATEGY="span" when a single replica requires multiple nodes.
Scaling API servers
For high-throughput deployments, scale out API server processes:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 8 \
--api-server-count 4
This creates:
- 8 data parallel engine processes
- 4 API server processes (all on head node)
- Single HTTP endpoint with load balancing
How it works:
            ┌─────────────┐
            │   Client    │
            └──────┬──────┘
                   │
                   ▼
          ┌─────────────────┐
          │  Load Balancer  │  (Single endpoint)
          └────────┬────────┘
                   │
    ┌─────────┬────┴────┬─────────┐
    ▼         ▼         ▼         ▼
  ┌────┐    ┌────┐    ┌────┐    ┌────┐
  │API │    │API │    │API │    │API │  (4 API servers)
  │ 0  │    │ 1  │    │ 2  │    │ 3  │
  └─┬──┘    └─┬──┘    └─┬──┘    └─┬──┘
    │         │         │         │
    └─────────┼─────────┼─────────┘
              │         │
    ┌─────────┼─────────┼─────────┬─────────┐
    ▼         ▼         ▼         ▼         ▼
  ┌────┐    ┌────┐    ┌────┐    ┌────┐    ┌────┐
  │DP 0│    │DP 1│    │DP 2│    │DP 3│    │... │  (8 engines)
  └────┘    └────┘    └────┘    └────┘    └────┘
MoE models and expert parallelism
For Mixture-of-Experts models like DeepSeek, use expert parallelism:
vllm serve deepseek-ai/DeepSeek-V3 \
--data-parallel-size 4 \
--tensor-parallel-size 2 \
--enable-expert-parallel
Without --enable-expert-parallel:
- Expert layers use tensor parallelism across DP × TP GPUs
- All DP ranks must synchronize on every forward pass
With --enable-expert-parallel:
Complete deployment examples
Small model (7B), high throughput
vllm serve meta-llama/Llama-3.2-7B-Instruct \
--data-parallel-size 4 \
--enable-prefix-caching \
--max-num-seqs 256 \
--gpu-memory-utilization 0.95
Large model (70B), single node
vllm serve meta-llama/Llama-70B-Instruct \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.95 \
--max-model-len 8192 \
--enable-prefix-caching
Large model (70B), high throughput
vllm serve meta-llama/Llama-70B-Instruct \
--data-parallel-size 4 \
--tensor-parallel-size 4 \
--api-server-count 4 \
--enable-prefix-caching \
--max-num-batched-tokens 16384
Uses 16 GPUs: 4 replicas × 4 GPUs each
Very large model (405B), multi-node
vllm serve meta-llama/Llama-405B-Instruct \
--tensor-parallel-size 8 \
--pipeline-parallel-size 4 \
--gpu-memory-utilization 0.95 \
--max-model-len 4096
Uses 32 GPUs: 4 pipeline stages × 8 GPUs per stage
Latency, throughput, and memory
Minimize latency:
- Use tensor parallelism over pipeline parallelism
- Keep TP within single node (NVLink bandwidth)
- Reduce --max-num-seqs for lower queue time
vllm serve meta-llama/Llama-70B-Instruct \
--tensor-parallel-size 8 \
--max-num-seqs 64
Maximize throughput:
- Use data parallelism for horizontal scaling
- Enable prefix caching for repeated prompts
- Increase batch size limits
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 8 \
--enable-prefix-caching \
--max-num-seqs 512 \
--max-num-batched-tokens 32768
Optimize memory:
- Use quantization (AWQ, GPTQ)
- Reduce max sequence length
- Lower GPU memory utilization if OOM
vllm serve meta-llama/Llama-70B-Instruct \
--tensor-parallel-size 4 \
--quantization awq \
--max-model-len 4096 \
--gpu-memory-utilization 0.85
Monitoring and debugging
Check GPU utilization
Enable stats logging
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 4
Stats are logged by default. Check the logs for throughput metrics:
INFO: Avg prompt throughput: 1234.5 tokens/s
INFO: Avg generation throughput: 567.8 tokens/s
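To track these numbers over time, the values can be pulled out of the log with a small filter. This sketch assumes the `Avg <kind> throughput: <number> tokens/s` shape shown above, with one metric per line. Real log lines may carry extra fields or combine metrics on one line, in which case the pattern needs adjusting. `throughput_values` is a hypothetical helper.

```shell
# throughput_values: hypothetical filter that reads log text on stdin and
# prints "<kind> <tokens/s>" for each line containing a throughput metric.
# Lines that do not match the assumed shape are dropped.
throughput_values() {
  sed -n 's/.*Avg \([a-z]*\) throughput: \([0-9.]*\) tokens\/s.*/\1 \2/p'
}

printf '%s\n' \
  'INFO: Avg prompt throughput: 1234.5 tokens/s' \
  'INFO: Avg generation throughput: 567.8 tokens/s' | throughput_values
```

For the two sample lines this prints `prompt 1234.5` and `generation 567.8`. Piping a saved server log through the same filter yields a simple time series.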
Disable stats logging
For production, reduce log verbosity:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--disable-log-stats
Troubleshooting
Common issues:
- NCCL timeout errors: Increase timeout with NCCL_TIMEOUT=1800
- Out of memory: Reduce --gpu-memory-utilization or --max-model-len
- Slow multi-node: Check network bandwidth, consider pipeline parallelism
- Uneven load: Use external load balancer with health checks
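For the health-check suggestion, a quick manual probe of each rank can help narrow down which endpoint is misbehaving. The sketch below assumes `curl` is installed and that each rank serves a `/health` route on its port. Both are assumptions not stated in this guide, and `check_ranks` is a hypothetical helper.

```shell
# check_ranks PORT...: hypothetical helper that probes each local endpoint
# and reports whether it answered with a successful HTTP status.
# Assumes curl is available and each rank serves /health on its port.
check_ranks() {
  for port in "$@"; do
    if curl -sf -o /dev/null --max-time 2 "http://localhost:$port/health"; then
      echo "port $port: healthy"
    else
      echo "port $port: unreachable"
    fi
  done
}
```

Calling `check_ranks 8000 8001` against the two-rank external load balancing setup prints one status line per port. An external load balancer would typically run an equivalent probe on a timer and stop routing to ranks that fail it.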
Environment variables
# NCCL settings for multi-GPU
export NCCL_TIMEOUT=1800
export NCCL_DEBUG=INFO
# Worker process method
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Ray settings
export VLLM_RAY_DP_PACK_STRATEGY="span"
vllm serve meta-llama/Llama-3.2-1B-Instruct --data-parallel-size 4