Tensor parallelism and pipeline parallelism enable vLLM to run models that don’t fit on a single GPU by distributing the model across multiple GPUs and nodes.
Parallelism strategies
vLLM supports three distributed inference strategies:
Tensor parallelism (single-node multi-GPU)
Splits individual weight tensors across multiple GPUs. All GPUs work on the same batch of requests simultaneously.
When to use:
- Model fits on a single node but not a single GPU
- You have multiple GPUs with fast interconnect (NVLink)
- Low latency is critical
Pipeline parallelism (multi-node)
Splits the model into stages by layers, with each stage running on a different GPU or node. Requests flow through stages sequentially.
When to use:
- Model doesn’t fit on a single node
- You have multiple nodes
- GPU count doesn’t evenly divide for tensor parallelism
- GPUs lack NVLink (e.g., L40S)
Expert parallelism (MoE models)
For Mixture-of-Experts models, distributes expert layers separately for better load balancing.
When to use:
- Running MoE models (Mixtral, DeepSeek, etc.)
- You want to optimize expert-level parallelism
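The "When to use" criteria above can be condensed into a small selection helper. This is only a sketch: the function name, its boolean inputs, and the returned labels are hypothetical and are not part of vLLM.

```python
def choose_strategy(fits_on_one_gpu: bool, fits_on_one_node: bool,
                    has_nvlink: bool = True, is_moe: bool = False) -> list:
    """Return the parallelism strategies suggested by the criteria above."""
    if fits_on_one_gpu:
        return []  # no model parallelism needed
    strategies = []
    if fits_on_one_node and has_nvlink:
        strategies.append("tensor")           # fast interconnect inside one node
    elif fits_on_one_node:
        strategies.append("pipeline")         # GPUs lack NVLink
    else:
        strategies += ["tensor", "pipeline"]  # tensor within nodes, pipeline across
    if is_moe:
        strategies.append("expert")
    return strategies


print(choose_strategy(fits_on_one_gpu=False, fits_on_one_node=True))
```

The helper only encodes the rules of thumb listed here; real deployments should also weigh latency targets and available memory.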
Quick start
Single-node (tensor parallelism)
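Offline inference: pass tensor_parallel_size to the LLM constructor.
Multi-node (tensor + pipeline parallelism)
Using multiprocessing (simple setup): launch vLLM on every node, giving each node the total node count, its own rank (0 for the head node), and the head node's IP address.
Choosing parallelism strategy
Use this decision tree: if the model fits on a single GPU, no model parallelism is needed; if it fits on a single node, use tensor parallelism across that node's GPUs; if it needs more than one node, combine tensor parallelism within each node with pipeline parallelism across nodes.
Example configurations
Single node with 4x A100 80GB: set tensor_parallel_size to 4.
Memory and capacity planning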
After configuring parallelism, check GPU memory usage:
- GPU KV cache size: Total tokens that fit in GPU KV cache
- Maximum concurrency: How many requests can run simultaneously
To adjust capacity:
- Add more GPUs to increase tensor_parallel_size
- Add more nodes to increase pipeline_parallel_size
- Reduce gpu_memory_utilization if you have other memory needs
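A rough back-of-the-envelope check can help before adding hardware. The sketch below assumes weights divide evenly across all GPUs and ignores activations and runtime overhead; it also assumes maximum concurrency is simply KV cache tokens divided by tokens per request. Both function names and all numbers are hypothetical.

```python
def weight_bytes_per_gpu(num_params: float, bytes_per_param: int,
                         tensor_parallel_size: int,
                         pipeline_parallel_size: int) -> float:
    """Approximate weight memory per GPU, assuming an even split."""
    total_gpus = tensor_parallel_size * pipeline_parallel_size
    return num_params * bytes_per_param / total_gpus


def max_concurrency(kv_cache_tokens: int, tokens_per_request: int) -> float:
    """How many full-length requests fit in the KV cache at once."""
    return kv_cache_tokens / tokens_per_request


# Hypothetical 70B-parameter model in 16-bit precision on 4 GPUs (TP=4, PP=1).
per_gpu_gb = weight_bytes_per_gpu(70e9, 2, 4, 1) / 1e9
print(f"weights per GPU: {per_gpu_gb:.1f} GB")

# Hypothetical KV cache of 200,000 tokens with 8,192-token requests.
print(f"max concurrency: {max_concurrency(200_000, 8_192):.2f}x")
```

If the per-GPU weight estimate is close to the GPU's memory, little room is left for the KV cache, which is the case where more GPUs or nodes help most.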
Multi-node deployment
vLLM supports two backends for multi-node deployment:
Ray (recommended for production)
Ray provides better fault tolerance, resource management, and scaling.
1. Set up a Ray cluster using containers, beginning on the head node.
vLLM automatically discovers all GPUs in the Ray cluster. Set tensor_parallel_size to the number of GPUs per node and pipeline_parallel_size to the number of nodes.
Multiprocessing (simpler setup)
For quick testing without Ray overhead. Begin on the head node (rank 0).
Network optimization
Tensor parallelism requires fast inter-GPU communication. Optimize for:
InfiniBand (recommended)
For multi-node tensor parallelism, use InfiniBand for best performance. Set NCCL_IB_HCA to select the InfiniBand adapter; the correct NCCL_IB_HCA value depends on your hardware.
GPUDirect RDMA
Enable GPU-to-GPU communication over InfiniBand without CPU involvement. To verify, check the NCCL logs for the transport in use:
- ✅ [send] via NET/IB/GDRDMA - InfiniBand with GPUDirect (good)
- ❌ [send] via NET/Socket - TCP fallback (inefficient for tensor parallelism)
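The two log patterns above can be checked automatically. This is a minimal sketch: the function name and return labels are hypothetical, and the optional device index between `IB/` and `GDRDMA` is an assumption about how real NCCL output may look.

```python
import re

# Accept both "NET/IB/GDRDMA" and a variant with a device index, e.g. "NET/IB/0/GDRDMA".
_GDRDMA = re.compile(r"via NET/IB/(?:\d+/)?GDRDMA")


def classify_nccl_transport(log_text: str) -> str:
    """Classify the inter-node transport seen in NCCL debug output."""
    if "via NET/Socket" in log_text:
        return "tcp-fallback"           # at least one channel fell back to TCP
    if _GDRDMA.search(log_text):
        return "infiniband-gpudirect"   # InfiniBand with GPUDirect RDMA
    return "unknown"


print(classify_nccl_transport("[send] via NET/Socket"))
```

Checking for the TCP fallback first is deliberate: a single slow channel is enough to hurt tensor-parallel throughput.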
Configuration parameters
Tensor parallelism
Number of GPUs to split tensor weights across.
- Typically set to number of GPUs per node
- Requires fast interconnect (NVLink or InfiniBand)
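To make "splitting tensor weights" concrete, here is a toy column-wise split of one weight matrix in plain Python. It is an illustration only, not vLLM's implementation: real tensor parallelism also uses row-wise sharding and collective communication between GPUs, and all names below are hypothetical.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k x n matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, parts):
    """Split matrix w into `parts` equal column shards, one per GPU."""
    n = len(w[0])
    assert n % parts == 0, "column count must divide evenly across GPUs"
    step = n // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]


def tensor_parallel_matmul(x, w, tensor_parallel_size):
    """Each shard computes its slice of the output; slices are concatenated."""
    out = []
    for shard in split_columns(w, tensor_parallel_size):
        out += matmul(x, shard)  # on real hardware the shards run concurrently
    return out


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
# The sharded result matches the unsharded multiplication.
assert tensor_parallel_matmul(x, w, 2) == matmul(x, w)
```

Every shard needs the full input and the partial outputs must be gathered afterwards, which is why the interconnect speed matters so much.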
Pipeline parallelism
Number of pipeline stages (typically number of nodes).
- Model is split into stages by layers
- Requests pass through the stages in order, so each stage can work on a different request at the same time
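Splitting a model "into stages by layers" can be pictured as assigning contiguous layer ranges to stages. The helper below is a hypothetical sketch that spreads any remainder over the first stages; vLLM's actual partitioning may differ.

```python
def partition_layers(num_layers: int, pipeline_parallel_size: int) -> list:
    """Assign contiguous layer ranges to stages, spreading any remainder."""
    base, extra = divmod(num_layers, pipeline_parallel_size)
    stages, start = [], 0
    for stage in range(pipeline_parallel_size):
        size = base + (1 if stage < extra else 0)  # first `extra` stages get one more
        stages.append(range(start, start + size))
        start += size
    return stages


for stage, layers in enumerate(partition_layers(10, 3)):
    print(f"stage {stage}: layers {layers.start}..{layers.stop - 1}")
```

Because the layer count does not have to divide evenly, this kind of split works for GPU counts that tensor parallelism cannot use.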
Distributed backend
Backend for distributed execution.
- "ray": Use Ray (multi-node recommended)
- "mp": Use multiprocessing (single-node default)
- "auto": Automatically choose based on environment
Multi-node configuration
Total number of nodes (for multiprocessing backend).
Rank of current node, starting from 0 (for multiprocessing backend).
IP address of the head node (for multiprocessing backend).
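The three multi-node settings above can be turned into one launch command per node. This is a sketch under assumptions: the flag names `--nnodes`, `--node-rank`, and `--master-addr` are assumed to correspond to these settings, the model name and IP are placeholders, and worker nodes may need further options. Check `vllm serve --help` for your version.

```python
import shlex


def node_commands(model: str, tp: int, pp: int, nnodes: int, head_ip: str) -> list:
    """Build one `vllm serve` command line per node (rank 0 is the head node)."""
    commands = []
    for rank in range(nnodes):
        argv = [
            "vllm", "serve", model,
            "--tensor-parallel-size", str(tp),
            "--pipeline-parallel-size", str(pp),
            "--nnodes", str(nnodes),       # assumed flag: total number of nodes
            "--node-rank", str(rank),      # assumed flag: rank of this node
            "--master-addr", head_ip,      # assumed flag: head node IP
        ]
        commands.append(shlex.join(argv))
    return commands


for cmd in node_commands("your-org/your-model", tp=8, pp=2, nnodes=2, head_ip="10.0.0.1"):
    print(cmd)
```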
Complete examples
Example 1: Single-node tensor parallelism
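A minimal sketch of this case for offline inference, assuming one node with four GPUs. The model name is a placeholder and the helper names are hypothetical; vLLM is imported lazily so that the argument-building part can be inspected on a machine without GPUs.

```python
def single_node_engine_args(model: str, num_gpus: int) -> dict:
    """Keyword arguments for vllm.LLM on one node with `num_gpus` GPUs."""
    return {"model": model, "tensor_parallel_size": num_gpus}


def generate(prompts: list, engine_args: dict) -> list:
    """Run offline inference; requires vLLM and the GPUs to be available."""
    from vllm import LLM, SamplingParams  # lazy import: needs a GPU host

    llm = LLM(**engine_args)
    outputs = llm.generate(prompts, SamplingParams(max_tokens=64))
    return [out.outputs[0].text for out in outputs]


# Placeholder model name; replace with a model you have access to.
args = single_node_engine_args("your-org/your-model", num_gpus=4)
print(args)
```

On a GPU host, `generate(["Hello"], args)` would load the model sharded across the four GPUs.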
Example 2: Multi-node pipeline parallelism
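For a multi-node cluster, the sizing rule from the Ray section (tensor parallelism within a node, pipeline parallelism across nodes) can be sketched as follows. The helper name and the cluster shape are hypothetical; the same values map to the tensor and pipeline parallel size options of the server.

```python
def multi_node_engine_args(model: str, gpus_per_node: int, num_nodes: int) -> dict:
    """Engine arguments for a cluster of identical nodes using the Ray backend."""
    return {
        "model": model,
        "tensor_parallel_size": gpus_per_node,    # shard weights within each node
        "pipeline_parallel_size": num_nodes,      # one pipeline stage per node
        "distributed_executor_backend": "ray",
    }


# Hypothetical cluster: 2 nodes with 8 GPUs each.
args = multi_node_engine_args("your-org/your-model", gpus_per_node=8, num_nodes=2)
total_gpus = args["tensor_parallel_size"] * args["pipeline_parallel_size"]
print(args, total_gpus)
```

The product of the two sizes is the number of GPUs the deployment occupies, so it should match what the cluster actually provides.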
Example 3: Uneven GPU split with pipeline parallelism
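When the GPU count does not suit tensor parallelism alone, one option is to use the largest workable tensor-parallel size and cover the remaining GPUs with pipeline stages. The heuristic below is hypothetical and assumes the tensor-parallel size must divide the model's attention head count evenly.

```python
def split_gpus(num_gpus: int, num_attention_heads: int) -> tuple:
    """Pick (tensor_parallel_size, pipeline_parallel_size) using all GPUs.

    Chooses the largest tensor-parallel size that divides both the GPU count
    and the attention head count; the remaining factor becomes pipeline stages.
    """
    for tp in range(num_gpus, 0, -1):
        if num_gpus % tp == 0 and num_attention_heads % tp == 0:
            return tp, num_gpus // tp
    raise ValueError("num_gpus must be at least 1")


# Hypothetical model with 32 attention heads.
print(split_gpus(6, 32))   # 6 GPUs cannot all be used for tensor parallelism
print(split_gpus(3, 32))   # 3 GPUs fall back to pure pipeline parallelism
```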
Troubleshooting
Out of memory errors
Symptom: CUDA out of memory errors
Solutions:
- Increase tensor_parallel_size or pipeline_parallel_size
- Reduce gpu_memory_utilization (default 0.9)
- Reduce max_model_len
- Enable quantization (FP8, INT8, etc.)
Slow performance with tensor parallelism
Symptom: No speedup or slower than single GPU
Possible causes:
- No NVLink between GPUs
- Network bottleneck in multi-node setup
- Small batch sizes (communication overhead dominates)
Solutions:
- Use pipeline parallelism instead if no NVLink
- Check InfiniBand/GPUDirect RDMA configuration
- Increase batch size or concurrent requests
Communication errors
Symptom: NCCL errors, timeout errors
Solutions:
- Verify all nodes can reach each other via IP
- Check firewall settings
- Ensure consistent vLLM versions across nodes
- Set NCCL_DEBUG=INFO for detailed logs
Different outputs across runs
Symptom: Same prompt produces different outputs
Cause: Batch size variations affecting numerical stability
Solution: Set explicit seed in SamplingParams
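Setting the seed might look like the sketch below. The helper names and the temperature value are hypothetical; vLLM is imported lazily so the argument-building part runs anywhere.

```python
def seeded_sampling_args(seed: int, temperature: float = 0.8) -> dict:
    """Keyword arguments for SamplingParams with an explicit seed."""
    return {"seed": seed, "temperature": temperature}


def make_sampling_params(seed: int):
    """Build a SamplingParams object; requires vLLM to be installed."""
    from vllm import SamplingParams  # lazy import

    return SamplingParams(**seeded_sampling_args(seed))


print(seeded_sampling_args(42))
```

Reusing the same seed for the same prompt makes sampling repeatable; it does not remove the small numerical differences that batching can introduce.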
Best practices
Network security
Set VLLM_HOST_IP to a private network address on each node.
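One way to guard against accidentally exposing a public address is to validate the value before exporting it. This is a sketch with a hypothetical helper name and example addresses; it uses only the Python standard library.

```python
import ipaddress
import os


def set_host_ip(ip: str) -> None:
    """Set VLLM_HOST_IP, refusing addresses outside private ranges."""
    addr = ipaddress.ip_address(ip)
    if not addr.is_private:
        raise ValueError(f"{ip} is not a private network address")
    os.environ["VLLM_HOST_IP"] = ip  # must be set before vLLM starts


set_host_ip("10.0.0.5")  # example private address
print(os.environ["VLLM_HOST_IP"])
```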
Pre-download models
Download models on all nodes before starting vLLM to avoid concurrent download issues.
Container consistency
Use identical container images across all nodes.
Resource allocation
For Ray clusters on Kubernetes, use KubeRay for automated resource management.
Related resources
- Parallelism scaling guide - Detailed strategy guide
- Data parallel deployment - MoE model parallelism
- Distributed troubleshooting - Debug distributed issues
- Source: docs/serving/parallelism_scaling.md
- Source: vllm/distributed/parallel_state.py:1