Documentation Index Fetch the complete documentation index at: https://mintlify.com/vllm-project/vllm/llms.txt
Use this file to discover all available pages before exploring further.
The vLLM CLI provides commands for serving models, running benchmarks, and managing your deployment. All commands follow the pattern vllm [subcommand] [options].
Available commands
View all available commands:
Command Description vllm serveLaunch OpenAI-compatible API server vllm benchRun performance benchmarks vllm collect-envCollect environment information for debugging vllm run-batchRun offline batch inference
vllm serve
Launch an OpenAI-compatible HTTP API server to serve LLM completions.
Basic usage
vllm serve [model_tag] [options]
Examples:
Default model
Custom model
With API key
Custom port and host
Model configuration
Model loading
Quantization
Tensor parallelism
Pipeline parallelism
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--dtype auto \
--max-model-len 4096 \
--trust-remote-code
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--quantization awq \
--dtype half
vllm serve meta-llama/Llama-70B-Instruct \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.95
vllm serve meta-llama/Llama-70B-Instruct \
--pipeline-parallel-size 2 \
--tensor-parallel-size 2
Data parallel deployment
Single node, multiple GPUs
Launch with data parallelism on a single node: vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 4 \
--tensor-parallel-size 2
This uses 8 GPUs total (4 DP ranks × 2 TP size).
Multi-node with internal load balancing
Run on multiple nodes with a single API endpoint: # Node 0 (head node with IP 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
# Node 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--headless \
--data-parallel-size 4 \
--data-parallel-size-local 2 \
--data-parallel-start-rank 2 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
Multi-node with external load balancing
Run each DP rank as a separate server: # Rank 0 (IP: 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 2 \
--data-parallel-rank 0 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
# Rank 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--data-parallel-size 2 \
--data-parallel-rank 1 \
--data-parallel-address 10.99.48.128 \
--data-parallel-rpc-port 13345
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--max-num-batched-tokens 8192 \
--max-num-seqs 256 \
--enable-prefix-caching \
--enable-chunked-prefill \
--gpu-memory-utilization 0.95
Key parameters:
--max-num-batched-tokens - Maximum tokens processed in a single batch
--max-num-seqs - Maximum number of sequences in a batch
--enable-prefix-caching - Enable KV cache reuse for repeated prompts
--enable-chunked-prefill - Split large prompts into chunks
--gpu-memory-utilization - Fraction of GPU memory to use (0.0-1.0)
Chat templates
Specify a custom chat template:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--chat-template ./templates/custom_chat.jinja
Override content format detection:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--chat-template-content-format openai
Server options
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key token-abc123 \
--enable-request-id-headers \
--enable-offline-docs \
--uvicorn-log-level info
Server parameters:
--host - Host IP address (default: None)
--port - Port number (default: 8000)
--api-key - API key for authentication
--enable-request-id-headers - Enable X-Request-ID header tracking
--enable-offline-docs - Enable offline API documentation
--uvicorn-log-level - Logging level: critical, error, warning, info, debug, trace
Advanced deployment
Headless mode
Multiple API servers
Config file
# For multi-node deployments - no API server
vllm serve meta-llama/Llama-3.2-1B-Instruct \
--headless \
--data-parallel-size 4
Environment variables
Common environment variables for vllm serve:
# CUDA devices
export CUDA_VISIBLE_DEVICES = 0 , 1 , 2 , 3
# vLLM settings
export VLLM_WORKER_MULTIPROC_METHOD = spawn
export VLLM_ALLOW_RUNTIME_LORA_UPDATING = 1
export VLLM_MAX_AUDIO_CLIP_FILESIZE_MB = 25
vllm serve meta-llama/Llama-3.2-1B-Instruct
vllm bench
Run performance benchmarks to measure throughput and latency.
Benchmark types
Throughput
Latency
Serving
vllm bench throughput \
--model meta-llama/Llama-3.2-1B-Instruct \
--dataset-name sonnet \
--num-prompts 1000
vllm bench latency \
--model meta-llama/Llama-3.2-1B-Instruct \
--input-len 128 \
--output-len 64
vllm bench serve \
--model meta-llama/Llama-3.2-1B-Instruct \
--backend vllm \
--dataset-name random \
--num-prompts 1000 \
--request-rate 10
Benchmark options
vllm bench throughput \
--model meta-llama/Llama-3.2-1B-Instruct \
--dataset-name sharegpt \
--num-prompts 1000 \
--tensor-parallel-size 2 \
--enable-prefix-caching \
--max-num-batched-tokens 8192
Common parameters:
--model - Model to benchmark
--dataset-name - Dataset to use (sonnet, sharegpt, random)
--num-prompts - Number of prompts to process
--input-len - Input sequence length
--output-len - Output sequence length
--request-rate - Requests per second (for serving benchmarks)
vllm collect-env
Collect environment information for debugging and issue reporting:
This outputs:
Python version
PyTorch version
CUDA version
vLLM version
GPU information
System details
Example output:
vLLM Version: 0.6.0
Python Version: 3.10.12
PyTorch Version: 2.4.0+cu121
CUDA Version: 12.1
GPU: NVIDIA A100-SXM4-80GB
vllm run-batch
Run offline batch inference from command line:
vllm run-batch \
--model meta-llama/Llama-3.2-1B-Instruct \
--input-file prompts.jsonl \
--output-file results.jsonl
Input file format (JSONL):
{ "prompt" : "Hello, my name is" , "max_tokens" : 50 }
{ "prompt" : "The capital of France is" , "max_tokens" : 50 }
Contextual help
Get help for specific command groups:
# View model configuration options
vllm serve --help=ModelConfig
# View frontend server options
vllm serve --help=Frontend
# View all options at once
vllm serve --help=all
Check vLLM version:
Common workflows
Development server
Quick local server for testing: vllm serve meta-llama/Llama-3.2-1B-Instruct \
--port 8000 \
--max-model-len 2048
Production deployment
Production-ready configuration: vllm serve meta-llama/Llama-70B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--api-key $API_KEY \
--tensor-parallel-size 4 \
--enable-prefix-caching \
--gpu-memory-utilization 0.95 \
--max-num-seqs 256 \
--disable-log-stats
Benchmark comparison
Compare different configurations: # Baseline
vllm bench throughput --model meta-llama/Llama-3.2-1B-Instruct
# With prefix caching
vllm bench throughput \
--model meta-llama/Llama-3.2-1B-Instruct \
--enable-prefix-caching
Troubleshooting
If you encounter CUDA out-of-memory errors, try: vllm serve meta-llama/Llama-3.2-1B-Instruct \
--gpu-memory-utilization 0.8 \
--max-model-len 2048
Common issues:
Port already in use : Change the port with --port 8080
Model not found : Ensure HuggingFace credentials are set: huggingface-cli login
GPU memory issues : Reduce --gpu-memory-utilization or --max-model-len
Slow startup : Add --enforce-eager to skip CUDA graph compilation
Examples
Explore full CLI examples: