Skip to main content

Documentation Index

Fetch the complete documentation index at: https://mintlify.com/vllm-project/vllm/llms.txt

Use this file to discover all available pages before exploring further.

The vLLM CLI provides commands for serving models, running benchmarks, and managing your deployment. All commands follow the pattern vllm [subcommand] [options].

Available commands

View all available commands:
vllm --help
CommandDescription
vllm serveLaunch OpenAI-compatible API server
vllm benchRun performance benchmarks
vllm collect-envCollect environment information for debugging
vllm run-batchRun offline batch inference

vllm serve

Launch an OpenAI-compatible HTTP API server to serve LLM completions.

Basic usage

vllm serve [model_tag] [options]
Examples:
vllm serve

Model configuration

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --trust-remote-code

Data parallel deployment

1

Single node, multiple GPUs

Launch with data parallelism on a single node:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 4 \
  --tensor-parallel-size 2
This uses 8 GPUs total (4 DP ranks × 2 TP size).
2

Multi-node with internal load balancing

Run on multiple nodes with a single API endpoint:
# Node 0 (head node with IP 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

# Node 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --headless \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345
3

Multi-node with external load balancing

Run each DP rank as a separate server:
# Rank 0 (IP: 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 2 \
  --data-parallel-rank 0 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

# Rank 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 2 \
  --data-parallel-rank 1 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

Performance options

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95
Key parameters:
  • --max-num-batched-tokens - Maximum tokens processed in a single batch
  • --max-num-seqs - Maximum number of sequences in a batch
  • --enable-prefix-caching - Enable KV cache reuse for repeated prompts
  • --enable-chunked-prefill - Split large prompts into chunks
  • --gpu-memory-utilization - Fraction of GPU memory to use (0.0-1.0)

Chat templates

Specify a custom chat template:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --chat-template ./templates/custom_chat.jinja
Override content format detection:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --chat-template-content-format openai

Server options

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key token-abc123 \
  --enable-request-id-headers \
  --enable-offline-docs \
  --uvicorn-log-level info
Server parameters:
  • --host - Host IP address (default: None)
  • --port - Port number (default: 8000)
  • --api-key - API key for authentication
  • --enable-request-id-headers - Enable X-Request-ID header tracking
  • --enable-offline-docs - Enable offline API documentation
  • --uvicorn-log-level - Logging level: critical, error, warning, info, debug, trace

Advanced deployment

# For multi-node deployments - no API server
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --headless \
  --data-parallel-size 4

Environment variables

Common environment variables for vllm serve:
# CUDA devices
export CUDA_VISIBLE_DEVICES=0,1,2,3

# vLLM settings
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
export VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=25

vllm serve meta-llama/Llama-3.2-1B-Instruct

vllm bench

Run performance benchmarks to measure throughput and latency.

Benchmark types

vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --dataset-name sonnet \
  --num-prompts 1000

Benchmark options

vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --max-num-batched-tokens 8192
Common parameters:
  • --model - Model to benchmark
  • --dataset-name - Dataset to use (sonnet, sharegpt, random)
  • --num-prompts - Number of prompts to process
  • --input-len - Input sequence length
  • --output-len - Output sequence length
  • --request-rate - Requests per second (for serving benchmarks)

vllm collect-env

Collect environment information for debugging and issue reporting:
vllm collect-env
This outputs:
  • Python version
  • PyTorch version
  • CUDA version
  • vLLM version
  • GPU information
  • System details
Example output:
vLLM Version: 0.6.0
Python Version: 3.10.12
PyTorch Version: 2.4.0+cu121
CUDA Version: 12.1
GPU: NVIDIA A100-SXM4-80GB

vllm run-batch

Run offline batch inference from command line:
vllm run-batch \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl
Input file format (JSONL):
{"prompt": "Hello, my name is", "max_tokens": 50}
{"prompt": "The capital of France is", "max_tokens": 50}

Contextual help

Get help for specific command groups:
# View model configuration options
vllm serve --help=ModelConfig

# View frontend server options
vllm serve --help=Frontend

# View all options at once
vllm serve --help=all

Version information

Check vLLM version:
vllm --version

Common workflows

1

Development server

Quick local server for testing:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --port 8000 \
  --max-model-len 2048
2

Production deployment

Production-ready configuration:
vllm serve meta-llama/Llama-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key $API_KEY \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --disable-log-stats
3

Benchmark comparison

Compare different configurations:
# Baseline
vllm bench throughput --model meta-llama/Llama-3.2-1B-Instruct

# With prefix caching
vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --enable-prefix-caching
For complete parameter documentation, visit the configuration reference.

Troubleshooting

If you encounter CUDA out-of-memory errors, try:
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --gpu-memory-utilization 0.8 \
  --max-model-len 2048
Common issues:
  1. Port already in use: Change the port with --port 8080
  2. Model not found: Ensure HuggingFace credentials are set: huggingface-cli login
  3. GPU memory issues: Reduce --gpu-memory-utilization or --max-model-len
  4. Slow startup: Add --enforce-eager to skip CUDA graph compilation

Examples

Explore full CLI examples: