CLI usage guide

The vLLM CLI provides commands for serving models, running benchmarks, and managing your deployment. All commands follow the pattern vllm [subcommand] [options].

Available commands

View all available commands:

vllm --help

Command	Description
`vllm serve`	Launch OpenAI-compatible API server
`vllm bench`	Run performance benchmarks
`vllm collect-env`	Collect environment information for debugging
`vllm run-batch`	Run offline batch inference

vllm serve

Launch an OpenAI-compatible HTTP API server to serve LLM completions.

Basic usage

vllm serve [model_tag] [options]

Examples:

vllm serve

Model configuration

Model loading
Quantization
Tensor parallelism
Pipeline parallelism

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --dtype auto \
  --max-model-len 4096 \
  --trust-remote-code

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --quantization awq \
  --dtype half

vllm serve meta-llama/Llama-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.95

vllm serve meta-llama/Llama-70B-Instruct \
  --pipeline-parallel-size 2 \
  --tensor-parallel-size 2

Data parallel deployment

Single node, multiple GPUs

Launch with data parallelism on a single node:

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 4 \
  --tensor-parallel-size 2

This uses 8 GPUs total (4 DP ranks × 2 TP size).

Multi-node with internal load balancing

Run on multiple nodes with a single API endpoint:

# Node 0 (head node with IP 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

# Node 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --headless \
  --data-parallel-size 4 \
  --data-parallel-size-local 2 \
  --data-parallel-start-rank 2 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

Multi-node with external load balancing

Run each DP rank as a separate server:

# Rank 0 (IP: 10.99.48.128)
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 2 \
  --data-parallel-rank 0 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

# Rank 1
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --data-parallel-size 2 \
  --data-parallel-rank 1 \
  --data-parallel-address 10.99.48.128 \
  --data-parallel-rpc-port 13345

Performance options

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 256 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.95

Key parameters:

--max-num-batched-tokens - Maximum tokens processed in a single batch
--max-num-seqs - Maximum number of sequences in a batch
--enable-prefix-caching - Enable KV cache reuse for repeated prompts
--enable-chunked-prefill - Split large prompts into chunks
--gpu-memory-utilization - Fraction of GPU memory to use (0.0-1.0)

Chat templates

Specify a custom chat template:

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --chat-template ./templates/custom_chat.jinja

Override content format detection:

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --chat-template-content-format openai

Server options

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key token-abc123 \
  --enable-request-id-headers \
  --enable-offline-docs \
  --uvicorn-log-level info

Server parameters:

--host - Host IP address (default: None)
--port - Port number (default: 8000)
--api-key - API key for authentication
--enable-request-id-headers - Enable X-Request-ID header tracking
--enable-offline-docs - Enable offline API documentation
--uvicorn-log-level - Logging level: critical, error, warning, info, debug, trace

Advanced deployment

# For multi-node deployments - no API server
vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --headless \
  --data-parallel-size 4

Environment variables

Common environment variables for vllm serve:

# CUDA devices
export CUDA_VISIBLE_DEVICES=0,1,2,3

# vLLM settings
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export VLLM_ALLOW_RUNTIME_LORA_UPDATING=1
export VLLM_MAX_AUDIO_CLIP_FILESIZE_MB=25

vllm serve meta-llama/Llama-3.2-1B-Instruct

vllm bench

Run performance benchmarks to measure throughput and latency.

Benchmark types

Throughput
Latency
Serving

vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --dataset-name sonnet \
  --num-prompts 1000

vllm bench latency \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --input-len 128 \
  --output-len 64

vllm bench serve \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --backend vllm \
  --dataset-name random \
  --num-prompts 1000 \
  --request-rate 10

Benchmark options

vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --dataset-name sharegpt \
  --num-prompts 1000 \
  --tensor-parallel-size 2 \
  --enable-prefix-caching \
  --max-num-batched-tokens 8192

Common parameters:

--model - Model to benchmark
--dataset-name - Dataset to use (sonnet, sharegpt, random)
--num-prompts - Number of prompts to process
--input-len - Input sequence length
--output-len - Output sequence length
--request-rate - Requests per second (for serving benchmarks)

vllm collect-env

Collect environment information for debugging and issue reporting:

vllm collect-env

This outputs:

Python version
PyTorch version
CUDA version
vLLM version
GPU information
System details

Example output:

vLLM Version: 0.6.0
Python Version: 3.10.12
PyTorch Version: 2.4.0+cu121
CUDA Version: 12.1
GPU: NVIDIA A100-SXM4-80GB

vllm run-batch

Run offline batch inference from command line:

vllm run-batch \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl

Input file format (JSONL):

{"prompt": "Hello, my name is", "max_tokens": 50}
{"prompt": "The capital of France is", "max_tokens": 50}

Contextual help

Get help for specific command groups:

# View model configuration options
vllm serve --help=ModelConfig

# View frontend server options
vllm serve --help=Frontend

# View all options at once
vllm serve --help=all

Version information

Check vLLM version:

vllm --version

Common workflows

Development server

Quick local server for testing:

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --port 8000 \
  --max-model-len 2048

Production deployment

Production-ready configuration:

vllm serve meta-llama/Llama-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key $API_KEY \
  --tensor-parallel-size 4 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 256 \
  --disable-log-stats

Benchmark comparison

Compare different configurations:

# Baseline
vllm bench throughput --model meta-llama/Llama-3.2-1B-Instruct

# With prefix caching
vllm bench throughput \
  --model meta-llama/Llama-3.2-1B-Instruct \
  --enable-prefix-caching

For complete parameter documentation, visit the configuration reference.

Troubleshooting

If you encounter CUDA out-of-memory errors, try:

vllm serve meta-llama/Llama-3.2-1B-Instruct \
  --gpu-memory-utilization 0.8 \
  --max-model-len 2048

Common issues:

Port already in use: Change the port with --port 8080
Model not found: Ensure HuggingFace credentials are set: huggingface-cli login
GPU memory issues: Reduce --gpu-memory-utilization or --max-model-len
Slow startup: Add --enforce-eager to skip CUDA graph compilation

Examples

Explore full CLI examples:

Server configuration - vllm/entrypoints/cli/serve.py:33
Multi-node setup script
Data parallel deployment

Documentation Index

​Available commands

​vllm serve

​Basic usage

​Model configuration

​Data parallel deployment

​Performance options

​Chat templates

​Server options

​Advanced deployment

​Environment variables

​vllm bench

​Benchmark types

​Benchmark options

​vllm collect-env

​vllm run-batch

​Contextual help

​Version information

​Common workflows

​Troubleshooting

​Examples

Available commands

vllm serve

Basic usage

Model configuration

Data parallel deployment

Performance options

Chat templates

Server options

Advanced deployment

Environment variables

vllm bench

Benchmark types

Benchmark options

vllm collect-env

vllm run-batch

Contextual help

Version information

Common workflows

Troubleshooting

Examples