vLLM API#
The vLLM APIs provide an interface to deploy and interact with large language models locally using the vLLM inference engine. vLLM offers high-throughput inference with optimized serving.
Overview#
The LLM4AD platform provides two vLLM-related classes:
LocalVLLMAPI: High-level API for deploying multiple LLMs across multiple GPUs
VLLMManager: Utility class for managing vLLM model deployments
The vLLM backend provides:
- High-throughput inference using PagedAttention
- Multi-GPU support with tensor parallelism
- HTTP API server for inference requests
Setup Requirements#
Before using vLLM APIs, ensure you have the following:
Python Packages: Install the required Python packages:
pip install vllm flask flask-cors transformers requests
Hardware Requirements:
- NVIDIA GPU with CUDA 11.8+ or 12.1+
- Sufficient GPU memory for the model (varies by model)
- At least 16GB system RAM recommended
Model Files: Download your model (e.g., from Hugging Face):
# Example: Download Llama model
# huggingface-cli download meta-llama/Llama-3.2-1B-Instruct
API Reference#
Class Definition: VLLMManager#
- class VLLMManager#
A manager class for deploying and controlling vLLM model instances.
Methods#
- __init__(self)#
Initialize the VLLM manager.
- Returns:
A new VLLMManager instance.
- Return type:
VLLMManager
- deploy_models(self, model_path: str, tknz_path: str, gpus: List[int], ports: List[int], gpu_mem_utils: float | List[float] = None)#
Deploy vLLM models on specified GPUs.
- Parameters:
model_path (str) – Path to the pretrained model directory.
tknz_path (str) – Path to the tokenizer directory.
gpus (List[int]) – List of GPU indices to deploy models on.
ports (List[int]) – List of HTTP ports for each model instance. Must have the same length as gpus.
gpu_mem_utils (float | List[float], optional) – GPU memory utilization ratio. Can be a single float (applied to all GPUs) or a list of floats. Defaults to 0.85 (85% memory usage).
- Raises:
ValueError – If len(gpus) != len(ports).
Note
Each GPU will run a separate vLLM instance with its own HTTP server. The instances can handle inference requests independently.
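The argument rules described above (matching lengths for gpus and ports, and a scalar or per-GPU gpu_mem_utils) can be illustrated with a small sketch. The helper name normalize_deploy_args is hypothetical and is not part of VLLMManager. The length check on a list-valued gpu_mem_utils is an assumption, not documented behavior.

```python
from typing import List, Optional, Tuple, Union

# Default ratio taken from the parameter description above.
DEFAULT_GPU_MEM_UTIL = 0.85


def normalize_deploy_args(
    gpus: List[int],
    ports: List[int],
    gpu_mem_utils: Optional[Union[float, List[float]]] = None,
) -> List[Tuple[int, int, float]]:
    """Hypothetical helper: pair each GPU with its port and memory ratio."""
    if len(gpus) != len(ports):
        raise ValueError(f"len(gpus)={len(gpus)} != len(ports)={len(ports)}")

    if gpu_mem_utils is None:
        # No value given: fall back to the documented default.
        utils = [DEFAULT_GPU_MEM_UTIL] * len(gpus)
    elif isinstance(gpu_mem_utils, (int, float)):
        # A single number is applied to every GPU.
        utils = [float(gpu_mem_utils)] * len(gpus)
    else:
        utils = [float(u) for u in gpu_mem_utils]
        # Assumption: a list must provide one ratio per GPU.
        if len(utils) != len(gpus):
            raise ValueError("gpu_mem_utils list must match len(gpus)")

    return list(zip(gpus, ports, utils))


plan = normalize_deploy_args([0, 1], [22001, 22002], gpu_mem_utils=0.5)
for gpu, port, util in plan:
    print(f"GPU {gpu} -> port {port}, memory ratio {util}")
```

Running a check like this before deployment surfaces a mismatched argument list before any model process is started.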
- release_resources(self)#
Terminate all vLLM model processes and release GPU resources.
This method:
- Terminates all child processes
- Clears CUDA cache
- Frees GPU memory
Note
This is automatically called when the LocalVLLMAPI object is deleted.
- release_resources_(self)#
Legacy method for resource release. Use release_resources() instead.
Class Definition: LocalVLLMAPI#
- class LocalVLLMAPI#
A concrete implementation of the LLM base class that uses vLLM for local inference.
Constructor#
- __init__(self, model_path: str, tknz_path: str, gpus: List[int], ports: List[int], **kwargs)#
Initialize and deploy vLLM models on local GPUs.
- Parameters:
model_path (str) – Path to the pretrained model directory.
tknz_path (str) – Path to the tokenizer directory.
gpus (List[int]) – List of GPU indices to deploy models on. Each GPU will run one model instance.
ports (List[int]) – List of HTTP ports for each model instance. Must correspond to gpus one-to-one.
kwargs – Additional keyword arguments (passed to parent LLM class).
Note
This constructor automatically:
1. Creates vLLM model instances on each specified GPU
2. Starts HTTP servers on each specified port
3. Creates a queue for load balancing across instances
Warning
Ensure no other service is using the specified ports.
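One way to act on this warning is to probe each port before constructing the sampler. The sketch below uses only the standard library. The helper name port_is_free and the choice of 127.0.0.1 as the host are assumptions, since the address the vLLM servers bind to is not stated here.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Binding failed, typically because the port is already in use.
            return False
    return True


# Report the state of the ports used in the examples on this page.
for candidate in [22001, 22002]:
    state = "free" if port_is_free(candidate) else "in use"
    print(f"port {candidate}: {state}")
```

The check is only a snapshot: another process can still claim the port between the probe and the actual deployment.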
Methods#
- draw_sample(self, prompt: str | Any, *args, **kwargs) -> str#
Generate a response from the vLLM model based on the provided prompt.
- Parameters:
prompt (str | Any) – The input prompt. Can be either:
- A string containing the user message
- A list of message dictionaries with ‘role’ and ‘content’ keys
args – Additional positional arguments (unused).
kwargs – Additional keyword arguments. Currently supports:
- temperature: Sampling temperature (default: 1.0)
- top_p: Nucleus sampling (default: 1.0)
- max_new_tokens: Maximum tokens to generate (default: 4096)
- Returns:
The generated text content from the LLM response.
- Return type:
str
Note
The method automatically performs load balancing across multiple GPU instances. If one instance fails, it automatically retries with another instance.
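The load balancing and retry behavior described in this note can be pictured with a queue of instance URLs. The following is a minimal sketch of that general pattern, not the actual LocalVLLMAPI implementation. The function request_with_failover, the send callable, and the example URLs are all hypothetical.

```python
import queue
from typing import Callable, Optional


def request_with_failover(
    urls: queue.Queue,
    send: Callable[[str], str],
    max_attempts: int,
) -> str:
    """Try instances in queue order until one of them answers."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        url = urls.get()  # take the next available instance
        try:
            return send(url)
        except Exception as exc:
            last_error = exc  # remember the failure and try the next instance
        finally:
            urls.put(url)  # always return the instance to the back of the queue
    raise RuntimeError("all attempts failed") from last_error


def fake_send(url: str) -> str:
    """Stand-in for an HTTP call: the first instance is simulated as down."""
    if url.endswith(":22001"):
        raise ConnectionError(f"{url} is unreachable")
    return f"response from {url}"


instances = queue.Queue()
for url in ["http://127.0.0.1:22001", "http://127.0.0.1:22002"]:
    instances.put(url)

print(request_with_failover(instances, fake_send, max_attempts=2))
```

Because each URL goes back to the end of the queue after use, consecutive calls rotate through the instances, which spreads requests across GPUs.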
- _do_request(self, content: str, url: str) -> str#
Internal method to make HTTP request to vLLM server.
- Parameters:
content (str) – The prompt content.
url (str) – The HTTP endpoint URL.
- Returns:
The generated text.
- Return type:
str
- Raises:
Exception – If the request fails.
- close(self)#
Manually release resources. Called automatically when the object is deleted.
- __del__(self)#
Destructor that calls close() to ensure resources are released.
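Since close() may be called both manually and from the destructor, it helps if repeated calls are harmless. The sketch below shows one common way to arrange that. ManagedResource is a hypothetical stand-in and does not reproduce LocalVLLMAPI internals.

```python
class ManagedResource:
    """Hypothetical stand-in for an object that owns external processes."""

    def __init__(self) -> None:
        self._closed = False
        self.release_calls = 0  # counts how often the real cleanup ran

    def close(self) -> None:
        # Guard so that a second call (for example from __del__) does nothing.
        if self._closed:
            return
        self._closed = True
        self.release_calls += 1  # real code would terminate processes here

    def __del__(self) -> None:
        # Destructors should not raise; swallow errors during teardown.
        try:
            self.close()
        except Exception:
            pass


resource = ManagedResource()
resource.close()
resource.close()  # safe: the cleanup body runs only once
print(resource.release_calls)
```

Python does not guarantee when, or at interpreter exit whether, __del__ runs, so an explicit close() remains the more dependable way to free GPU resources.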
Examples#
Example 1: Basic Single-GPU Deployment#
from llm4ad.tools.llm.local_vllm import LocalVLLMAPI
# Deploy model on single GPU
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0],
    ports=[22001]
)
# Generate response
response = sampler.draw_sample("What is the meaning of life?")
print(response)
# Resources are automatically released when object is deleted
del sampler
Example 2: Multi-GPU Deployment for Load Balancing#
from llm4ad.tools.llm.local_vllm import LocalVLLMAPI
# Deploy on multiple GPUs for higher throughput
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0, 1, 2, 3],  # Use 4 GPUs
    ports=[22001, 22002, 22003, 22004]  # Different port for each
)
# The sampler automatically balances requests across GPUs
for i in range(10):
    response = sampler.draw_sample(f"Generate response number {i}")
    print(f"Response {i}: {response[:50]}...")
Example 3: Customizing GPU Memory Utilization#
from llm4ad.tools.llm.local_vllm import LocalVLLMAPI
# Adjust memory utilization for each GPU
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0, 1],
    ports=[22001, 22002],
    gpu_mem_utils=[0.7, 0.9]  # Different utilization per GPU
)
# Or use a single value for all GPUs
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0],
    ports=[22001],
    gpu_mem_utils=0.5  # Use only 50% of GPU memory
)
Example 4: Integration with LLM4AD Platform#
from llm4ad.tools.llm.local_vllm import LocalVLLMAPI
import llm4ad
# Create vLLM sampler
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0],
    ports=[22001]
)
# Use with LLM4AD method
task = llm4ad.tasks.optimization.SymbolicRegression(
    dimension=5,
    num_samples=100,
    eval_budget=1000
)
method = llm4ad.methods.eoh.EoH(
    task=task,
    sampler=sampler,
    num_iterations=50
)
result = method.run()
# Explicitly release resources
sampler.close()
Example 5: Manual Resource Management#
from llm4ad.tools.llm.local_vllm import LocalVLLMAPI
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0],
    ports=[22001]
)
try:
    # Use the sampler
    response = sampler.draw_sample("Hello, world!")
    print(response)
finally:
    # Always release resources explicitly
    sampler.close()
Supported Models#
The vLLM backend supports various models including:
LLaMA Series: LLaMA 2, LLaMA 3, LLaMA 3.2
Qwen Series: Qwen 2, Qwen 3
Mistral: Mistral 7B
Phi: Phi-3
Make sure the model files you use are in a format that vLLM can load.
Common Issues and Troubleshooting#
CUDA Out of Memory: Reduce gpu_mem_utils or use a smaller model.
Port Already in Use: Change to a different port number.
Model Loading Fails: Ensure model path is correct and model is compatible with vLLM.
Slow Inference: Use tensor parallelism for larger models or optimize GPU settings.
HTTP Connection Error: Check firewall settings and ensure ports are accessible.
Resources Not Released: Always call close() or use a context manager pattern.
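For the last item, one way to get a context manager around any object that exposes close() is the standard library's contextlib.closing. The sketch below uses a hypothetical FakeSampler so it runs without a GPU. Applying the same wrapper to LocalVLLMAPI assumes only the close() method documented above.

```python
from contextlib import closing


class FakeSampler:
    """Hypothetical stand-in that mimics draw_sample() and close()."""

    def __init__(self) -> None:
        self.closed = False

    def draw_sample(self, prompt: str) -> str:
        return f"echo: {prompt}"

    def close(self) -> None:
        self.closed = True


sampler = FakeSampler()
try:
    # closing() calls sampler.close() on exit, even if the body raises.
    with closing(sampler) as active:
        print(active.draw_sample("Hello, world!"))
        raise RuntimeError("simulated failure during sampling")
except RuntimeError:
    pass

print(sampler.closed)
```

This is equivalent to the try/finally form in Example 5, with the cleanup step handled by the with statement.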
See Also#
OpenAI API - For OpenAI API integration
Ollama API - For local Ollama deployments
HTTPS API - For custom HTTPS API implementations