vLLM API#

The vLLM APIs provide an interface to deploy and interact with large language models locally using the vLLM inference engine. vLLM offers high-throughput inference with optimized serving.

Overview#

The LLM4AD platform provides two vLLM-related classes:

  1. LocalVLLMAPI: High-level API for deploying multiple LLMs across multiple GPUs

  2. VLLMManager: Utility class for managing vLLM model deployments

The vLLM backend provides: - High-throughput inference using PagedAttention - Multi-GPU support with tensor parallelism - HTTP API server for inference requests

Setup Requirements#

Before using vLLM APIs, ensure you have the following:

  1. Python Packages: Install the required Python packages:

    pip install vllm flask flask-cors transformers requests
    
  2. Hardware Requirements: - NVIDIA GPU with CUDA 11.8+ or 12.1+ - Sufficient GPU memory for the model (varies by model) - At least 16GB system RAM recommended

  3. Model Files: Download your model (e.g., from Hugging Face):

    # Example: Download Llama model
    # huggingface-cli download meta-llama/Llama-3.2-1B-Instruct
    

API Reference#

Class Definition: VLLMManager#

class VLLMManager#

A manager class for deploying and controlling vLLM model instances.

Methods#

__init__(self)#

Initialize the VLLM manager.

Returns:

A new VLLMManager instance.

Return type:

VLLMManager

deploy_models(self, model_path: str, tknz_path: str, gpus: List[int], ports: List[int], gpu_mem_utils: float | List[float] = None)#

Deploy vLLM models on specified GPUs.

Parameters:
  • model_path (str) – Path to the pretrained model directory.

  • tknz_path (str) – Path to the tokenizer directory.

  • gpus (List[int]) – List of GPU indices to deploy models on.

  • ports (List[int]) – List of HTTP ports for each model instance. Must have the same length as gpus.

  • gpu_mem_utils (float | List[float], optional) – GPU memory utilization ratio. Can be a single float (applied to all GPUs) or a list of floats. Defaults to 0.85 (85% memory usage).

Raises:

ValueError – If len(gpus) != len(ports)

Note

Each GPU will run a separate vLLM instance with its own HTTP server. The instances can handle inference requests independently.

release_resources(self)#

Terminate all vLLM model processes and release GPU resources.

This method: - Terminates all child processes - Clears CUDA cache - Frees GPU memory

Note

This is automatically called when the LocalVLLMAPI object is deleted.

release_resources_(self)#

Legacy method for resource release. Use release_resources() instead.

Class Definition: LocalVLLMAPI#

class LocalVLLMAPI#

A concrete implementation of the LLM base class that uses vLLM for local inference.

Constructor#

__init__(self, model_path: str, tknz_path: str, gpus: List[int], ports: List[int], **kwargs)#

Initialize and deploy vLLM models on local GPUs.

Parameters:
  • model_path (str) – Path to the pretrained model directory.

  • tknz_path (str) – Path to the tokenizer directory.

  • gpus (List[int]) – List of GPU indices to deploy models on. Each GPU will run one model instance.

  • ports (List[int]) – List of HTTP ports for each model instance. Must correspond to gpus one-to-one.

  • kwargs – Additional keyword arguments (passed to parent LLM class).

Note

This constructor automatically: 1. Creates vLLM model instances on each specified GPU 2. Starts HTTP servers on each specified port 3. Creates a queue for load balancing across instances

Warning

Ensure no other service is using the specified ports.

Methods#

draw_sample(self, prompt: str | Any, *args, **kwargs) str#

Generate a response from the vLLM model based on the provided prompt.

Parameters:
  • prompt (str | Any) – The input prompt. Can be either: - A string containing the user message - A list of message dictionaries with ‘role’ and ‘content’ keys

  • args – Additional positional arguments (unused).

  • kwargs – Additional keyword arguments. Currently supports: - temperature: Sampling temperature (default: 1.0) - top_p: Nucleus sampling (default: 1.0) - max_new_tokens: Maximum tokens to generate (default: 4096)

Returns:

The generated text content from the LLM response.

Return type:

str

Note

The method automatically performs load balancing across multiple GPU instances. If one instance fails, it automatically retries with another instance.

_do_request(self, content: str, url: str) str#

Internal method to make HTTP request to vLLM server.

Parameters:
  • content (str) – The prompt content.

  • url (str) – The HTTP endpoint URL.

Returns:

The generated text.

Return type:

str

Raises:

Exception – If the request fails.

close(self)#

Manually release resources. Called automatically when the object is deleted.

__del__(self)#

Destructor that calls close() to ensure resources are released.

Examples#

Example 1: Basic Single-GPU Deployment#

from llm4ad.tools.llm.local_vllm import LocalVLLMAPI

# Deploy model on single GPU
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0],
    ports=[22001]
)

# Generate response
response = sampler.draw_sample("What is the meaning of life?")
print(response)

# Resources are automatically released when object is deleted
del sampler

Example 2: Multi-GPU Deployment for Load Balancing#

from llm4ad.tools.llm.local_vllm import LocalVLLMAPI

# Deploy on multiple GPUs for higher throughput
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0, 1, 2, 3],        # Use 4 GPUs
    ports=[22001, 22002, 22003, 22004]  # Different port for each
)

# The sampler automatically balances requests across GPUs
for i in range(10):
    response = sampler.draw_sample(f"Generate response number {i}")
    print(f"Response {i}: {response[:50]}...")

Example 3: Customizing GPU Memory Utilization#

from llm4ad.tools.llm.local_vllm import LocalVLLMAPI

# Adjust memory utilization for each GPU
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0, 1],
    ports=[22001, 22002],
    gpu_mem_utils=[0.7, 0.9]  # Different utilization per GPU
)

# Or use a single value for all GPUs
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0],
    ports=[22001],
    gpu_mem_utils=0.5  # Use only 50% of GPU memory
)

Example 4: Integration with LLM4AD Platform#

from llm4ad.tools.llm.local_vllm import LocalVLLMAPI
import llm4ad

# Create vLLM sampler
sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0],
    ports=[22001]
)

# Use with LLM4AD method
task = llm4ad.tasks.optimization.SymbolicRegression(
    dimension=5,
    num_samples=100,
    eval_budget=1000
)

method = llm4ad.methods.eoh.EoH(
    task=task,
    sampler=sampler,
    num_iterations=50
)

result = method.run()

# Explicitly release resources
sampler.close()

Example 5: Manual Resource Management#

from llm4ad.tools.llm.local_vllm import LocalVLLMAPI

sampler = LocalVLLMAPI(
    model_path="Llama-3.2-1B-Instruct",
    tknz_path="Llama-3.2-1B-Instruct",
    gpus=[0],
    ports=[22001]
)

try:
    # Use the sampler
    response = sampler.draw_sample("Hello, world!")
    print(response)
finally:
    # Always release resources explicitly
    sampler.close()

ed Models#

The vLLM backend supports various models including:

  • LLaMA Series: LLaMA 2, LLaMA 3, LLaMA 3.2

  • Qwen Series: Qwen 2, Qwen 3

  • Mistral: Mistral 7B

  • Phi: Phi-3

Make sure to use models that have been converted/pulled for vLLM format.

Common Issues and Troubleshooting#

  1. CUDA Out of Memory: Reduce gpu_mem_utils or use a smaller model.

  2. Port Already in Use: Change to a different port number.

  3. Model Loading Fails: Ensure model path is correct and model is compatible with vLLM.

  4. Slow Inference: Use tensor parallelism for larger models or optimize GPU settings.

  5. HTTP Connection Error: Check firewall settings and ensure ports are accessible.

  6. Resources Not Released: Always call close() or use context manager pattern.

See Also#