LM Lite
Multi-Model Batching Runtime
High-performance inference engine optimized for multi-model deployments. Integrated into Conduit.
Overview
LM Lite is Conduit's native inference runtime, designed for efficient multi-model execution on shared GPU infrastructure. It handles batch processing, replica management, and health-check orchestration.
Key differentiator: Run multiple models on single GPUs with minimal overhead—dramatically lower memory footprint than vLLM for small-to-medium model deployments.
Architecture
Batch Processing
LM Lite collects concurrent requests and processes them as batches:
from conduit import ModelConfig

ModelConfig(
    "Qwen/Qwen3-4B",                      # HuggingFace model identifier (model_id)
    max_model_concurrency=50,             # Max requests per batch
    model_batch_execute_timeout_ms=1000,  # Execute the batch after this timeout, even if not full
)
Modes:
- Single request: Lowest latency per request. Best for interactive applications.
- Batch mode: Higher throughput, slightly higher latency. Best for pipeline workloads.
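Together, max_model_concurrency and model_batch_execute_timeout_ms define a simple flush rule: a batch runs as soon as it is full, or as soon as the oldest queued request has waited out the timeout. A minimal sketch of that rule, purely illustrative and not LM Lite's internal code:

import time

def should_flush(batch_size: int, first_enqueued_at: float,
                 max_concurrency: int = 50, timeout_ms: int = 1000) -> bool:
    """Illustrative flush rule: run the batch once it is full, or once the
    oldest queued request has waited timeout_ms, whichever comes first."""
    full = batch_size >= max_concurrency
    waited_ms = (time.monotonic() - first_enqueued_at) * 1000.0
    return batch_size > 0 and (full or waited_ms >= timeout_ms)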
Replica Management
LM Lite distributes load across replicas using round-robin scheduling:
from conduit.runtime import LMLiteBlock

LMLiteBlock(
    models=[...],
    replicas=2,  # Requests distributed across replicas
)
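Round-robin here simply means each new request goes to the next replica in a fixed rotation. A tiny illustration of that distribution (conceptual only; the actual scheduling happens inside the runtime):

from itertools import cycle

replicas = ["replica-0", "replica-1"]   # replicas=2, as configured above
rotation = cycle(replicas)

for request_id in range(5):
    print(f"request {request_id} -> {next(rotation)}")
# request 0 -> replica-0, request 1 -> replica-1, request 2 -> replica-0, ...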
Health-Check Readiness
Traffic is routed only to healthy nodes:

import time

while not block.ready:
    time.sleep(5)

# Container health check passed; safe to send traffic
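For production startup scripts you may want to bound the wait instead of looping forever. The sketch below uses the same block.ready property; wait_until_ready is a helper defined here, not part of the conduit API:

import time

def wait_until_ready(block, timeout_s: float = 600.0, poll_s: float = 5.0) -> None:
    """Poll block.ready until the container health check passes or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while not block.ready:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"LM Lite block not ready after {timeout_s:.0f}s")
        time.sleep(poll_s)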
Memory Efficiency
The vLLM Overhead Problem
Traditional runtimes like vLLM carry ~15GB base overhead regardless of model size:
| Runtime | 1B Model | 7B Model | Base Overhead |
|---|---|---|---|
| vLLM | ~17GB | ~29GB | ~15GB |
| LM Lite | ~2GB | ~14GB | ~1-2GB |
For small models, vLLM uses 7-8x more memory than necessary: ~17GB versus LM Lite's ~2GB for a 1B model.
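As a back-of-envelope check using the numbers in the table above (actual capacity depends on context length and KV-cache sizing): on a 24GB GPU, vLLM's ~17GB footprint leaves room for a single 1B model, while LM Lite's ~2GB-per-model cost on top of a shared ~2GB runtime leaves room for several.

GPU_GB = 24
VLLM_1B_FOOTPRINT = 17      # ~15GB base overhead + small model, per instance
LMLITE_BASE = 2             # shared runtime overhead
LMLITE_PER_1B_MODEL = 2     # approximate cost of each additional 1B model

print(GPU_GB // VLLM_1B_FOOTPRINT)                    # 1 instance fits
print((GPU_GB - LMLITE_BASE) // LMLITE_PER_1B_MODEL)  # ~11 models fit, in theory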
Multi-Model Deployments
LM Lite runs multiple models simultaneously on single GPUs:
GPU memory budget on a 24GB card:
- Model A (7B): ~14GB
- Model B (1B): ~2GB
- Model C (1B): ~2GB
- LM Lite overhead: ~2GB
- Total: ~20GB ✓
vLLM requires a separate instance, each with its own base overhead, per model. LM Lite shares a single runtime and GPU across all of them.
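A configuration matching the budget above might look like the following. Model identifiers, context lengths, and concurrency limits here are illustrative choices for a roughly 7B + 1B + 1B layout, not recommendations:

from conduit import ModelConfig
from conduit.conduit_types import GPUS
from conduit.runtime import LMLiteBlock

block = LMLiteBlock(
    models=[
        ModelConfig(                        # Model A: ~7B, ~14GB
            "Qwen/Qwen2.5-7B-Instruct",
            max_model_len=2048,
            max_model_concurrency=25,
            model_batch_execute_timeout_ms=1000,
        ),
        ModelConfig(                        # Model B: ~1B, ~2GB
            "meta-llama/Llama-3.2-1B-Instruct",
            max_model_len=2048,
            max_model_concurrency=50,
            model_batch_execute_timeout_ms=500,
        ),
        ModelConfig(                        # Model C: ~1B, ~2GB
            "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            max_model_len=2048,
            max_model_concurrency=50,
            model_batch_execute_timeout_ms=500,
        ),
    ],
    gpu=GPUS.NVIDIA_L4,  # 24GB card
    replicas=1,
)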
Configuration
from conduit import ModelConfig
from conduit.conduit_types import GPUS
from conduit.runtime import LMLiteBlock
block = LMLiteBlock(
    models=[
        ModelConfig(
            "Qwen/Qwen3-4B",
            max_model_len=1400,
            max_model_concurrency=50,
            model_batch_execute_timeout_ms=1000,
        ),
        ModelConfig(
            "microsoft/phi-3-mini",
            max_model_len=2048,
            max_model_concurrency=25,
            model_batch_execute_timeout_ms=500,
        ),
    ],
    gpu=GPUS.NVIDIA_L4,
    replicas=2,
)
Parameters
| Parameter | Type | Description |
|---|---|---|
| model_id | str | HuggingFace model identifier |
| max_model_len | int | Maximum tokens per request (input + output) |
| max_model_concurrency | int | Maximum concurrent requests per batch |
| model_batch_execute_timeout_ms | int | Milliseconds before batch executes, even if not full |
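The last two parameters trade latency against throughput: a low timeout with modest concurrency flushes small batches quickly, while a high concurrency ceiling with a longer timeout packs fuller batches. Two illustrative profiles (values are examples only, not recommendations):

from conduit import ModelConfig

# Interactive profile: small batches, flushed almost immediately
interactive = ModelConfig(
    "microsoft/phi-3-mini",
    max_model_len=2048,
    max_model_concurrency=8,
    model_batch_execute_timeout_ms=50,
)

# Pipeline profile: large batches, willing to wait for them to fill
pipeline = ModelConfig(
    "microsoft/phi-3-mini",
    max_model_len=2048,
    max_model_concurrency=64,
    model_batch_execute_timeout_ms=2000,
)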
Lifecycle
import time

from conduit.runtime import LMLiteBlock

# Initialize
block = LMLiteBlock(models=[...], replicas=2)

# Clean up stale instances
LMLiteBlock.gc()

# Wait for readiness
while not block.ready:
    time.sleep(5)

# Execute inference
result = block(model_id="Qwen/Qwen3-4B", input=..., output=...)

# Cleanup
block.delete()
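To guarantee cleanup even when inference raises, the same lifecycle can be wrapped in try/finally (a sketch built only from the calls shown above):

import time

from conduit.runtime import LMLiteBlock

block = LMLiteBlock(models=[...], replicas=2)
LMLiteBlock.gc()                        # clean up stale instances, as above

try:
    while not block.ready:              # wait for the health check to pass
        time.sleep(5)
    result = block(model_id="Qwen/Qwen3-4B", input=..., output=...)
finally:
    block.delete()                      # always release the block's resources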
GPU Support
Current: NVIDIA CUDA GPUs
Roadmap: AMD ROCm support planned
Supported GPU Types
from conduit.conduit_types import GPUS
GPUS.NVIDIA_L4
GPUS.NVIDIA_A10
GPUS.NVIDIA_A100
# ... additional types
Specifications
| Spec | Value |
|---|---|
| Base overhead | 1-2GB |
| Model formats | HuggingFace Transformers |
| Optimal model size | 1B-13B parameters |
| Multi-model | Yes, shared GPU |
| Streaming | Supported |
| Batching | Configurable |

