LM Lite
Multi-Model Batching Runtime
High-performance inference engine optimized for multi-model deployments. Integrated into Conduit.
Overview
LM Lite is Conduit's native inference runtime, designed for efficient multi-model execution on shared GPU infrastructure. It handles batch processing, replica management, and health-check orchestration.
Key differentiator: Run multiple models on single GPUs with minimal overhead—dramatically lower memory footprint than vLLM for small-to-medium model deployments.
Architecture
Batch Processing
LM Lite collects concurrent requests and processes them as batches:
from conduit import ModelConfig

ModelConfig(
    "Qwen/Qwen3-4B",                      # HuggingFace model identifier (model_id)
    max_model_concurrency=50,             # Max requests per batch
    model_batch_execute_timeout_ms=1000,  # Execute the batch after this timeout, even if not full
)
Modes:
- Single request: Lowest latency per request. Best for interactive applications.
- Batch mode: Higher throughput, slightly higher latency. Best for pipeline workloads.
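Together, max_model_concurrency and model_batch_execute_timeout_ms define a simple flush rule: a batch runs as soon as it is full, or as soon as the oldest queued request has waited out the timeout. A minimal sketch of that rule, purely illustrative and not LM Lite's internal code:

import time

def should_flush(batch_size: int, first_enqueued_at: float,
                 max_concurrency: int = 50, timeout_ms: int = 1000) -> bool:
    """Illustrative flush rule: run the batch once it is full, or once the
    oldest queued request has waited timeout_ms, whichever comes first."""
    full = batch_size >= max_concurrency
    waited_ms = (time.monotonic() - first_enqueued_at) * 1000.0
    return batch_size > 0 and (full or waited_ms >= timeout_ms)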
Replica Management
LM Lite distributes load across replicas using round-robin scheduling:
from conduit.runtime import LMLiteBlock

LMLiteBlock(
    models=[...],
    replicas=2,  # Requests distributed across replicas
)
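Round-robin here simply means each new request goes to the next replica in a fixed rotation. A tiny illustration of that distribution (conceptual only; the actual scheduling happens inside the runtime):

from itertools import cycle

replicas = ["replica-0", "replica-1"]   # replicas=2, as configured above
rotation = cycle(replicas)

for request_id in range(5):
    print(f"request {request_id} -> {next(rotation)}")
# request 0 -> replica-0, request 1 -> replica-1, request 2 -> replica-0, ...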
Health-Check Readiness
Traffic is routed only to healthy nodes:

import time

while not block.ready:
    time.sleep(5)

# Container health check passed; safe to send traffic
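For production startup scripts you may want to bound the wait instead of looping forever. The sketch below uses the same block.ready property; wait_until_ready is a helper defined here, not part of the conduit API:

import time

def wait_until_ready(block, timeout_s: float = 600.0, poll_s: float = 5.0) -> None:
    """Poll block.ready until the container health check passes or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while not block.ready:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"LM Lite block not ready after {timeout_s:.0f}s")
        time.sleep(poll_s)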
Memory Efficiency
The vLLM Overhead Problem
Traditional runtimes like vLLM carry ~15GB base overhead regardless of model size:
| Runtime | 1B Model | 7B Model | Base Overhead |
|---|---|---|---|
| vLLM | ~17GB | ~29GB | ~15GB |
| LM Lite | ~2GB | ~14GB | ~1-2GB |
For small models, vLLM uses 7-8x more memory than necessary: ~17GB versus LM Lite's ~2GB for a 1B model.
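As a back-of-envelope check using the numbers in the table above (actual capacity depends on context length and KV-cache sizing): on a 24GB GPU, vLLM's ~17GB footprint leaves room for a single 1B model, while LM Lite's ~2GB-per-model cost on top of a shared ~2GB runtime leaves room for several.

GPU_GB = 24
VLLM_1B_FOOTPRINT = 17      # ~15GB base overhead + small model, per instance
LMLITE_BASE = 2             # shared runtime overhead
LMLITE_PER_1B_MODEL = 2     # approximate cost of each additional 1B model

print(GPU_GB // VLLM_1B_FOOTPRINT)                    # 1 instance fits
print((GPU_GB - LMLITE_BASE) // LMLITE_PER_1B_MODEL)  # ~11 models fit, in theory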
Multi-Model Deployments
LM Lite runs multiple models simultaneously on single GPUs:
GPU memory budget on a 24GB card:
- Model A (7B): ~14GB
- Model B (1B): ~2GB
- Model C (1B): ~2GB
- LM Lite overhead: ~2GB
- Total: ~20GB ✓
vLLM requires a separate instance, each with its own base overhead, per model. LM Lite shares a single runtime and GPU across all of them.
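A configuration matching the budget above might look like the following. Model identifiers, context lengths, and concurrency limits here are illustrative choices for a roughly 7B + 1B + 1B layout, not recommendations:

from conduit import ModelConfig
from conduit.conduit_types import GPUS
from conduit.runtime import LMLiteBlock

block = LMLiteBlock(
    models=[
        ModelConfig(                        # Model A: ~7B, ~14GB
            "Qwen/Qwen2.5-7B-Instruct",
            max_model_len=2048,
            max_model_concurrency=25,
            model_batch_execute_timeout_ms=1000,
        ),
        ModelConfig(                        # Model B: ~1B, ~2GB
            "meta-llama/Llama-3.2-1B-Instruct",
            max_model_len=2048,
            max_model_concurrency=50,
            model_batch_execute_timeout_ms=500,
        ),
        ModelConfig(                        # Model C: ~1B, ~2GB
            "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            max_model_len=2048,
            max_model_concurrency=50,
            model_batch_execute_timeout_ms=500,
        ),
    ],
    gpu=GPUS.NVIDIA_L4,  # 24GB card
    replicas=1,
)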
Configuration
from conduit import ModelConfig
from conduit.conduit_types import GPUS
from conduit.runtime import LMLiteBlock
block = LMLiteBlock(
    models=[
        ModelConfig(
            "Qwen/Qwen3-4B",
            max_model_len=1400,
            max_model_concurrency=50,
            model_batch_execute_timeout_ms=1000,
        ),
        ModelConfig(
            "microsoft/phi-3-mini",
            max_model_len=2048,
            max_model_concurrency=25,
            model_batch_execute_timeout_ms=500,
        ),
    ],
    gpu=GPUS.NVIDIA_L4,
    replicas=2,
)
Parameters
| Parameter | Type | Description |
|---|---|---|
| model_id | str | HuggingFace model identifier |
| max_model_len | int | Maximum tokens per request (input + output) |
| max_model_concurrency | int | Maximum concurrent requests per batch |
| model_batch_execute_timeout_ms | int | Milliseconds before batch executes, even if not full |
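The last two parameters trade latency against throughput: a low timeout with modest concurrency flushes small batches quickly, while a high concurrency ceiling with a longer timeout packs fuller batches. Two illustrative profiles (values are examples only, not recommendations):

from conduit import ModelConfig

# Interactive profile: small batches, flushed almost immediately
interactive = ModelConfig(
    "microsoft/phi-3-mini",
    max_model_len=2048,
    max_model_concurrency=8,
    model_batch_execute_timeout_ms=50,
)

# Pipeline profile: large batches, willing to wait for them to fill
pipeline = ModelConfig(
    "microsoft/phi-3-mini",
    max_model_len=2048,
    max_model_concurrency=64,
    model_batch_execute_timeout_ms=2000,
)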
Lifecycle
import time

from conduit.runtime import LMLiteBlock

# Initialize
block = LMLiteBlock(models=[...], replicas=2)

# Clean up stale instances
LMLiteBlock.gc()

# Wait for readiness
while not block.ready:
    time.sleep(5)

# Execute inference
result = block(model_id="Qwen/Qwen3-4B", input=..., output=...)

# Cleanup
block.delete()
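To guarantee cleanup even when inference raises, the same lifecycle can be wrapped in try/finally (a sketch built only from the calls shown above):

import time

from conduit.runtime import LMLiteBlock

block = LMLiteBlock(models=[...], replicas=2)
LMLiteBlock.gc()                        # clean up stale instances, as above

try:
    while not block.ready:              # wait for the health check to pass
        time.sleep(5)
    result = block(model_id="Qwen/Qwen3-4B", input=..., output=...)
finally:
    block.delete()                      # always release the block's resources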
GPU Support
Current: NVIDIA CUDA GPUs
Roadmap: AMD ROCm support planned
Supported GPU Types
from conduit.conduit_types import GPUS
GPUS.NVIDIA_L4
GPUS.NVIDIA_A10
GPUS.NVIDIA_A100
# ... additional types
Specifications
| Spec | Value |
|---|---|
| Base overhead | 1-2GB |
| Model formats | HuggingFace Transformers |
| Optimal model size | 1B-13B parameters |
| Multi-model | Yes, shared GPU |
| Streaming | Supported |
| Batching | Configurable |

