Technical foundation for Covenant's sovereign AI infrastructure

LM Lite

Multi-Model Batching Runtime

High-performance inference engine optimized for multi-model deployments. Integrated into Conduit.


Overview

LM Lite is Conduit's native inference runtime, designed for efficient multi-model execution on shared GPU infrastructure. It handles batch processing, replica management, and health-check orchestration.

Key differentiator: run multiple models on a single GPU with minimal overhead and a dramatically lower memory footprint than vLLM for small-to-medium model deployments.


Architecture

Batch Processing

LM Lite collects concurrent requests and processes them as batches:

ModelConfig(
    model_id,
    max_model_concurrency=50,             # max requests per batch
    model_batch_execute_timeout_ms=1000,  # execute the batch after this timeout
)

Modes:

  • Single request: Lowest latency per request. Best for interactive applications.
  • Batch mode: Higher throughput, slightly higher latency. Best for pipeline workloads.
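
The collect-until-full-or-timeout behaviour can be pictured with a small, self-contained sketch. This is illustrative Python only, not Conduit's implementation; the parameter names mirror max_model_concurrency and model_batch_execute_timeout_ms above.

import queue
import threading
import time

class MicroBatcher:
    """Illustrative only: gather requests until the batch is full or a timeout fires."""

    def __init__(self, max_concurrency=50, timeout_ms=1000, execute=print):
        self.max_concurrency = max_concurrency
        self.timeout_s = timeout_ms / 1000.0
        self.execute = execute                     # called with the whole batch
        self._requests = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def submit(self, request):
        self._requests.put(request)

    def _loop(self):
        while True:
            batch = [self._requests.get()]         # block until the first request arrives
            deadline = time.monotonic() + self.timeout_s
            # Fill the batch until it reaches max_concurrency or the timeout expires.
            while len(batch) < self.max_concurrency:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    break
                try:
                    batch.append(self._requests.get(timeout=remaining))
                except queue.Empty:
                    break
            self.execute(batch)                    # execute everything in one call

batcher = MicroBatcher(max_concurrency=4, timeout_ms=200)
for i in range(10):
    batcher.submit(f"request-{i}")
time.sleep(1)   # give the background thread time to flush the batches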

Replica Management

LM Lite distributes load across replicas using round-robin scheduling:

LMLiteBlock(
    models=[...],
    replicas=2,   # requests distributed across replicas
)
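
Round-robin dispatch itself is simple to picture; the following lines are illustrative Python, not Conduit internals:

import itertools

replicas = ["replica-0", "replica-1"]
rr = itertools.cycle(replicas)

for i in range(4):
    print(f"request {i} -> {next(rr)}")
# request 0 -> replica-0, request 1 -> replica-1, request 2 -> replica-0, ...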

Health-Check Readiness

Traffic only routes to healthy nodes:

import time

while not block.ready:
    time.sleep(5)

# Container healthcheck passed, safe to send traffic
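
If a hard ceiling on startup time is wanted, the same readiness flag can be polled against a deadline. This is a sketch; the 300-second limit is an arbitrary example value.

import time

deadline = time.monotonic() + 300   # example ceiling, tune as needed
while not block.ready:
    if time.monotonic() > deadline:
        raise TimeoutError("LMLiteBlock did not become ready in time")
    time.sleep(5)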


Memory Efficiency

The vLLM Overhead Problem

Traditional runtimes like vLLM carry ~15GB base overhead regardless of model size:

Runtime   1B Model   7B Model   Base Overhead
vLLM      ~17GB      ~29GB      ~15GB
LM Lite   ~2GB       ~14GB      ~1-2GB

For small models, vLLM uses 7-8x more memory than necessary (~17GB vs ~2GB for a 1B model).

Multi-Model Deployments

LM Lite runs multiple models simultaneously on single GPUs:

GPU Memory:         24GB

Model A (7B):      ~14GB
Model B (1B):       ~2GB
Model C (1B):       ~2GB
LM Lite overhead:   ~2GB

Total:             ~20GB

vLLM requires separate instances per model. LM Lite shares infrastructure.
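
The arithmetic above can be sanity-checked with a trivial helper. This is a back-of-envelope sketch only; real memory use depends on the actual weights, max_model_len, and cache settings.

def fits_on_gpu(gpu_gb, model_gbs, runtime_overhead_gb=2):
    """Sum model memory plus runtime overhead and compare against GPU capacity."""
    total = sum(model_gbs) + runtime_overhead_gb
    return total, total <= gpu_gb

print(fits_on_gpu(24, [14, 2, 2]))   # (20, True): matches the 24GB budget above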


Configuration

from conduit import ModelConfig
from conduit.conduit_types import GPUS
from conduit.runtime import LMLiteBlock

block = LMLiteBlock(
    models=[
        ModelConfig(
            "Qwen/Qwen3-4B",
            max_model_len=1400,
            max_model_concurrency=50,
            model_batch_execute_timeout_ms=1000,
        ),
        ModelConfig(
            "microsoft/phi-3-mini",
            max_model_len=2048,
            max_model_concurrency=25,
            model_batch_execute_timeout_ms=500,
        ),
    ],
    gpu=GPUS.NVIDIA_L4,
    replicas=2,
)

Parameters

Parameter                        Type   Description
model_id                         str    HuggingFace model identifier
max_model_len                    int    Maximum tokens per request (input + output)
max_model_concurrency            int    Maximum concurrent requests per batch
model_batch_execute_timeout_ms   int    Milliseconds before a batch executes, even if not full

Lifecycle

import time

# Initialize
block = LMLiteBlock(models=[...], replicas=2)

# Clean stale instances
LMLiteBlock.gc()

# Wait for readiness
while not block.ready:
    time.sleep(5)

# Execute inference
result = block(model_id="Qwen/Qwen3-4B", input=..., output=...)

# Cleanup
block.delete()
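
To guarantee the GPU allocation is released even when inference raises, the same calls can be wrapped in try/finally. This is a sketch using only the methods shown above; the input/output arguments stay elided as in the original example.

import time

LMLiteBlock.gc()                                  # clean stale instances first
block = LMLiteBlock(models=[...], replicas=2)
try:
    while not block.ready:                        # wait for health checks to pass
        time.sleep(5)
    result = block(model_id="Qwen/Qwen3-4B", input=..., output=...)
finally:
    block.delete()                                # always release the block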


GPU Support

Current: NVIDIA CUDA GPUs

Roadmap: AMD ROCm support planned

Supported GPU Types

from conduit.conduit_types import GPUS

GPUS.NVIDIA_L4
GPUS.NVIDIA_A10
GPUS.NVIDIA_A100
# ... additional types


Specifications

Spec                 Value
Base overhead        1-2GB
Model formats        HuggingFace Transformers
Optimal model size   1B-13B parameters
Multi-model          Yes, shared GPU
Streaming            Supported
Batching             Configurable

→ Conduit Framework
→ MDL Specification