Architect Guide — LINUS-AI

Distributed inference topologies, tensor and pipeline parallelism, mesh networking, and performance engineering.


System Architecture Overview

LINUS-AI ships as a single statically-linked binary that embeds every subsystem needed for private AI inference. There are no runtime service dependencies, no daemon managers, and no network calls to external infrastructure after licence activation.

Component Diagram

linus-ai — internal component layout
┌─────────────────────────────────────────────────────────────────┐
│                         linus-ai binary                         │
│                                                                 │
│  ┌──────────────────────┐   ┌──────────────────────────────┐   │
│  │    HTTP Server       │   │      Agent Engine            │   │
│  │        (Axum)        │◄──│  (tool-use, RAG, planning)   │   │
│  │  port 8080           │   └──────────────┬───────────────┘   │
│  └──────────┬───────────┘                  │                   │
│             │ request                       │ inference call    │
│             ▼                              ▼                   │
│  ┌──────────────────────────────────────────────────────────┐  │
│  │               Inference Engine                           │  │
│  │          (linus-ai-inference crate / llama.cpp FFI)      │  │
│  │  continuous batching · KV cache · flash attention        │  │
│  │  speculative decoding · GGUF loader · quantisation       │  │
│  └──────────────────────────────────────────────────────────┘  │
│                                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────────┐    │
│  │    Vault    │  │  Mesh Node   │  │  Blockchain Audit  │    │
│  │ AES-256-GCM │  │ linus-ai-net │  │   (audit.chain)    │    │
│  │ HKDF key    │  │ mTLS · QUIC  │  │   hash-chained     │    │
│  └─────────────┘  └──────────────┘  └────────────────────┘    │
│                                                                 │
│  ┌─────────────────────────┐  ┌────────────────────────────┐  │
│  │    Thermal Monitor      │  │    Guardian Process        │  │
│  │  linus-ai-thermal crate │  │  linus-ai-guardian crate   │  │
│  │  NVML · sysfs · IOKit   │  │  watchdog · OOM · upgrade  │  │
│  └─────────────────────────┘  └────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Data Flow

request lifecycle
User Request (HTTP POST /v1/chat/completions)
        │
        ▼
  HTTP Layer (Axum)
  ├─ auth check (API key / no-auth localhost)
  ├─ rate limiting
  └─ route to handler
        │
        ▼
  Agent Engine
  ├─ tool-use planning (if agentic mode)
  ├─ RAG retrieval from Vault (semantic search over HNSW)
  └─ build prompt context
        │
        ▼
  Inference Engine
  ├─ tokenise input
  ├─ prefill forward pass  ← compute-bound
  ├─ KV cache store
  └─ decode loop (token-by-token) ← memory-bandwidth-bound
        │
        ▼
  Response Stream (SSE / chunked JSON)
  ├─ tokens streamed as generated
  ├─ audit record appended (token counts only)
  └─ vault store (if memory enabled)

Key Design Principles

01

Single Binary

The entire stack — HTTP server, inference engine, vault, mesh, audit, thermal monitor, and guardian — compiles into one statically-linked binary. Deployment is a file copy. No dependency managers, no container runtimes, no separate services to keep in sync.

02

Zero-Cloud Default

After the one-time licence activation (which contacts the licence server over HTTPS), LINUS-AI operates fully air-gapped. All model weights, embeddings, conversation history, and generated text remain on the host hardware. No telemetry callbacks, no model API proxies, no logging to remote endpoints.

03

Privacy-First

Conversation content is never written to disk in plaintext. Vault entries are encrypted before storage. The audit ledger records only metadata (timestamps, token counts) — never prompt or response content. Key material is derived fresh at startup from hardware identifiers and never persisted.

Inference Engine

The inference engine wraps llama.cpp through the linus-ai-inference Rust crate, exposing a safe async API over an unsafe FFI boundary. The crate handles GGUF model loading, context management, and generation scheduling.

llama.cpp Integration (Rust FFI)

The linus-ai-inference crate uses a thin unsafe extern "C" block to call into the llama.cpp C API. The Rust wrapper adds ownership semantics, error propagation via Result<T, InferenceError>, and async compatibility through tokio::task::spawn_blocking for the CPU-intensive decode loop. GPU kernels are launched asynchronously on dedicated CUDA streams, but the host-side decode loop blocks while waiting on results; running it under spawn_blocking keeps the tokio event loop free.

linus-ai-inference — Rust FFI sketch
// Simplified — actual crate has full lifetime management
extern "C" {
    fn llama_load_model_from_file(path: *const c_char, params: LlamaModelParams) -> *mut LlamaModel;
    fn llama_new_context_with_model(model: *mut LlamaModel, params: LlamaContextParams) -> *mut LlamaContext;
    fn llama_decode(ctx: *mut LlamaContext, batch: LlamaBatch) -> c_int;
    fn llama_sampling_sample(ctx: *mut LlamaSamplingContext, main_ctx: *mut LlamaContext, idx: c_int) -> LlamaToken;
}

pub async fn generate(model: Arc<Model>, prompt: String, params: GenParams)
    -> impl Stream<Item = Token>
{
    // run the blocking llama_decode loop off the async executor and
    // stream each sampled token back through a channel
    let (tx, rx) = mpsc::unbounded_channel();
    spawn_blocking(move || { /* llama_decode loop: tx.send(token) */ });
    UnboundedReceiverStream::new(rx)
}

GGUF Model Format

GGUF (GPT-Generated Unified Format) stores model weights, tokeniser vocabulary, and metadata in a single file with a self-describing header. LINUS-AI supports all quantisation levels that llama.cpp exposes: Q2_K, Q3_K_S/M/L, Q4_0, Q4_K_S/M, Q5_K_S/M, Q6_K, Q8_0, and F16/BF16. GGUF tensors can be memory-mapped (mmap) from disk, avoiding a full load before the first token is produced.
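A minimal header probe illustrates the self-describing format. This is a sketch assuming the published GGUF layout (4-byte magic "GGUF", u32 version, u64 tensor count, u64 metadata key-value count, all little-endian); it does not parse the metadata section itself.

```rust
/// Probe the fixed 24-byte GGUF preamble. Assumes the published
/// layout: magic "GGUF", u32 version, u64 tensor count, u64 metadata
/// KV count, little-endian. Returns (version, n_tensors, n_kv).
fn parse_gguf_header(buf: &[u8]) -> Option<(u32, u64, u64)> {
    if buf.len() < 24 || buf[0..4] != *b"GGUF" {
        return None; // too short or not a GGUF file
    }
    let version = u32::from_le_bytes(buf[4..8].try_into().ok()?);
    let n_tensors = u64::from_le_bytes(buf[8..16].try_into().ok()?);
    let n_kv = u64::from_le_bytes(buf[16..24].try_into().ok()?);
    Some((version, n_tensors, n_kv))
}
```

Because the header is at a fixed offset, a loader can validate a file and enumerate tensors before deciding whether to mmap the rest.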

Continuous Batching

Under a static batching scheme each generation request occupies one full forward pass. Continuous batching (also called in-flight batching) allows the engine to insert new requests into an ongoing decode step, sharing the forward pass across multiple sequences simultaneously. This dramatically improves GPU utilisation when there is a mix of short and long requests in the queue.

continuous batching — scheduler config
# config.toml
[inference]
continuous_batching = true
max_batch_size = 64 # max concurrent sequences in one step
scheduler = "fcfs" # first-come-first-served | priority

KV Cache

The key-value cache stores attention keys and values for all previously computed tokens, avoiding redundant recomputation during the decode phase. Cache size in bytes:

KV_cache_bytes = 2 × n_layers × n_kv_heads × head_dim × context_len × dtype_bytes

For a 70B Llama-3 model (80 layers, 8 KV heads under GQA, head_dim=128, context 8192, fp16):
2 × 80 × 8 × 128 × 8192 × 2 bytes = ~2.7 GB

Tip: When the KV cache overflows VRAM it spills to RAM via kv_cache_type = "offload". Offloaded cache incurs a PCIe round-trip per decode step. Keep context windows as small as the workload allows to maximise VRAM residency.
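The sizing formula translates directly into a helper. A sketch, written against the KV-head count so it covers both MHA and GQA models:

```rust
/// KV cache size: one K and one V tensor (factor 2) per layer, per
/// KV head, per context position, at dtype_bytes per element.
fn kv_cache_bytes(n_layers: u64, n_kv_heads: u64, head_dim: u64,
                  context_len: u64, dtype_bytes: u64) -> u64 {
    2 * n_layers * n_kv_heads * head_dim * context_len * dtype_bytes
}
```

For an 80-layer model with 8 KV heads, head_dim 128, 8K context in fp16 this gives roughly 2.7 GB.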

Speculative Decoding

Speculative decoding pairs a small fast draft model with the main model. The draft model autoregressively generates K candidate tokens (K cheap forward passes); the main model then verifies all K tokens in a single forward pass (parallel, not sequential). Accepted tokens are kept; the first rejected token triggers a rollback, with the verifier's own token substituted at that position. Typical accept rates of 80–90% yield 2–4× throughput improvements on memory-bandwidth-bound workloads.

speculative decoding config
# config.toml
[inference]
speculative_decoding = true
draft_model = "llama3.2-1b" # small fast drafter
draft_steps = 6 # tokens to speculate per step
model = "llama3-70b" # main verifier model
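The accept/rollback step can be sketched as below. This is a greedy simplification (token-for-token agreement); production implementations typically use probabilistic acceptance against the two models' distributions.

```rust
/// Greedy speculative acceptance: keep draft tokens until the first
/// position where the verifier disagrees, then substitute the
/// verifier's token there and roll back the remainder.
fn accept_speculated(draft: &[u32], verified: &[u32]) -> Vec<u32> {
    let mut out = Vec::new();
    for (&d, &v) in draft.iter().zip(verified.iter()) {
        if d == v {
            out.push(d); // accepted
        } else {
            out.push(v); // first mismatch: take verifier's token, stop
            return out;
        }
    }
    out // all K draft tokens accepted
}
```

With an 80–90% accept rate, most steps emit several tokens for the cost of one main-model pass.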

Flash Attention

Flash Attention (Tri Dao et al.) reorders attention computation to be IO-aware, avoiding materialisation of the full N×N attention matrix in VRAM. The computation is tiled and fused into a single CUDA kernel, cutting attention's extra memory footprint from O(N²) to O(N) and sharply reducing HBM reads/writes. Enabled by default when the CUDA backend is active and the GPU has compute capability ≥ 8.0 (Ampere+). Set flash_attention = false to disable.

Memory Layout

Component Location Notes
Model weights VRAM primary Quantised tensors loaded into GPU memory. Overflow layers CPU-offloaded via gpu_layers.
KV cache VRAM primary Spills to pinned RAM when VRAM exhausted. Configurable via kv_cache_type.
Activations VRAM (transient) Held only during forward pass. Flash Attention reduces peak activation memory.
Tokeniser vocab RAM Small (<50 MB). Always CPU-resident.
GGUF file Disk (mmap) Memory-mapped; OS page cache handles eviction.

Tensor Parallelism

Tensor parallelism (TP) splits the weight matrices of each transformer layer across N GPUs. In the usual Megatron-style scheme, the first linear of a block is split column-wise (each GPU computes its slice of the intermediate activation with no communication) and the second is split row-wise, so each GPU produces a partial sum of the output. A single all-reduce per block combines the partial sums, producing the same output as if the full matrices were on one device.

How It Works

row-wise tensor split — 4× GPU
Second linear W [d_ff × d_model], split row-wise across 4 GPUs
(X is already sharded the same way by the column-parallel first linear):

GPU 0: X[    0 :  d_ff/4 ] × W[    0 :  d_ff/4 , :]  →  partial_0
GPU 1: X[ d_ff/4: d_ff/2 ] × W[ d_ff/4: d_ff/2 , :]  →  partial_1
GPU 2: X[ d_ff/2:3·d_ff/4] × W[ d_ff/2:3·d_ff/4, :]  →  partial_2
GPU 3: X[3·d_ff/4: d_ff  ] × W[3·d_ff/4: d_ff  , :]  →  partial_3

AllReduce (sum): output = partial_0 + partial_1 + partial_2 + partial_3

Each GPU holds 1/N of each weight matrix.
AllReduce bandwidth ∝ 2 × (N-1)/N × message_size per layer.
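The all-reduce identity is easy to check numerically. A toy sketch with plain vectors standing in for device tensors: each shard multiplies only its slice of the inner dimension, and summing the partials reproduces the full product.

```rust
/// Partial mat-vec over an inner-dimension slice [lo, hi): each
/// shard touches only its slice of x and the matching part of w.
fn partial_matvec(w: &[Vec<f32>], x: &[f32], lo: usize, hi: usize) -> Vec<f32> {
    w.iter()
        .map(|row| (lo..hi).map(|j| row[j] * x[j]).sum())
        .collect()
}

/// All-reduce (sum): elementwise addition of every shard's partials.
fn all_reduce_sum(parts: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![0.0; parts[0].len()];
    for p in parts {
        for (o, v) in out.iter_mut().zip(p) {
            *o += *v;
        }
    }
    out
}
```

Splitting [5, 6] across two shards of W = [[1, 2], [3, 4]] and summing the partials yields the same [17, 39] as the unsharded product.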

Configuration

~/.linus_ai/config.toml — tensor parallel
# 4× GPU tensor parallelism on a single node
[inference]
backend = "cuda"
tensor_parallel = 4 # degree: 1 | 2 | 4 | 8
gpu_ids = [0,1,2,3] # GPU indices (must match degree)
model = "llama3-70b"
context_len = 8192
 
$ linus-ai --serve --tensor-parallel 4 --gpu-ids 0,1,2,3
✓ Tensor parallel: 4 shards across GPU 0,1,2,3
✓ Weight per GPU: ~17.5B params (70B / 4)
✓ KV cache: 4 independent shards (each GPU stores head subset)
✓ API server: http://0.0.0.0:8080

Bandwidth Requirements

Interconnect Bandwidth (bidirectional) Max TP Degree Recommended Use
NVLink 4.0 900 GB/s 8 H100 SXM clusters — optimal for all TP degrees
NVLink 3.0 600 GB/s 8 A100 SXM clusters — excellent for TP ≤ 8
PCIe 5.0 × 16 128 GB/s 4 Consumer / workstation GPUs — TP ≤ 4, watch latency
PCIe 4.0 × 16 64 GB/s 2 TP 2 only. AllReduce becomes bottleneck at TP 4+

RPC Mode vs Native CUDA Mode

Native CUDA mode is used when all GPUs are on the same host. NCCL handles the AllReduce collective directly over NVLink or PCIe. No extra network stack is involved.

RPC mode (handled by the linus-ai-net crate) extends tensor parallelism across hosts by serialising partial activations and shipping them over QUIC. The bandwidth requirement is the same as native mode — RPC mode is intended for specialised InfiniBand or 400G Ethernet clusters, not commodity networks. Use pipeline parallelism for multi-node deployments on standard LAN.

Memory Savings

With TP=4, each GPU holds 1/4 of the model weights plus 1/4 of the KV cache (each GPU handles a subset of attention heads). Peak VRAM per GPU drops from the full model size to approximately model_bytes / N + kv_cache_bytes / N. At TP=8 on H100s, a 405B-parameter model (810 GB in fp16) becomes ~101 GB per GPU; with fp8 quantisation (~51 GB per GPU) it fits within the 80 GB per-GPU limit.

When to Use Tensor Parallelism

  • Single node, multiple GPUs
  • NVLink preferred; PCIe acceptable at TP ≤ 4
  • Best when high VRAM headroom is needed

Pipeline Parallelism

Pipeline parallelism (PP) partitions the model depth-wise: stage 0 holds the embedding layer and the first block of transformer layers, stage 1 the next block, and so on. Each stage processes one micro-batch, then passes its activations to the next stage while immediately starting on the following micro-batch. The pipeline is kept full by splitting each batch into smaller micro-batches.

How It Works

3-stage pipeline — layer distribution
Node A (Stage 0) │  Node B (Stage 1) │  Node C (Stage 2)
─────────────────┼───────────────────┼──────────────────
 embed + L0-L15  │     L16-L31       │    L32-L47 + head
                 │                   │
Time →   [μB0]──▶[μB0]──▶[μB0]──▶ out
               [μB1]──▶[μB1]──▶[μB1]──▶ out
                    [μB2]──▶[μB2]──▶[μB2]──▶ out
                         [μB3]──▶[μB3]──▶[μB3]──▶ out

Pipeline bubble = (stages - 1) / (micro_batches + stages - 1)
Reduce bubble by increasing micro_batch_count.

Stage Assignment Algorithm

The default stage assignment distributes layers so that each stage has approximately equal parameter count, accounting for the embedding table (which lives on stage 0) and the LM head (stage N-1). Heterogeneous hardware is supported: nodes with lower peak TFLOPS receive fewer layers proportionally. The assignment is computed at cluster startup and re-emitted in the startup log.
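A minimal version of such a balanced assignment is a greedy cut by running parameter count. This sketch ignores the TFLOPS weighting for heterogeneous nodes; the real assignment described above also scales each stage's budget by node speed.

```rust
/// Greedy stage assignment: walk per-layer parameter counts and start
/// a new stage each time the running sum reaches total / n_stages.
/// Returns the stage index assigned to each layer.
fn assign_stages(layer_params: &[u64], n_stages: usize) -> Vec<usize> {
    let total: u64 = layer_params.iter().sum();
    let target = total / n_stages as u64;
    let mut stage = 0usize;
    let mut acc = 0u64;
    let mut out = Vec::with_capacity(layer_params.len());
    for &p in layer_params {
        if acc >= target && stage + 1 < n_stages {
            stage += 1; // current stage reached its budget: open the next
            acc = 0;
        }
        out.push(stage);
        acc += p;
    }
    out
}
```

For 48 equally sized layers and 3 stages this reproduces the 16/16/16 split shown in the startup log below.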

pipeline parallel config — 3 nodes
# config.toml on the coordinator node
[inference]
pipeline_parallel = 3
pipeline_stages = ["0-15", "16-31", "32-47"] # manual override (optional)
micro_batch_size = 8 # sequences per micro-batch
model = "llama3-70b"
 
$ linus-ai --serve --pipeline-parallel 3 --mesh-role coordinator
✓ Stage 0 → Node A (192.168.1.10): layers 0-15 [embed + 16 layers]
✓ Stage 1 → Node B (192.168.1.11): layers 16-31 [16 layers]
✓ Stage 2 → Node C (192.168.1.12): layers 32-47 [16 layers + head]
✓ Pipeline efficiency: ~85% at micro_batch_size=8

Inter-Stage Communication

Transport Use Case Latency Config
Unix socket Stages on same host <1 µs transport = "unix"
TCP LAN (same rack / datacenter) ~100 µs transport = "tcp"
QUIC WAN or unreliable networks ~5 ms transport = "quic"

Bubble Overhead and Micro-Batching

The pipeline bubble is idle time at the beginning and end of each batch where some stages have no work. With a single micro-batch in flight, the bubble fraction is (stages − 1)/stages — very large. Splitting work into more micro-batches amortises the bubble, at the cost of higher end-to-end latency per request. Tune based on your throughput-vs-latency requirements.

bubble_fraction = (stages - 1) / (micro_batches_in_flight + stages - 1)
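Plugging numbers into the bubble formula shows why micro-batch count matters; a one-line helper:

```rust
/// Pipeline bubble fraction for a given stage count and number of
/// micro-batches in flight, per the formula above.
fn bubble_fraction(stages: u32, micro_batches: u32) -> f64 {
    (stages - 1) as f64 / (micro_batches + stages - 1) as f64
}
```

Three stages with a single micro-batch idle two thirds of the time; with eight micro-batches in flight the bubble drops to 20%.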

Heterogeneous Hardware Support

If Node A has 2× A100 80GB and Node B has 1× RTX 4090 24GB, set pipeline_stages = ["0-40", "41-47"] to assign 41 layers to the faster node and only 7 to the slower one. The coordinator monitors per-stage completion times and can emit warnings if a stage becomes a bottleneck.

When to Use Pipeline Parallelism

  • Multiple machines; model too large for a single node
  • Heterogeneous hardware supported
  • Higher latency than TP
  • Can combine with TP (TP within a node, PP across nodes)

Mesh Networking

The linus-ai-net crate implements a P2P encrypted overlay network that allows multiple LINUS-AI nodes to form an inference cluster without a centralised broker or VPN. Nodes communicate directly over authenticated, encrypted channels.

Discovery

  • LAN (mDNS) Nodes broadcast _linusai._tcp.local mDNS records. Other nodes on the same L2 segment discover them automatically within seconds. No configuration required.
  • WAN (static peers) Set peers = ["10.0.0.2:9090", "10.0.0.3:9090"] in [mesh] config. Nodes connect on startup and keep connections alive with heartbeats.
  • Kubernetes Use the headless service DNS records as peer list. Each pod's DNS name resolves to pod IP. Set peers = ["linus-ai-0.linus-ai.default.svc:9090", ...].

Mutual TLS

Each node generates an Ed25519 keypair and a self-signed X.509 certificate on first run, stored in ~/.linus_ai/mesh/. When two nodes connect, they perform a full mTLS handshake — both sides authenticate each other. The coordinator's public key fingerprint is used as the cluster identity: workers that present a different coordinator cert are rejected.

Node Roles

Role Responsibilities Exposes HTTP?
coordinator Accepts client requests, orchestrates layer assignment, routes activation tensors between pipeline stages, monitors worker liveness. Yes — port 8080
worker Receives activation tensors from the upstream stage, runs the assigned layers, sends activations to the downstream stage. No — mesh port only

Fault Tolerance

The coordinator sends heartbeat pings to each worker every 5 seconds. A worker that fails to respond within 3 consecutive heartbeat intervals (15 s default) is marked dead. The coordinator then reassigns that worker's layers to the remaining workers, redistributing the model partition map. In-flight requests that were routed through the dead worker are cancelled and retried. No manual intervention required.
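The liveness rule reduces to a timestamp comparison. A sketch (parameter names are illustrative, not the crate's API):

```rust
/// A worker is considered dead once the time since its last heartbeat
/// reaches `missed_allowed` full heartbeat intervals: 3 × 5 s = 15 s
/// with the defaults described above.
fn is_dead(last_heartbeat_s: u64, now_s: u64,
           interval_s: u64, missed_allowed: u64) -> bool {
    now_s.saturating_sub(last_heartbeat_s) >= interval_s * missed_allowed
}
```

At 14 s of silence the worker is still considered alive; at 15 s it is marked dead and its layers are reassigned.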

Topology

cluster topology diagram
Within one cluster: full mesh (each worker ↔ every other worker)

  Worker A ──── Worker B
     │  ╲      ╱  │
     │   ╲    ╱   │
     │    ╲  ╱    │
  Worker C ──── Worker D
           │
      Coordinator (exposes port 8080 to clients)

Between clusters: star topology

  Cluster 1           Cluster 2
  ┌─────────┐         ┌─────────┐
  │ Coord 1 │◄───────►│ Coord 2 │
  └─────────┘         └─────────┘
       ▲                    ▲
  workers...           workers...

Wire Protocol

LINUS-AI uses a custom binary protocol over QUIC (primary) or TCP (fallback). Each message has a fixed 16-byte header: [magic(4)] [version(2)] [msg_type(2)] [payload_len(8)], followed by payload_len bytes of payload. Activation tensors are transmitted in their raw binary representation without re-serialisation overhead. Port 9090 is the default (configurable via listen_port).
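The header framing can be sketched directly from the field layout above. The magic constant and little-endian byte order here are assumptions for illustration; only the field widths come from the text.

```rust
const MAGIC: [u8; 4] = *b"LNAI"; // hypothetical magic value

/// Pack the 16-byte header: [magic(4)][version(2)][msg_type(2)][payload_len(8)].
fn encode_header(version: u16, msg_type: u16, payload_len: u64) -> [u8; 16] {
    let mut h = [0u8; 16];
    h[0..4].copy_from_slice(&MAGIC);
    h[4..6].copy_from_slice(&version.to_le_bytes());
    h[6..8].copy_from_slice(&msg_type.to_le_bytes());
    h[8..16].copy_from_slice(&payload_len.to_le_bytes());
    h
}

/// Unpack a header, rejecting frames with the wrong magic.
fn decode_header(h: &[u8; 16]) -> Option<(u16, u16, u64)> {
    if h[0..4] != MAGIC {
        return None;
    }
    Some((
        u16::from_le_bytes(h[4..6].try_into().ok()?),
        u16::from_le_bytes(h[6..8].try_into().ok()?),
        u64::from_le_bytes(h[8..16].try_into().ok()?),
    ))
}
```

A receiver reads exactly 16 bytes, decodes payload_len, then reads that many bytes of raw tensor data with no further parsing.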

mesh setup — coordinator + 2 workers
# Node 1 (192.168.1.10) — coordinator
$ linus-ai --serve \
--mesh-role coordinator \
--mesh-port 9090 \
--pipeline-parallel 3
 
# Node 2 (192.168.1.11) — worker
$ linus-ai --mesh-role worker --mesh-join 192.168.1.10:9090
 
# Node 3 (192.168.1.12) — worker
$ linus-ai --mesh-role worker --mesh-join 192.168.1.10:9090
 
✓ Mesh cluster formed: 1 coordinator + 2 workers
✓ mTLS handshake: all nodes authenticated
✓ Pipeline: 3 stages assigned automatically

Vault & Encrypted Storage

The vault provides persistent, searchable, end-to-end encrypted memory for LINUS-AI. Conversation summaries, user preferences, and knowledge snippets are stored encrypted and retrieved via semantic similarity search — without the vault key ever touching disk.

Key Derivation

On startup, LINUS-AI collects three hardware identifiers: CPUID (processor model string), MAC address (first non-loopback interface), and hostname. These are concatenated and fed into HKDF-SHA256 with a fixed salt to produce a 256-bit AES key. The key exists only in process memory. No key file is created.

vault_key = HKDF-SHA256(ikm = CPUID ‖ MAC ‖ hostname, salt = "linus-ai-vault-v1", info = "")

Encryption

  • Algorithm AES-256-GCM. Each record has its own 96-bit random nonce. The nonce is stored alongside the ciphertext.
  • Auth tag 128-bit GCM authentication tag appended after ciphertext. Tampering with any byte causes authentication failure.
  • AAD Record ID and timestamp are used as Additional Authenticated Data, preventing record transplantation attacks.

Storage Format

Encrypted blobs are stored in a SQLite database (~/.linus_ai/vault.db) using Write-Ahead Logging (WAL) mode for concurrent read safety. The schema has two tables: records (id, nonce, ciphertext, tag, created_at) and embeddings (id, vector BLOB). Embeddings are stored in plaintext — they are dense floating-point vectors that encode semantic meaning without containing readable text.

Semantic Index (HNSW)

The vault maintains an HNSW (Hierarchical Navigable Small World) graph index over embedding vectors, enabling approximate nearest-neighbour search in sub-millisecond time across millions of records. The HNSW graph is built in memory at startup from the embeddings table and updated incrementally as new records are inserted. The index is not encrypted because embeddings do not contain reconstructable content.
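An HNSW graph is too involved to sketch here, but what it approximates is plain nearest-neighbour search under cosine similarity. An exact brute-force reference, which HNSW speeds up from O(n·d) per query to roughly logarithmic:

```rust
/// Cosine similarity between two dense embedding vectors.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Exact nearest neighbour: scan every stored embedding and return
/// the index of the most similar one.
fn nearest(query: &[f32], db: &[Vec<f32>]) -> Option<usize> {
    db.iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
        .map(|(i, _)| i)
}
```

HNSW trades a small recall loss for avoiding this full scan, which is what makes sub-millisecond retrieval over millions of records possible.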

Export / Import

vault export and import
# Export vault to a portable encrypted archive
$ linus-ai --vault-export vault-backup.lv
Enter export passphrase: ████████████
✓ Vault exported: vault-backup.lv (1,247 records, 42.3 MB)
 
# Import vault on a new machine
$ linus-ai --vault-import vault-backup.lv
Enter import passphrase: ████████████
✓ Vault imported: 1,247 records
✓ HNSW index rebuilt in 340ms
Warning: Vault export re-encrypts data with a user-supplied passphrase for portability. The export file is encrypted but the hardware binding is removed. Protect export files as you would a password database.

Blockchain Audit Ledger

The audit ledger (Team and Enterprise tiers) provides a tamper-evident, append-only log of AI interactions for compliance and governance. It proves that interactions occurred and characterises them — without storing any sensitive content.

Purpose

Regulated industries (finance, healthcare, legal) require evidence that AI outputs were reviewed, that the correct model was used, and that usage volumes are auditable. The ledger satisfies these requirements while preserving user privacy: prompt text and response text are never written to the ledger.

Block Structure

audit block — binary layout
struct AuditBlock {
    block_index : u64,         // monotonically increasing
    timestamp   : i64,         // Unix epoch, nanoseconds
    session_id  : [u8; 16],    // UUIDv4, per-conversation
    model       : [u8; 64],    // model name, null-padded
    prompt_tokens  : u32,      // input token count
    completion_tokens: u32,    // output token count
    user_id     : [u8; 32],    // SHA256(username), never plaintext
    prev_hash   : [u8; 32],    // SHA256 of previous block bytes
    block_hash  : [u8; 32],    // SHA256(all fields above)
}
// Total: 200 bytes per block. NO prompt or response content.

Storage and Verification

  • Location ~/.linus_ai/audit.chain — raw binary, hash-chained blocks. File grows at 200 bytes per request.
  • Integrity check Recompute SHA256 of each block and verify block_hash == SHA256(block_bytes) and prev_hash == previous block_hash. Any modification invalidates all subsequent blocks.
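The verification walk looks like this. To stay dependency-free the sketch uses a stand-in 64-bit FNV-1a hash in place of SHA-256; the chain-walking logic is the same.

```rust
/// Stand-in hash (FNV-1a, 64-bit). The real ledger uses SHA-256.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100_0000_01b3);
    }
    h
}

struct Block {
    prev_hash: u64,
    payload: Vec<u8>, // the metadata fields, serialised
    block_hash: u64,
}

fn make_block(prev_hash: u64, payload: Vec<u8>) -> Block {
    let mut buf = prev_hash.to_le_bytes().to_vec();
    buf.extend(&payload);
    let block_hash = fnv1a(&buf);
    Block { prev_hash, payload, block_hash }
}

/// Recompute every hash and check both links: any edit to an earlier
/// block invalidates it and every block after it.
fn verify_chain(chain: &[Block]) -> bool {
    let mut prev = 0u64; // genesis prev_hash
    for b in chain {
        let mut buf = b.prev_hash.to_le_bytes().to_vec();
        buf.extend(&b.payload);
        if b.prev_hash != prev || b.block_hash != fnv1a(&buf) {
            return false;
        }
        prev = b.block_hash;
    }
    true
}
```

Flipping a single payload byte in an early block makes verification fail from that block onward, which is exactly the tamper-evidence property the ledger relies on.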
audit chain — verify and export
$ linus-ai --verify-audit
Verifying audit chain: ~/.linus_ai/audit.chain
✓ 14,237 blocks verified in 82ms
✓ Chain integrity: VALID
 
$ linus-ai --export-audit audit-q1-2026.json
✓ Exported 14,237 records to audit-q1-2026.json
Fields: block_index, timestamp, session_id, model,
prompt_tokens, completion_tokens, user_id, block_hash

Thermal Management

The linus-ai-thermal crate polls GPU and CPU temperatures on a configurable interval (default 2 s) and applies throttling policies to prevent hardware damage during sustained inference workloads.

Temperature Sources

Platform GPU Source CPU Source
Linux / NVIDIA NVML (nvmlDeviceGetTemperature) sysfs /sys/class/thermal/thermal_zone*/temp
Linux / AMD ROCm SMI (rsmi_dev_temp_metric_get) sysfs /sys/class/hwmon/*/temp*_input
macOS IOKit (IOServiceGetMatchingService) IOKit SMC sensors
Windows NVML / ADL SDK WMI MSAcpi_ThermalZoneTemperature

Throttling Policy

  • < 80°C Normal operation. No action taken.
  • 80–89°C (warn) Warning logged. Metric linus_ai_thermal_state set to 1. No throughput impact.
  • ≥ 90°C (throttle) Max batch size halved. Inter-step delay of 20 ms inserted. Metric state set to 2. Warning emitted to stderr and structured log.
  • ≥ 95°C (critical) All new requests rejected (503). In-flight requests complete. Inference resumes when temperature drops below 88°C for 30 s.
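The policy maps directly onto the linus_ai_thermal_state metric values. A sketch of the state function (the recovery hysteresis, resuming below 88°C for 30 s, is omitted):

```rust
/// Map temperature to the thermal state gauge described above:
/// 0 = ok, 1 = warn, 2 = throttle, 3 = critical.
fn thermal_state(temp_c: f64) -> u8 {
    if temp_c >= 95.0 {
        3 // reject new requests with 503
    } else if temp_c >= 90.0 {
        2 // halve batch size, insert inter-step delay
    } else if temp_c >= 80.0 {
        1 // warn only
    } else {
        0 // normal operation
    }
}
```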

Prometheus Metrics

GET /metrics — thermal gauges
# HELP linus_ai_gpu_temp_celsius GPU temperature in Celsius
# TYPE linus_ai_gpu_temp_celsius gauge
linus_ai_gpu_temp_celsius{device="0"} 72.0
linus_ai_gpu_temp_celsius{device="1"} 74.5
 
# HELP linus_ai_cpu_temp_celsius CPU package temperature
# TYPE linus_ai_cpu_temp_celsius gauge
linus_ai_cpu_temp_celsius{package="0"} 58.0
 
# HELP linus_ai_thermal_state Thermal state: 0=ok 1=warn 2=throttle 3=critical
# TYPE linus_ai_thermal_state gauge
linus_ai_thermal_state 0

Guardian Process

In server mode, linus-ai-guardian acts as a parent process that supervises the main inference server, ensuring availability, graceful upgrades, and controlled restarts without human intervention.

Architecture

guardian process tree
PID 1001   linus-ai-guardian         ← parent supervisor
  PID 1002   linus-ai --serve        ← main inference server
               ├─ HTTP server (8080)
               ├─ Inference engine
               ├─ Mesh node
               └─ Vault

Guardian ↔ Main IPC: Unix socket at /var/run/linus-ai-ctl.sock
  Messages: reload | shutdown | status | heartbeat

Watchdog

  • Crash restart Main process exit detected via waitpid. Guardian restarts with exponential backoff: 1 s, 2 s, 4 s, 8 s, 16 s cap. After 5 consecutive crashes in 60 s, guardian enters fail-safe mode and pages ops via LINUS_AI_ALERT_URL webhook.
  • OOM kill threshold Guardian polls /proc/PID/status (Linux) or task_info (macOS). If RSS exceeds oom_threshold_gb, guardian sends SIGTERM, waits drain_timeout, then SIGKILL.
  • Deadlock detection Main process sends a heartbeat every 5 s over the control socket. Guardian expects a heartbeat within 30 s; if missed, it assumes deadlock and forcefully restarts.
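The restart schedule in the first bullet is a capped exponential; as a sketch:

```rust
/// Restart delay for the n-th consecutive crash (0-based):
/// 1 s, 2 s, 4 s, 8 s, then capped at 16 s.
fn restart_backoff_s(attempt: u32) -> u64 {
    1u64 << attempt.min(4)
}
```

The cap keeps a persistently crashing server retrying every 16 s until the fail-safe threshold (5 crashes in 60 s) trips.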

Zero-Downtime Upgrade

hot upgrade — no downtime
$ linus-ai --upgrade /path/to/new-binary
→ Guardian: new binary verified (Ed25519 sig OK)
→ Guardian: signaling main process to drain (SIGUSR1)
→ Main: draining 3 in-flight requests…
→ Main: all requests complete. Exiting cleanly.
→ Guardian: replacing binary on disk
→ Guardian: spawning new main process
✓ Upgrade complete. Version: 2.4.1 → 2.5.0

Control Interface

  • PID file: /var/run/linus-ai.pid — written by guardian on startup.
  • Control socket: /var/run/linus-ai-ctl.sock — Unix domain socket for IPC messages.
  • Reload config: linus-ai --reload — guardian sends reload message; main re-reads config.toml without restart.
  • Graceful shutdown: linus-ai --shutdown — drains requests, guardian exits cleanly.
  • Status: linus-ai --status — returns PID, uptime, version, request counts, memory usage.

Performance Tuning

Extracting maximum throughput from LINUS-AI requires understanding the two distinct phases of inference, the memory hierarchy, and the knobs that control scheduling.

Prefill vs Decode Phases

Phase Bottleneck Optimisation
Prefill (process input) Compute-bound — processes all input tokens in parallel; GPU FLOPS is the limit. Larger batch sizes. Flash Attention. Mixed-precision (BF16 on Ampere+).
Decode (generate output) Memory-bandwidth-bound — generates one token per step; must load all weights each step. Quantisation (reduces bytes/param). Speculative decoding. Continuous batching.

Token Throughput Formula

throughput_tok/s = (batch_size × tokens_per_step) / step_time_seconds

step_time is dominated by weight loading (decode phase) or matrix multiplication (prefill). On a single H100 running a Q4_K_M 70B model, expect ~35 tokens/s at batch=1, ~280 tokens/s at batch=8 (continuous batching).
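Combining the formula with the decode-phase bandwidth bound gives a quick back-of-envelope estimator. The numbers below are illustrative, assuming a ~35 GB quantised weight set on ~3.35 TB/s HBM3:

```rust
/// Decode steps can go no faster than streaming the full weight set
/// from VRAM once per step (memory-bandwidth bound).
fn min_step_time_s(weight_bytes: f64, mem_bw_bytes_per_s: f64) -> f64 {
    weight_bytes / mem_bw_bytes_per_s
}

/// Token throughput per the formula above.
fn throughput_tok_s(batch_size: f64, tokens_per_step: f64, step_time_s: f64) -> f64 {
    batch_size * tokens_per_step / step_time_s
}
```

35 GB of weights over 3.35 TB/s bounds the decode step at ~10.4 ms, i.e. a ceiling of roughly 96 tokens/s at batch=1; real figures land below this because of attention, KV reads, and kernel overhead.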

Quantisation Impact

Quantisation Bits/param 70B VRAM Quality Speed (relative)
F16 16 140 GB Baseline 1.0×
Q8_0 8 70 GB ~99% F16 1.8×
Q4_K_M 4.5 (avg) 39 GB ~97% F16 2.9×
Q2_K 2.6 23 GB ~85% F16 3.8×

KV Cache Sizing

KV cache competes with model weights for VRAM. After loading the model, compute the remaining VRAM and size the KV cache accordingly:

available_for_kv = total_vram − model_bytes − activation_headroom (~2 GB)
max_context = available_for_kv / (2 × n_layers × n_kv_heads × head_dim × dtype_bytes)
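As code, with all sizes in bytes (the worked numbers in the usage note assume a GQA shape of 80 layers and 8 KV heads, which is an assumption about the model, not a stated spec):

```rust
/// Largest context (in tokens, floored) that fits after the model is
/// loaded, per the sizing formula above.
fn max_context(total_vram: u64, model_bytes: u64, headroom: u64,
               n_layers: u64, n_kv_heads: u64, head_dim: u64,
               dtype_bytes: u64) -> u64 {
    let available = total_vram.saturating_sub(model_bytes + headroom);
    available / (2 * n_layers * n_kv_heads * head_dim * dtype_bytes)
}
```

An 80 GB card holding 39 GB of Q4_K_M weights with 2 GB headroom leaves 39 GB for KV; at 80 layers, 8 KV heads, head_dim 128, fp16 that supports roughly 119K tokens of context.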

Optimal Batch Sizes by Hardware

Hardware VRAM Max Model (Q4_K_M) Optimal Batch Expected tok/s
RTX 4090 24 GB 30B 4–8 ~80 tok/s
A100 80GB 80 GB 70B 8–16 ~180 tok/s
H100 SXM 80 GB 70B (Q8) / 40B (F16) 16–32 ~400 tok/s
4× H100 (TP=4) 320 GB 405B (Q4_K_M) 32–64 ~320 tok/s
M2 Ultra (Mac) 192 GB (unified) 70B (Q8) 2–4 ~25 tok/s

NUMA Awareness

On multi-socket servers, pin LINUS-AI workers to the NUMA node that physically connects to the target GPU. Cross-NUMA memory accesses add ~100 ns latency and reduce PCIe bandwidth. Use numactl on Linux:

NUMA pinning
# Find which NUMA node each GPU is on
$ nvidia-smi topo -m
 
# Pin linus-ai to NUMA node 0 (GPU 0,1 are on node 0)
$ numactl --cpunodebind=0 --membind=0 linus-ai --serve --gpu-ids 0,1

Huge Pages

For very large KV caches (>8 GB) allocate 2 MB huge pages to reduce TLB pressure. The inference engine will use huge pages automatically if they are available.

huge pages — Linux
# Allocate 4096 × 2MB huge pages = 8 GB
$ echo 4096 | sudo tee /proc/sys/vm/nr_hugepages
 
# Persist across reboots
$ echo "vm.nr_hugepages = 4096" | sudo tee -a /etc/sysctl.conf

Scalability Limits

The following limits reflect what has been tested in the LINUS-AI engineering environment. Real-world limits depend on network topology, model architecture, and workload characteristics.

Scale Envelope

Dimension Tested Limit Practical Maximum Bottleneck
GPUs per node (TP) 8× H100 640 GB VRAM → 405B params NVLink bandwidth at TP=8
Nodes in mesh (PP) 32 nodes ~2,560 GB VRAM across 32× H100 Network inter-stage latency
Concurrent streams 100+ (continuous batching) KV cache VRAM exhaustion Memory bandwidth (decode)
Context length 128K tokens (with offloading) VRAM + RAM KV cache budget KV cache size
Model size 405B (Llama 3.1 405B, Q4_K_M) Bound by total cluster VRAM Compute throughput (prefill)

Bottleneck Analysis

MEM

Memory Bandwidth (Decode Phase)

Each decode step loads the full model weight set from VRAM. At 70B params × 4 bits = 35 GB, a single H100 with 3.35 TB/s HBM3 bandwidth completes a weight load in ~10 ms, yielding a theoretical max of ~100 tokens/s at batch=1. Quantisation and batching are the primary levers.

COMP

Compute Throughput (Prefill Phase)

Prefill is matrix-multiplication-bound. H100 delivers 989 TFLOPS at BF16. A 70B model prefill of 4K tokens requires ~560 TFLOPs of compute (≈ 2 × params × tokens), taking ~0.57 s on one H100 at peak throughput. Longer prompts scale at least linearly; a 128K prompt takes ~18 s of prefill.

NET

Network (Mesh / Pipeline)

Inter-stage activations for a 70B model at BF16 are ~16 KB per token (hidden dim 8192), so ~2 MB per stage boundary for a 128-token micro-batch. At 100 Gbps Ethernet (12.5 GB/s), that is ~160 µs per boundary. A 3-stage pipeline with 100 concurrent sequences transfers on the order of 200 MB/s of activations — well within 100 Gbps capacity.

Security Architecture

LINUS-AI is designed assuming adversarial environments: network interception, binary inspection, and even hardware theft. The security model is explicit about what is protected and what requires trust.

Licence Signature Verification

Licence keys are signed with an Ed25519 private key held exclusively by the LINUS-AI licence server. The corresponding public key is embedded in the binary at compile time. On activation, the binary verifies the licence payload signature against the embedded public key. An attacker who reverse-engineers the binary obtains only the public key — which cannot be used to forge signatures. Brute-forcing the Ed25519 private key is computationally infeasible.

Machine Binding

machine_id = SHA256(hostname ‖ MAC_address ‖ platform_OS)

The machine ID is included in the licence payload at activation time. On each startup, LINUS-AI recomputes the machine ID and compares it against the licence. A mismatch causes a graceful startup failure with a clear error message. Seat transfers are handled via the licence portal: old seat deactivated server-side, new activation issued.

Threat Model

Threat Mitigation Residual Risk
Binary reverse engineering to forge licence Ed25519 asymmetric signing. Public key in binary; private key server-only. None — asymmetric crypto
Network interception (mesh traffic) mTLS with Ed25519 node certificates on all mesh connections. None — encrypted + authenticated
Memory dump to extract vault key Key is derived on-the-fly from hardware identifiers. No serialised key in heap. Key derivable if hardware identifiers accessible
Vault database stolen from disk All blobs AES-256-GCM encrypted. Key requires live hardware to derive. Unreadable without original hardware
Hardware theft (full machine) Attacker has CPUID + MAC — can derive vault key if hostname known. Use vault export + passphrase for high-value data
Audit log tampering Hash-chained blocks. Any modification invalidates all subsequent hashes. Detectable by --verify-audit
Process injection / side-channel Guardian monitors process integrity. Binary signed with Ed25519. OS-level security required for full protection

Vault Threat Model

Hardware unchanged: Vault key is fully derivable from CPUID + MAC + hostname. Data is accessible. This is the intended use case — the user owns the hardware.

Hardware stolen, hostname unknown: Attacker must brute-force the hostname component (usually trivial from OS config files on the same disk). Mitigation: enable full-disk encryption (LUKS, FileVault, BitLocker) in addition to vault encryption.

Hardware stolen, full-disk encryption enabled: Attacker cannot access the SQLite database or derive the hostname. Vault contents are protected.

Recommended Security Configuration

hardened production config.toml
# Production-hardened configuration
[server]
host = "127.0.0.1" # do not bind to 0.0.0.0 unless needed
api_keys = ["sha256:..."] # require API key auth
tls_cert = "/etc/linus-ai/tls/cert.pem"
tls_key = "/etc/linus-ai/tls/key.pem"
 
[vault]
enabled = true
encrypt = true
 
[mesh]
mtls = true # enforced by default
allowed_fingerprints = [ # allowlist node certs
"sha256:abc123...",
"sha256:def456...",
]
 
[audit]
enabled = true
chain_file = "/var/lib/linus-ai/audit.chain"