System Architecture Overview
LINUS-AI ships as a single statically-linked binary that embeds every subsystem needed for private AI inference. There are no runtime service dependencies, no daemon managers, and no network calls to external infrastructure after licence activation.
Component Diagram
┌─────────────────────────────────────────────────────────────────┐
│                         linus-ai binary                         │
│                                                                 │
│  ┌──────────────────────┐      ┌──────────────────────────────┐ │
│  │     HTTP Server      │      │         Agent Engine         │ │
│  │        (Axum)        │◄─────│  (tool-use, RAG, planning)   │ │
│  │      port 8080       │      └──────────────┬───────────────┘ │
│  └──────────┬───────────┘                     │                 │
│             │ request                         │ inference call  │
│             ▼                                 ▼                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │                     Inference Engine                     │   │
│  │        (linus-ai-inference crate / llama.cpp FFI)        │   │
│  │     continuous batching · KV cache · flash attention     │   │
│  │    speculative decoding · GGUF loader · quantisation     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
│  ┌─────────────┐  ┌──────────────┐  ┌────────────────────┐      │
│  │    Vault    │  │  Mesh Node   │  │  Blockchain Audit  │      │
│  │ AES-256-GCM │  │ linus-ai-net │  │   (audit.chain)    │      │
│  │  HKDF key   │  │ mTLS · QUIC  │  │    hash-chained    │      │
│  └─────────────┘  └──────────────┘  └────────────────────┘      │
│                                                                 │
│  ┌─────────────────────────┐  ┌────────────────────────────┐    │
│  │     Thermal Monitor     │  │      Guardian Process      │    │
│  │ linus-ai-thermal crate  │  │  linus-ai-guardian crate   │    │
│  │  NVML · sysfs · IOKit   │  │  watchdog · OOM · upgrade  │    │
│  └─────────────────────────┘  └────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘
Data Flow
User Request (HTTP POST /v1/chat/completions)
│
▼
HTTP Layer (Axum)
├─ auth check (API key / no-auth localhost)
├─ rate limiting
└─ route to handler
│
▼
Agent Engine
├─ tool-use planning (if agentic mode)
├─ RAG retrieval from Vault (semantic search over HNSW)
└─ build prompt context
│
▼
Inference Engine
├─ tokenise input
├─ prefill forward pass ← compute-bound
├─ KV cache store
└─ decode loop (token-by-token) ← memory-bandwidth-bound
│
▼
Response Stream (SSE / chunked JSON)
├─ tokens streamed as generated
├─ audit record appended (token counts only)
└─ vault store (if memory enabled)
Key Design Principles
Single Binary
The entire stack — HTTP server, inference engine, vault, mesh, audit, thermal monitor, and guardian — compiles into one statically-linked binary. Deployment is a file copy. No dependency managers, no container runtimes, no separate services to keep in sync.
Zero-Cloud Default
After the one-time licence activation (which contacts the licence server over HTTPS), LINUS-AI operates fully air-gapped. All model weights, embeddings, conversation history, and generated text remain on the host hardware. No telemetry callbacks, no model API proxies, no logging to remote endpoints.
Privacy-First
Conversation content is never written to disk in plaintext. Vault entries are encrypted before storage. The audit ledger records only metadata (timestamps, token counts) — never prompt or response content. Key material is derived fresh at startup from hardware identifiers and never persisted.
Inference Engine
The inference engine wraps llama.cpp through the linus-ai-inference
Rust crate, exposing a safe async API over an unsafe FFI boundary. The crate handles GGUF model
loading, context management, and generation scheduling.
llama.cpp Integration (Rust FFI)
The linus-ai-inference crate uses a thin unsafe extern "C" block to
call into the llama.cpp C API. The Rust wrapper adds ownership semantics, error propagation via
Result<T, InferenceError>, and async compatibility through
tokio::task::spawn_blocking for the CPU-intensive forward pass. CUDA kernels are launched
asynchronously on dedicated streams while the host thread blocks until completion; running that wait inside spawn_blocking keeps the tokio event loop free.
GGUF Model Format
GGUF (GPT-Generated Unified Format) stores model weights, tokeniser vocabulary, and metadata
in a single file with a self-describing header. LINUS-AI supports all quantisation levels
that llama.cpp exposes: Q2_K, Q3_K_S/M/L, Q4_0,
Q4_K_S/M, Q5_K_S/M, Q6_K, Q8_0,
and F16/BF16. GGUF tensors can be memory-mapped (mmap)
from disk, avoiding a full load before the first token is produced.
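The self-describing header can be sanity-checked with a few bytes of parsing. A minimal sketch, assuming only the documented 4-byte GGUF magic followed by a little-endian u32 version (the full header also carries tensor and metadata tables, omitted here):

```rust
// Sketch: validate a GGUF header's magic and read its version.
// Assumes magic "GGUF" followed by a little-endian u32 version.
fn parse_gguf_header(bytes: &[u8]) -> Result<u32, &'static str> {
    if bytes.len() < 8 {
        return Err("header too short");
    }
    if bytes[0..4] != *b"GGUF" {
        return Err("bad magic");
    }
    Ok(u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]))
}

fn main() {
    // Fabricated 8-byte header: magic + version 3.
    let header = [b'G', b'G', b'U', b'F', 3, 0, 0, 0];
    println!("GGUF version: {:?}", parse_gguf_header(&header)); // Ok(3)
}
```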
Continuous Batching
Under a static batching scheme each generation request occupies one full forward pass. Continuous batching (also called in-flight batching) allows the engine to insert new requests into an ongoing decode step, sharing the forward pass across multiple sequences simultaneously. This dramatically improves GPU utilisation when there is a mix of short and long requests in the queue.
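A toy scheduler illustrates the idea; the admission policy and the max_batch knob here are illustrative, not engine defaults:

```rust
// Toy continuous-batching loop: each iteration admits waiting requests
// into the running batch, then performs one decode step for every
// active sequence. The remaining-token counts are illustrative.
fn simulate_steps(mut pending: Vec<u32>, max_batch: usize) -> u32 {
    let mut active: Vec<u32> = Vec::new();
    let mut steps = 0;
    while !active.is_empty() || !pending.is_empty() {
        // The "continuous" part: requests join mid-flight instead of
        // waiting for the whole batch to drain.
        while active.len() < max_batch && !pending.is_empty() {
            active.push(pending.remove(0));
        }
        for remaining in active.iter_mut() {
            *remaining -= 1; // one token generated per sequence per step
        }
        active.retain(|&remaining| remaining > 0);
        steps += 1;
    }
    steps
}

fn main() {
    // Requests of 4, 1 and 1 tokens with batch capacity 2 finish in
    // 4 steps; static batching would idle the short request's slot
    // and need 5.
    println!("steps = {}", simulate_steps(vec![4, 1, 1], 2)); // steps = 4
}
```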
KV Cache
The key-value cache stores attention keys and values for all previously computed tokens, avoiding redundant recomputation during the decode phase. Cache size in bytes:
2 × n_layers × n_kv_heads × head_dim × context_len × bytes_per_element
For a 70B-class model with full multi-head attention (80 layers, 64 KV heads, head_dim=128, context 8192, fp16):
2 × 80 × 64 × 128 × 8192 × 2 bytes = ~21.5 GB
Grouped-query models such as Llama-3 70B use 8 KV heads instead of 64, cutting this by 8×.
When the cache outgrows VRAM, set kv_cache_type = "offload" to spill it to pinned host RAM.
Offloaded cache incurs a PCIe round-trip per decode step, so keep context windows as small as the
workload allows to maximise VRAM residency.
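The sizing rule above as a checkable function, reproducing the worked 70B example:

```rust
// KV cache size = 2 (K and V) × layers × kv_heads × head_dim
//               × context_len × bytes_per_element.
fn kv_cache_bytes(layers: u64, kv_heads: u64, head_dim: u64, ctx: u64, elem_bytes: u64) -> u64 {
    2 * layers * kv_heads * head_dim * ctx * elem_bytes
}

fn main() {
    // The worked example: 80 layers, 64 KV heads, head_dim 128,
    // context 8192, fp16 (2 bytes per element).
    let bytes = kv_cache_bytes(80, 64, 128, 8192, 2);
    println!("{:.1} GB", bytes as f64 / 1e9); // 21.5 GB
}
```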
Speculative Decoding
Speculative decoding pairs a small fast draft model with the main model. The draft model autoregressively generates K tokens in a single pass; the main model then verifies all K tokens in one forward pass (parallel, not sequential). Accepted tokens are kept; the first rejected token causes a rollback. Typical accept rates of 80–90% yield 2–4× throughput improvements on memory-bandwidth-bound workloads.
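The accept/rollback step can be sketched with exact-match verification (real engines accept probabilistically against the target distribution; exact match is a simplification):

```rust
// Exact-match stand-in for speculative verification: count the accepted
// prefix; everything from the first mismatch onward is rolled back.
fn accepted_prefix(draft: &[u32], target: &[u32]) -> usize {
    draft.iter().zip(target).take_while(|(d, t)| d == t).count()
}

fn main() {
    // The draft proposed 4 tokens; the target model disagrees at
    // position 2, so 2 tokens are kept and the rest regenerated.
    println!("accepted = {}", accepted_prefix(&[5, 9, 3, 7], &[5, 9, 1, 7])); // accepted = 2
}
```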
Flash Attention
Flash Attention (Tri Dao et al.) reorders attention computation to be IO-aware, avoiding
materialisation of the full N×N attention matrix in VRAM. Instead, the computation is tiled and
fused into a single CUDA kernel, reducing memory reads/writes from O(N²) to O(N). Enabled by
default when the CUDA backend is active and the GPU has compute capability ≥ 8.0 (Ampere+).
Set flash_attention = false to disable.
Memory Layout
| Component | Location | Notes |
|---|---|---|
| Model weights | VRAM primary | Quantised tensors loaded into GPU memory. Overflow layers CPU-offloaded via gpu_layers. |
| KV cache | VRAM primary | Spills to pinned RAM when VRAM exhausted. Configurable via kv_cache_type. |
| Activations | VRAM (transient) | Held only during forward pass. Flash Attention reduces peak activation memory. |
| Tokeniser vocab | RAM | Small (<50 MB). Always CPU-resident. |
| GGUF file | Disk (mmap) | Memory-mapped; OS page cache handles eviction. |
Tensor Parallelism
Tensor parallelism (TP) splits the weight matrices of each transformer layer across N GPUs. Each GPU holds a vertical shard (column-wise partition) of the weight matrix and computes its partial result independently. The partial results are summed via an all-reduce collective after each linear layer, producing the same output as if the full matrix were on one device.
How It Works
Full weight W [d_model × d_ff]  →  split column-wise across 4 GPUs:

  GPU 0:  W[:,        0 :   d_ff/4] × X  →  partial_0
  GPU 1:  W[:,   d_ff/4 :   d_ff/2] × X  →  partial_1
  GPU 2:  W[:,   d_ff/2 : 3*d_ff/4] × X  →  partial_2
  GPU 3:  W[:, 3*d_ff/4 :   d_ff  ] × X  →  partial_3

  AllReduce (sum):  output = partial_0 + partial_1 + partial_2 + partial_3

Each GPU holds 1/N of each weight matrix.
AllReduce bandwidth ∝ 2 × (N-1)/N × message_size per layer.
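The same decomposition in plain Rust, with ordinary Vec arithmetic standing in for the CUDA kernels and the in-loop sum standing in for AllReduce; the sharded result matches the unsharded matvec exactly:

```rust
// y = W·x decomposes over column blocks: y = Σ_g W[:, cols_g] · x[cols_g].
// Each loop iteration plays the role of one "GPU"; += is the AllReduce.
fn matvec(w: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    w.iter()
        .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum())
        .collect()
}

fn sharded_matvec(w: &[Vec<f64>], x: &[f64], shards: usize) -> Vec<f64> {
    let chunk = x.len() / shards;
    let mut out = vec![0.0; w.len()];
    for g in 0..shards {
        let (lo, hi) = (g * chunk, (g + 1) * chunk);
        for (i, row) in w.iter().enumerate() {
            // Partial product over this shard's columns only.
            let partial: f64 = row[lo..hi].iter().zip(&x[lo..hi]).map(|(a, b)| a * b).sum();
            out[i] += partial; // AllReduce (sum)
        }
    }
    out
}

fn main() {
    let w = vec![vec![1.0, 2.0, 3.0, 4.0], vec![5.0, 6.0, 7.0, 8.0]];
    let x = [1.0, 1.0, 2.0, 2.0];
    println!("{:?} == {:?}", sharded_matvec(&w, &x, 2), matvec(&w, &x));
}
```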
Configuration
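The shipped configuration schema is not reproduced in this section; the sketch below only illustrates the shape a TP setup might take. The tensor_parallel key name is an assumption, while gpu_layers and flash_attention appear elsewhere in this document:

```toml
# Hypothetical single-host TP=4 configuration. The tensor_parallel key
# is an assumption for illustration, not the shipped schema.
[inference]
tensor_parallel = 4      # shard each weight matrix across 4 GPUs
gpu_layers = -1          # keep every layer on GPU
flash_attention = true   # Ampere+ only; see Flash Attention above
```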
Bandwidth Requirements
| Interconnect | Bandwidth (bidirectional) | Max TP Degree | Recommended Use |
|---|---|---|---|
| NVLink 4.0 | 1.8 TB/s | 8 | H100 SXM clusters — optimal for all TP degrees |
| NVLink 3.0 | 600 GB/s | 8 | A100 SXM clusters — excellent for TP ≤ 8 |
| PCIe 5.0 × 16 | 128 GB/s | 4 | Consumer / workstation GPUs — TP ≤ 4, watch latency |
| PCIe 4.0 × 16 | 64 GB/s | 2 | TP 2 only. AllReduce becomes bottleneck at TP 4+ |
RPC Mode vs Native CUDA Mode
Native CUDA mode is used when all GPUs are on the same host. NCCL handles the AllReduce collective directly over NVLink or PCIe. No extra network stack is involved.
RPC mode (handled by the linus-ai-net crate) extends tensor
parallelism across hosts by serialising partial activations and shipping them over QUIC. The
bandwidth requirement is the same as native mode — RPC mode is intended for specialised
InfiniBand or 400G Ethernet clusters, not commodity networks. Use pipeline parallelism for
multi-node deployments on standard LAN.
Memory Savings
Per-GPU memory under tensor parallelism is approximately model_bytes / N + kv_cache_bytes / N.
At TP=8 on H100s, a 405B-parameter model (810 GB in fp16) becomes ~101 GB per GPU: still over the
80 GB per-GPU limit in fp16, but ~51 GB with fp8 quantisation, which fits.
When to Use Tensor Parallelism
Pipeline Parallelism
Pipeline parallelism (PP) partitions the transformer layers vertically: stage 0 holds the embedding layer and layers 0–K, stage 1 holds layers K+1–2K, and so on. Each stage processes one micro-batch, then passes its activations to the next stage while immediately starting work on the next micro-batch. The pipeline is filled by splitting each request into smaller micro-batches.
How It Works
Node A (Stage 0) │ Node B (Stage 1) │ Node C (Stage 2)
─────────────────┼───────────────────┼──────────────────
embed + L0-L15 │ L16-L31 │ L32-L47 + head
│ │
Time → [μB0]──▶[μB0]──▶[μB0]──▶ out
[μB1]──▶[μB1]──▶[μB1]──▶ out
[μB2]──▶[μB2]──▶[μB2]──▶ out
[μB3]──▶[μB3]──▶[μB3]──▶ out
Pipeline bubble = (stages - 1) / (micro_batches + stages - 1)
Reduce bubble by increasing micro_batch_count.
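The bubble formula above, as a checkable function:

```rust
// bubble = (stages - 1) / (micro_batches + stages - 1)
fn bubble_fraction(stages: u32, micro_batches: u32) -> f64 {
    (stages - 1) as f64 / (micro_batches + stages - 1) as f64
}

fn main() {
    // 3 stages, 4 micro-batches: 2/6 ≈ 33% of stage-time idle.
    println!("{:.2}", bubble_fraction(3, 4)); // 0.33
}
```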
Stage Assignment Algorithm
The default stage assignment distributes layers so that each stage has approximately equal parameter count, accounting for the embedding table (which lives on stage 0) and the LM head (stage N-1). Heterogeneous hardware is supported: nodes with lower peak TFLOPS receive fewer layers proportionally. The assignment is computed at cluster startup and re-emitted in the startup log.
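One way to sketch the proportional assignment is largest-remainder rounding over peak TFLOPS; the real algorithm also weights the embedding table and LM head, which this sketch omits:

```rust
// Largest-remainder split of `layers` across nodes, weighted by peak
// TFLOPS. Embedding-table and LM-head weighting are omitted.
fn assign_layers(layers: u32, tflops: &[f64]) -> Vec<u32> {
    let total: f64 = tflops.iter().sum();
    let exact: Vec<f64> = tflops.iter().map(|t| layers as f64 * t / total).collect();
    let mut out: Vec<u32> = exact.iter().map(|e| e.floor() as u32).collect();
    let mut assigned: u32 = out.iter().sum();
    // Hand leftover layers to the stages with the largest fractional part.
    let mut order: Vec<usize> = (0..tflops.len()).collect();
    order.sort_by(|&a, &b| {
        (exact[b] - exact[b].floor())
            .partial_cmp(&(exact[a] - exact[a].floor()))
            .unwrap()
    });
    let mut i = 0;
    while assigned < layers {
        out[order[i % order.len()]] += 1;
        assigned += 1;
        i += 1;
    }
    out
}

fn main() {
    // A fast node (312 TFLOPS) and a slower card (82.6 TFLOPS), 48 layers.
    println!("{:?}", assign_layers(48, &[312.0, 82.6])); // [38, 10]
}
```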
Inter-Stage Communication
| Transport | Use Case | Latency | Config |
|---|---|---|---|
| Unix socket | Stages on same host | <1 µs | transport = "unix" |
| TCP | LAN (same rack / datacenter) | ~100 µs | transport = "tcp" |
| QUIC | WAN or unreliable networks | ~5 ms | transport = "quic" |
Bubble Overhead and Micro-Batching
The pipeline bubble is idle time at the beginning and end of each batch where some stages have
no work. With micro_batch_size = 1 the bubble fraction is
(stages-1)/stages — very large. Increasing micro_batch_size amortises
the bubble at the cost of higher end-to-end latency per request. Tune based on your
throughput-vs-latency requirements.
Heterogeneous Hardware Support
If Node A has 2× A100 80GB and Node B has 1× RTX 4090 24GB, set
pipeline_stages = ["0-40", "41-47"] to assign 41 layers to the faster node and
only 7 to the slower one. The coordinator monitors per-stage completion times and can emit
warnings if a stage becomes a bottleneck.
When to Use Pipeline Parallelism
Mesh Networking
The linus-ai-net crate implements a P2P encrypted overlay network that allows
multiple LINUS-AI nodes to form an inference cluster without a centralised broker or VPN.
Nodes communicate directly over authenticated, encrypted channels.
Discovery
- LAN (mDNS): nodes broadcast _linusai._tcp.local mDNS records. Other nodes on the same L2 segment discover them automatically within seconds. No configuration required.
- WAN (static peers): set peers = ["10.0.0.2:9090", "10.0.0.3:9090"] in the [mesh] config. Nodes connect on startup and keep connections alive with heartbeats.
- Kubernetes: use the headless service DNS records as the peer list. Each pod's DNS name resolves to its pod IP. Set peers = ["linus-ai-0.linus-ai.default.svc:9090", ...].
Mutual TLS
Each node generates an Ed25519 keypair and a self-signed X.509 certificate on first run,
stored in ~/.linus_ai/mesh/. When two nodes connect, they perform a full mTLS
handshake — both sides authenticate each other. The coordinator's public key fingerprint is
used as the cluster identity: workers that present a different coordinator cert are rejected.
Node Roles
| Role | Responsibilities | Exposes HTTP? |
|---|---|---|
| coordinator | Accepts client requests, orchestrates layer assignment, routes activation tensors between pipeline stages, monitors worker liveness. | Yes — port 8080 |
| worker | Receives activation tensors from the upstream stage, runs the assigned layers, sends activations to the downstream stage. | No — mesh port only |
Fault Tolerance
The coordinator sends heartbeat pings to each worker every 5 seconds. A worker that fails to respond within 3 consecutive heartbeat intervals (15 s default) is marked dead. The coordinator then reassigns that worker's layers to the remaining workers, redistributing the model partition map. In-flight requests that were routed through the dead worker are cancelled and retried. No manual intervention required.
Topology
Within one cluster: full mesh (each worker ↔ every other worker)
Worker A ──── Worker B
│ ╲ ╱ │
│ ╲ ╱ │
│ ╲ ╱ │
Worker C ──── Worker D
│
Coordinator (exposes port 8080 to clients)
Between clusters: star topology
Cluster 1 Cluster 2
┌─────────┐ ┌─────────┐
│ Coord 1 │◄───────►│ Coord 2 │
└─────────┘ └─────────┘
▲ ▲
workers... workers...
Wire Protocol
LINUS-AI uses a custom binary protocol over QUIC (primary) or TCP (fallback). Each message
has a fixed 16-byte header: [magic(4)] [version(2)] [msg_type(2)] [payload_len(8)],
followed by a length-prefixed payload. Activation tensors are transmitted in their raw binary
representation without re-serialisation overhead. Port 9090 is the default (configurable via
listen_port).
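A sketch of framing against that layout. The magic constant and the little-endian byte order are assumptions for illustration; the real wire values are not documented here:

```rust
// 16-byte header: [magic(4)] [version(2)] [msg_type(2)] [payload_len(8)].
// The magic value and little-endian order are illustrative assumptions.
const MAGIC: [u8; 4] = *b"LNAI";

fn encode_header(version: u16, msg_type: u16, payload_len: u64) -> [u8; 16] {
    let mut h = [0u8; 16];
    h[0..4].copy_from_slice(&MAGIC);
    h[4..6].copy_from_slice(&version.to_le_bytes());
    h[6..8].copy_from_slice(&msg_type.to_le_bytes());
    h[8..16].copy_from_slice(&payload_len.to_le_bytes());
    h
}

fn decode_header(h: &[u8; 16]) -> Result<(u16, u16, u64), &'static str> {
    if h[0..4] != MAGIC {
        return Err("bad magic");
    }
    Ok((
        u16::from_le_bytes([h[4], h[5]]),
        u16::from_le_bytes([h[6], h[7]]),
        u64::from_le_bytes([h[8], h[9], h[10], h[11], h[12], h[13], h[14], h[15]]),
    ))
}

fn main() {
    let h = encode_header(1, 7, 4096);
    println!("{:?}", decode_header(&h)); // Ok((1, 7, 4096))
}
```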
Vault & Encrypted Storage
The vault provides persistent, searchable, end-to-end encrypted memory for LINUS-AI. Conversation summaries, user preferences, and knowledge snippets are stored encrypted and retrieved via semantic similarity search — without the vault key ever touching disk.
Key Derivation
On startup, LINUS-AI collects three hardware identifiers: CPUID (processor model string), MAC address (first non-loopback interface), and hostname. These are concatenated and fed into HKDF-SHA256 with a fixed salt to produce a 256-bit AES key. The key exists only in process memory. No key file is created.
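The derivation shape, sketched with a trivial FNV-1a stand-in so the example is dependency-free. The real system uses HKDF-SHA256 with a fixed salt; the stand-in hash is not cryptographically secure, and the salt value below is an assumption:

```rust
// Shape of the startup derivation: ikm = cpuid | mac | hostname, then a
// salted derivation. FNV-1a stands in for HKDF-SHA256 purely so this
// sketch is self-contained; it is NOT cryptographically secure.
fn stand_in_kdf(salt: &[u8], ikm: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in salt.iter().chain(ikm) {
        h ^= b as u64;
        h = h.wrapping_mul(0x1_0000_0001_b3);
    }
    h
}

fn derive_vault_key(cpuid: &str, mac: &str, hostname: &str) -> u64 {
    let ikm = format!("{}|{}|{}", cpuid, mac, hostname);
    stand_in_kdf(b"linus-ai-vault-v1", ikm.as_bytes()) // salt is an assumption
}

fn main() {
    // Same hardware identifiers always derive the same key; the key
    // lives only in process memory and is never written to disk.
    let k = derive_vault_key("GenuineIntel", "aa:bb:cc:dd:ee:ff", "workstation");
    println!("key = {:016x}", k);
}
```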
Encryption
- Algorithm AES-256-GCM. Each record has its own 96-bit random nonce. The nonce is stored alongside the ciphertext.
- Auth tag 128-bit GCM authentication tag appended after ciphertext. Tampering with any byte causes authentication failure.
- AAD Record ID and timestamp are used as Additional Authenticated Data, preventing record transplantation attacks.
Storage Format
Encrypted blobs are stored in a SQLite database (~/.linus_ai/vault.db) using
Write-Ahead Logging (WAL) mode for concurrent read safety. The schema has two tables:
records (id, nonce, ciphertext, tag, created_at) and
embeddings (id, vector BLOB). Embeddings are stored in plaintext — they are
dense floating-point vectors that encode semantic meaning without containing readable text.
Semantic Index (HNSW)
The vault maintains an HNSW (Hierarchical Navigable Small World) graph index over embedding
vectors, enabling approximate nearest-neighbour search in sub-millisecond time across millions
of records. The HNSW graph is built in memory at startup from the embeddings table
and updated incrementally as new records are inserted. The index is not encrypted because
embeddings do not contain reconstructable content.
Export / Import
Blockchain Audit Ledger
The audit ledger (Team and Enterprise tiers) provides a tamper-evident, append-only log of AI interactions for compliance and governance. It proves that interactions occurred and characterises them — without storing any sensitive content.
Purpose
Regulated industries (finance, healthcare, legal) require evidence that AI outputs were reviewed, that the correct model was used, and that usage volumes are auditable. The ledger satisfies these requirements while preserving user privacy: prompt text and response text are never written to the ledger.
Block Structure
struct AuditBlock {
block_index : u64, // monotonically increasing
timestamp : i64, // Unix epoch, nanoseconds
session_id : [u8; 16], // UUIDv4, per-conversation
model : [u8; 64], // model name, null-padded
prompt_tokens : u32, // input token count
completion_tokens: u32, // output token count
user_id : [u8; 32], // SHA256(username), never plaintext
prev_hash : [u8; 32], // SHA256 of previous block bytes
block_hash : [u8; 32], // SHA256(all fields above)
}
// Total: 200 bytes per block when packed. NO prompt or response content.
Storage and Verification
- Location: ~/.linus_ai/audit.chain, raw binary with hash-chained blocks; the file grows by ~200 bytes per request.
- Integrity check: recompute SHA256 of each block and verify block_hash == SHA256(block_bytes) and prev_hash == previous block_hash. Any modification invalidates all subsequent blocks.
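The verification loop, sketched with a non-cryptographic stand-in hash so the example is dependency-free (the real ledger hashes the serialised block bytes with SHA256):

```rust
// Hash-chain append and verification. toy_hash stands in for SHA256
// purely to keep the sketch self-contained.
struct Block {
    payload: Vec<u8>, // stand-in for the serialised block fields
    prev_hash: u64,
    block_hash: u64,
}

fn toy_hash(payload: &[u8], prev: u64) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325 ^ prev;
    for &b in payload {
        h ^= b as u64;
        h = h.wrapping_mul(0x1_0000_0001_b3);
    }
    h
}

fn append(chain: &mut Vec<Block>, payload: Vec<u8>) {
    let prev = chain.last().map_or(0, |b| b.block_hash);
    let block_hash = toy_hash(&payload, prev);
    chain.push(Block { payload, prev_hash: prev, block_hash });
}

fn verify_chain(chain: &[Block]) -> bool {
    let mut prev = 0u64; // genesis convention
    for b in chain {
        if b.prev_hash != prev || b.block_hash != toy_hash(&b.payload, b.prev_hash) {
            return false; // this and every later block is invalidated
        }
        prev = b.block_hash;
    }
    true
}

fn main() {
    let mut chain = Vec::new();
    append(&mut chain, vec![1, 2, 3]);
    append(&mut chain, vec![4, 5]);
    println!("intact: {}", verify_chain(&chain)); // intact: true
    chain[0].payload[0] ^= 1; // tamper one byte
    println!("tampered ok: {}", verify_chain(&chain)); // tampered ok: false
}
```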
Thermal Management
The linus-ai-thermal crate polls GPU and CPU temperatures on a configurable
interval (default 2 s) and applies throttling policies to prevent hardware damage during
sustained inference workloads.
Temperature Sources
| Platform | GPU Source | CPU Source |
|---|---|---|
| Linux / NVIDIA | NVML (nvmlDeviceGetTemperature) | sysfs /sys/class/thermal/thermal_zone*/temp |
| Linux / AMD | ROCm SMI (rsmi_dev_temp_metric_get) | sysfs /sys/class/hwmon/*/temp*_input |
| macOS | IOKit (IOServiceGetMatchingService) | IOKit SMC sensors |
| Windows | NVML / ADL SDK | WMI MSAcpi_ThermalZoneTemperature |
Throttling Policy
- < 80°C (normal): no action taken.
- 80–89°C (warn): warning logged. Metric linus_ai_thermal_state set to 1. No throughput impact.
- ≥ 90°C (throttle): max batch size halved and a 20 ms inter-step delay inserted. Metric state set to 2. Warning emitted to stderr and the structured log.
- ≥ 95°C (critical): all new requests rejected (503). In-flight requests complete. Inference resumes once the temperature stays below 88°C for 30 s.
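The temperature-to-state mapping as a function (0 normal, 1 warn, 2 throttle, 3 critical); the stateful 30 s below-88°C hysteresis for leaving the critical state is omitted here:

```rust
// Map a temperature reading to the throttle state above.
fn thermal_state(temp_c: f64) -> u8 {
    if temp_c >= 95.0 {
        3 // critical: reject new requests
    } else if temp_c >= 90.0 {
        2 // throttle: halve batch, insert inter-step delay
    } else if temp_c >= 80.0 {
        1 // warn: log only
    } else {
        0 // normal
    }
}

fn main() {
    for &t in [70.0, 85.0, 92.0, 96.0].iter() {
        println!("{}°C → state {}", t, thermal_state(t));
    }
}
```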
Prometheus Metrics
Guardian Process
In server mode, linus-ai-guardian acts as a parent process that supervises the
main inference server, ensuring availability, graceful upgrades, and controlled restarts
without human intervention.
Architecture
PID 1001 linus-ai-guardian ← parent supervisor
PID 1002 linus-ai --serve ← main inference server
├─ HTTP server (8080)
├─ Inference engine
├─ Mesh node
└─ Vault
Guardian ↔ Main IPC: Unix socket at /var/run/linus-ai-ctl.sock
Messages: reload | shutdown | status | heartbeat
Watchdog
- Crash restart: main process exit detected via waitpid. Guardian restarts with exponential backoff: 1 s, 2 s, 4 s, 8 s, capped at 16 s. After 5 consecutive crashes within 60 s, guardian enters fail-safe mode and pages ops via the LINUS_AI_ALERT_URL webhook.
- OOM kill threshold: guardian polls /proc/PID/status (Linux) or task_info (macOS). If RSS exceeds oom_threshold_gb, guardian sends SIGTERM, waits drain_timeout, then sends SIGKILL.
- Deadlock detection: the main process sends a heartbeat every 5 s over the control socket. Guardian expects one within 30 s; a missed heartbeat is treated as a deadlock and forces a restart.
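The restart backoff schedule as a function:

```rust
// Backoff: 1 s doubling per consecutive crash, capped at 16 s.
fn backoff_secs(attempt: u32) -> u32 {
    1u32 << attempt.min(4)
}

fn main() {
    let schedule: Vec<u32> = (0..6).map(backoff_secs).collect();
    println!("{:?}", schedule); // [1, 2, 4, 8, 16, 16]
}
```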
Zero-Downtime Upgrade
Control Interface
- PID file: /var/run/linus-ai.pid, written by guardian on startup.
- Control socket: /var/run/linus-ai-ctl.sock, Unix domain socket for IPC messages.
- Reload config: linus-ai --reload sends the reload message; main re-reads config.toml without restarting.
- Graceful shutdown: linus-ai --shutdown drains requests, then guardian exits cleanly.
- Status: linus-ai --status returns PID, uptime, version, request counts, and memory usage.
Performance Tuning
Extracting maximum throughput from LINUS-AI requires understanding the two distinct phases of inference, the memory hierarchy, and the knobs that control scheduling.
Prefill vs Decode Phases
| Phase | Bottleneck | Optimisation |
|---|---|---|
| Prefill (process input) | Compute-bound — processes all input tokens in parallel; GPU FLOPS is the limit. | Larger batch sizes. Flash Attention. Mixed-precision (BF16 on Ampere+). |
| Decode (generate output) | Memory-bandwidth-bound — generates one token per step; must load all weights each step. | Quantisation (reduces bytes/param). Speculative decoding. Continuous batching. |
Token Throughput Formula
Decode throughput is approximately batch_size / step_time, where step_time is dominated by
weight loading (decode phase) or matrix multiplication (prefill). On a single H100 running a
Q4_K_M 70B model, expect ~35 tokens/s at batch=1 and ~280 tokens/s at batch=8 (continuous batching).
Quantisation Impact
| Quantisation | Bits/param | 70B VRAM | Quality | Speed (relative) |
|---|---|---|---|---|
| F16 | 16 | 140 GB | Baseline | 1.0× |
| Q8_0 | 8 | 70 GB | ~99% F16 | 1.8× |
| Q4_K_M | 4.5 (avg) | 39 GB | ~97% F16 | 2.9× |
| Q2_K | 2.6 | 23 GB | ~85% F16 | 3.8× |
KV Cache Sizing
KV cache competes with model weights for VRAM. After loading the model, compute the remaining VRAM and size the KV cache to fit that budget.
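A sketch of that sizing step; the 10% activation margin is an illustrative assumption:

```rust
// Token budget = (VRAM - weights - margin) / per-token KV bytes.
// The margin reserves headroom for activations and fragmentation.
fn kv_token_budget(vram: u64, weights: u64, per_token_kv: u64) -> u64 {
    let margin = vram / 10; // 10% headroom: illustrative assumption
    vram.saturating_sub(weights + margin) / per_token_kv
}

fn main() {
    // 80 GB card, 39 GB of Q4_K_M 70B weights, ~5.24 MB of KV per token
    // (the multi-head example above: 2 × 80 × 64 × 128 × 2 bytes).
    let budget = kv_token_budget(80_000_000_000, 39_000_000_000, 5_242_880);
    println!("≈ {} tokens shared by all concurrent sequences", budget);
}
```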
Optimal Batch Sizes by Hardware
| Hardware | VRAM | Max Model (Q4_K_M) | Optimal Batch | Expected tok/s |
|---|---|---|---|---|
| RTX 4090 | 24 GB | 30B | 4–8 | ~80 tok/s |
| A100 80GB | 80 GB | 70B | 8–16 | ~180 tok/s |
| H100 SXM | 80 GB | 70B (Q8) / 40B (F16) | 16–32 | ~400 tok/s |
| 4× H100 (TP=4) | 320 GB | 405B (Q4_K_M) | 32–64 | ~320 tok/s |
| M2 Ultra (Mac) | 192 GB (unified) | 70B (Q8) | 2–4 | ~25 tok/s |
NUMA Awareness
On multi-socket servers, pin LINUS-AI workers to the NUMA node that physically connects to
the target GPU. Cross-NUMA memory accesses add ~100 ns of latency and reduce PCIe bandwidth.
Use numactl on Linux to bind both CPU scheduling and memory allocation.
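An illustrative invocation; node index 0 is an assumption and should be matched to the NUMA node attached to your GPU (for example via nvidia-smi topo -m):

```shell
# Bind CPU scheduling and memory allocation to NUMA node 0 (assumed index).
numactl --cpunodebind=0 --membind=0 linus-ai --serve
```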
Huge Pages
For very large KV caches (>8 GB) allocate 2 MB huge pages to reduce TLB pressure. The inference engine will use huge pages automatically if they are available.
Scalability Limits
The following limits reflect what has been tested in the LINUS-AI engineering environment. Real-world limits depend on network topology, model architecture, and workload characteristics.
Scale Envelope
| Dimension | Tested Limit | Practical Maximum | Bottleneck |
|---|---|---|---|
| GPUs per node (TP) | 8× H100 | 640 GB VRAM → 405B params | NVLink bandwidth at TP=8 |
| Nodes in mesh (PP) | 32 nodes | ~2,560 GB VRAM across 32× H100 | Network inter-stage latency |
| Concurrent streams | 100+ (continuous batching) | KV cache VRAM exhaustion | Memory bandwidth (decode) |
| Context length | 128K tokens (with offloading) | VRAM + RAM KV cache budget | KV cache size |
| Model size | 405B (Llama 3.1 405B, Q4_K_M) | Bound by total cluster VRAM | Compute throughput (prefill) |
Bottleneck Analysis
Memory Bandwidth (Decode Phase)
Each decode step loads the full model weight set from VRAM. At 70B params × 4 bits = 35 GB, a single H100 with 3.35 TB/s of HBM3 bandwidth completes a weight load in ~10 ms, yielding a theoretical maximum of ~100 tokens/s at batch=1. Quantisation and batching are the primary levers.
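The arithmetic as a checkable function:

```rust
// step_time ≈ weight_bytes / bandwidth; batch=1 ceiling is 1 / step_time.
fn decode_ceiling(weight_bytes: f64, bandwidth: f64) -> (f64, f64) {
    let step_s = weight_bytes / bandwidth;
    (step_s, 1.0 / step_s)
}

fn main() {
    // 70B × 4 bits ≈ 35 GB over 3.35 TB/s HBM3.
    let (step, toks) = decode_ceiling(35e9, 3.35e12);
    println!("step ≈ {:.1} ms, ceiling ≈ {:.0} tok/s", step * 1e3, toks);
}
```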
Compute Throughput (Prefill Phase)
Prefill is matrix-multiplication-bound. H100 delivers 989 TFLOPS at BF16. A 70B model prefill of 4K tokens requires roughly 570 TFLOPs of compute (2 × params × tokens), taking about 0.6 s on one H100 at peak throughput. Longer prompts scale linearly; a 128K-token prompt takes roughly 18 s of prefill.
Network (Mesh / Pipeline)
Inter-stage activation tensors for a 70B model at BF16 are ~2 MB per layer boundary per sequence. At 100 Gbps Ethernet (12.5 GB/s), this is ~160 µs per boundary. A 3-stage pipeline with 100 concurrent sequences transfers 200 MB/s of activations — well within 100 Gbps capacity.
Security Architecture
LINUS-AI is designed assuming adversarial environments: network interception, binary inspection, and even hardware theft. The security model is explicit about what is protected and what requires trust.
Licence Signature Verification
Licence keys are signed with an Ed25519 private key held exclusively by the LINUS-AI licence server. The corresponding public key is embedded in the binary at compile time. On activation, the binary verifies the licence payload signature against the embedded public key. An attacker who reverse-engineers the binary obtains only the public key — which cannot be used to forge signatures. Brute-forcing the Ed25519 private key is computationally infeasible.
Machine Binding
The machine ID is included in the licence payload at activation time. On each startup, LINUS-AI recomputes the machine ID and compares it against the licence. A mismatch causes a graceful startup failure with a clear error message. Seat transfers are handled via the licence portal: old seat deactivated server-side, new activation issued.
Threat Model
| Threat | Mitigation | Residual Risk |
|---|---|---|
| Binary reverse engineering to forge licence | Ed25519 asymmetric signing. Public key in binary; private key server-only. | None — asymmetric crypto |
| Network interception (mesh traffic) | mTLS with Ed25519 node certificates on all mesh connections. | None — encrypted + authenticated |
| Memory dump to extract vault key | Key is derived on-the-fly from hardware identifiers. No serialised key in heap. | Key derivable if hardware identifiers accessible |
| Vault database stolen from disk | All blobs AES-256-GCM encrypted. Key requires live hardware to derive. | Unreadable without original hardware |
| Hardware theft (full machine) | Attacker has CPUID + MAC — can derive vault key if hostname known. | Use vault export + passphrase for high-value data |
| Audit log tampering | Hash-chained blocks. Any modification invalidates all subsequent hashes. | Detectable by --verify-audit |
| Process injection / side-channel | Guardian monitors process integrity. Binary signed with Ed25519. | OS-level security required for full protection |
Vault Threat Model
Hardware stolen, hostname unknown: Attacker must brute-force the hostname component (usually trivial from OS config files on the same disk). Mitigation: enable full-disk encryption (LUKS, FileVault, BitLocker) in addition to vault encryption.
Hardware stolen, full-disk encryption enabled: Attacker cannot access the SQLite database or derive the hostname. Vault contents are protected.