LINUS-AI

System Architecture & Flow Diagrams  ·  v1.4.0  ·  README  ·  SPEC

Overview
Runtime Flow
Tensor Parallel
Pipeline Parallel
Payment Gate
Mesh Discovery
Launch Shell
Blockchain Ledger
Build Pipeline
Test Suite
System Overview
LINUS-AI is a dual-runtime local AI system: a Rust binary for production and a Python runtime for development. Both serve identical HTTP APIs and the same embedded control panel HTML.
User Browser
http://localhost:9480/app
Rust
linus_ai binary
axum 0.7 · embedded HTML
Python
main.py runtime
asyncio TCP · same routes
InferenceEngine
llama-server / ollama / cli
MeshNetwork
UDP beacon + HTTP probe
PaymentLedger
SQLite · internet billing
Blockchain
SHA-256 hash chain
Inference backends: 4 (llama-server · ollama · llama-cli · fallback)
Pipeline nodes max: N (unlimited; proportional RAM assignment)
Welcome credits: 5 (free inference units for new internet nodes)
Test coverage: 340 (Python 286 + Rust 54 unit tests)
Platforms: 7 (macOS · Linux · Windows · Android · iOS · WASM)
GUI tabs: 11 (Chat · Models · Mesh · Chorus · Tasks · Thermal · Ledger · Hardware · Permissions · Setup · Launch)
Subsystem Map
Subsystem            | Location                              | Role                          | Status
InferenceEngine      | linus-ai-rs/crates/linus-ai-inference | Select + run backend          | active
MeshNetwork          | linus-ai-rs/crates/linus-ai-net       | UDP beacon + peer DB          | active
PipelineOrchestrator | ai/pipeline.py                        | Layer-parallel inference      | v1.0
MeshModelManager     | ai/mesh_model_manager.py              | Auto model selection          | active
PaymentLedger        | blockchain/__init__.py                | Internet node billing         | active
TransparencyLedger   | blockchain/__init__.py                | SHA-256 hash chain audit      | active
Shell Handler        | main.py + linus-ai-http/src/lib.rs    | Launch shell (in-app)         | active
ThermalGovernor      | linus-ai-rs/crates/linus-ai-thermal   | 5-stage throttle              | active
Vault                | linus-ai-rs/crates/linus-ai-vault     | ChaCha20 secret store         | active
Task Scheduler       | linus-ai-rs/crates/linus-ai-task      | Distributed jobs              | optional
py2c Compiler        | py2c/                                 | Python→C→native cross-compile | bundled
Runtime Request Flow
End-to-end lifecycle of a user request — from browser input through the Inference Mode switcher to the correct backend. The mode is persisted in localStorage and restored on every page load.
1 — User Entry
Browser / Tray
http://localhost:9480/app
macOS · Windows · Linux
iOS · Android (remote)
Control Panel HTML
linus_ai_control_panel.html
embedded in binary via
include_str!()
setInferMode()
restored from localStorage
linus_ai_infer_mode
local · pipeline · tensor_rpc · tensor_native
Chat tab
user types prompt
clicks Send or presses Enter
2 — Inference Mode Routing (JS → API)
Local
Single-node inference.
JS calls: POST /infer
or POST /infer/stream (SSE)
Agentic: POST /agent/stream
Source: sendChat() in control panel
Pipeline
Layers distributed across N nodes.
Setup: Apply Plan in Setup tab → POST /pipeline/plan
JS calls: POST /pipeline/infer
Intra-pipeline: POST /pipeline/forward
Source: applyPipelinePlan(), ai/pipeline.py
Tensor RPC
Weight matrices split via llama.cpp RPC.
Setup: Apply Plan → POST /tensor/plan,
Start RPC Worker → POST /tensor/rpc/start
JS calls: POST /tensor/infer
Source: applyTensorPlan(), tensor_parallel.rs
Tensor Native (Phase 2)
AllReduce over HTTP, no external tools.
Setup: Apply Plan → POST /tensor/plan (backend=Native)
JS calls: POST /tensor/infer
AllReduce: POST /tensor/allreduce (TNSR frame)
Source: native_allreduce_infer(), tensor_parallel.rs
3 — Server Dispatch (Rust · Python)
axum
HTTP Router
linus-ai-http/src/lib.rs
routes all /infer* /tensor* /pipeline*
infer_handler()
local: InferenceEngine
llama-server · ollama · candle · cli
pipeline_infer()
PipelineOrchestrator
Python: ai/pipeline.py
tensor_infer()
rpc_infer() → llama-server --rpc
or native_allreduce_infer()
Response
JSON · SSE stream
returned to browser
4 — Local Backend Priority (InferenceEngine)
InferenceEngine
engine.rs
dispatch logic
1st
bundled llama.cpp
Metal / CUDA · feature: llama-bundled
2nd
candle (pure Rust)
Metal / CUDA · feature: candle-only
3rd
llama-server subprocess
external binary on PATH
4th
ollama subprocess
ollama serve on PATH
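The fallback order above can be sketched as a simple priority scan. This is a hypothetical illustration in Python, not the real engine.rs dispatch code; the probe set passed in stands for whatever availability checks the engine performs.

```python
# Backend priority order described in section 4, highest preference first.
BACKEND_PRIORITY = ["llama-bundled", "candle", "llama-server", "ollama"]

def select_backend(available: set[str]) -> str:
    """Return the first backend in priority order that is available,
    or the generic fallback when none of them are."""
    for backend in BACKEND_PRIORITY:
        if backend in available:
            return backend
    return "fallback"
```

For example, a build without the bundled llama.cpp feature but with candle compiled in would resolve to candle even if ollama is also installed.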
5 — Setup Tab → Mode Config (GUI Source Map)
Setup Card           | Action             | API endpoint           | JS function           | Source
Inference Mode       | Click mode button  | — (localStorage only)  | setInferMode()        | control panel JS
Tensor Parallelism   | Apply Plan         | POST /tensor/plan      | applyTensorPlan()     | tensor_parallel.rs
Tensor Parallelism   | ▶ Start RPC Worker | POST /tensor/rpc/start | tensorRpcStart()      | TpRpcWorker::spawn()
Tensor Parallelism   | ⟳ Status           | GET /tensor/status     | loadTensorStatus()    | tensor_status() handler
Pipeline Parallelism | Apply Plan         | POST /pipeline/plan    | applyPipelinePlan()   | ai/pipeline.py PipelineOrchestrator
Pipeline Parallelism | ⟳ Status           | GET /pipeline/plan     | loadPipelineStatus()  | _handle_pipeline_plan()
Runtime Controls     | Save               | POST /settings         | saveRuntimeSettings() | settings_save() handler
Tensor Parallelism — Weight Matrix Splitting
Every weight matrix in every transformer layer is split column-wise across N nodes. All nodes receive the same token and compute a 1/N partial result simultaneously. Results are summed (AllReduce) after each block. Rank 0 = coordinator — the only node that accepts client inference requests.
Coordinator (rank 0)
Worker (rank 1…N-1)
Reduced result
Partial activations
RPC Mode — llama.cpp --rpc (Production)
Client
POST /tensor/infer
{prompt, max_tokens}
rank 0
Coordinator
rpc_infer()
tensor_parallel.rs:341
llama-server
--rpc w1:9099,w2:9099
--tensor-split (by RAM)
--n-gpu-layers auto
weight
shards
rank 1
llama-rpc-server
worker A · port 9099
holds W/N weights
rank 2
llama-rpc-server
worker B · port 9099
holds W/N weights
token
out
Response
text returned
to client
Native AllReduce Mode — TNSR Wire Protocol (Phase 2)
Client
POST /tensor/infer
rank 0
Coordinator
loads W_col_slice
computes partial_0 = X @ W0
rank 1
Worker
loads W_col_slice
computes partial_1 = X @ W1
rank N-1
Worker
loads W_col_slice
computes partial_N = X @ WN
POST /tensor
/allreduce
TNSR frame
AllReduceCoordinator
accumulates world_size frames
sums f32 arrays element-wise
returns reduced tensor
Full activation
Σ partials
→ next layer input
TensorParallelPlan fields
plan_id            String
world_size         usize            # total ranks
local_rank         usize            # this node
backend            TpBackend        # Rpc | Native
rpc_port           u16              # worker listen
coord_server_port  u16              # coordinator
model_path         String
peers              Vec<TpPeer>

# TpPeer
rank         usize
node_id      String
address      String           # host:port
rpc_address  String           # host:rpc_port
ram_mb       u64              # tensor-split weight
gpu_backend  Option<String>
TNSR Wire Frame (native mode)
# Binary · little-endian
offset  size   field
0       4B     magic = b"TNSR"
4       1B     rank (u8)
5       1B     world_size (u8)
6       16B    request_id ([u8;16] UUID)
22      4B     element_count (u32 LE)
26      N×4B   f32 elements (LE)
Sent via POST /tensor/allreduce.
Coordinator waits for all world_size frames, then sums and returns the reduced frame.
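The frame layout can be reproduced with Python's struct module. This is a sketch of the wire layout as documented above, not the production Rust codec:

```python
import struct

MAGIC = b"TNSR"

def encode_tnsr(rank: int, world_size: int, request_id: bytes,
                elements: list[float]) -> bytes:
    """Pack a TNSR frame: magic, rank, world_size, 16-byte request id,
    element count, then the little-endian f32 payload."""
    assert len(request_id) == 16
    header = MAGIC + struct.pack("<BB", rank, world_size) + request_id
    header += struct.pack("<I", len(elements))
    return header + struct.pack(f"<{len(elements)}f", *elements)

def decode_tnsr(frame: bytes):
    """Unpack a TNSR frame back into its fields."""
    assert frame[:4] == MAGIC
    rank, world_size = struct.unpack_from("<BB", frame, 4)
    request_id = frame[6:22]
    (count,) = struct.unpack_from("<I", frame, 22)
    elements = list(struct.unpack_from(f"<{count}f", frame, 26))
    return rank, world_size, request_id, elements
```

A frame carrying N elements is exactly 26 + 4·N bytes; the coordinator sums the f32 payloads of all world_size frames element-wise before replying.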
Weight Splitting (RPC mode)
tensor_split_arg() — proportional to node RAM:
frac_i = ram_i / Σ ram

head_range() — attention heads split linearly:
heads_i = floor(total / N)
remainder heads → last rank

build_plan_from_peers() — auto-builds plan
from mesh peers; ranked by composite_score desc.

Source: linus-ai-inference/src/tensor_parallel.rs
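The two split rules above can be expressed in a few lines. This Python sketch mirrors the documented semantics of tensor_split_arg() and head_range(); the function names and signatures here are illustrative, not the Rust API:

```python
def tensor_split_fracs(ram_mb: list[int]) -> list[float]:
    """frac_i = ram_i / Σ ram — proportional shares for --tensor-split."""
    total = sum(ram_mb)
    return [r / total for r in ram_mb]

def head_range(rank: int, world_size: int, total_heads: int) -> range:
    """Linear head split: floor(total / N) heads per rank,
    remainder heads assigned to the last rank."""
    per = total_heads // world_size
    start = rank * per
    end = start + per if rank < world_size - 1 else total_heads
    return range(start, end)
```

With 32 attention heads over 3 ranks, ranks 0 and 1 each get 10 heads and rank 2 gets the remaining 12.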
Tensor Parallelism API
Method | Path               | Description                                 | Handler
GET    | /tensor/plan       | Return active plan or {"plan":null}         | tensor_plan_get()
POST   | /tensor/plan       | Set TensorParallelPlan                      | tensor_plan_set()
GET    | /tensor/status     | Plan summary + RPC worker health            | tensor_status()
POST   | /tensor/infer      | Run TP inference (coordinator only)         | tensor_infer()
POST   | /tensor/allreduce  | Submit TNSR partial frame; returns reduced  | tensor_allreduce()
POST   | /tensor/rpc/start  | Spawn llama-rpc-server subprocess           | tensor_rpc_start()
POST   | /tensor/rpc/stop   | Kill llama-rpc-server                       | tensor_rpc_stop()
Parallelism Strategy Comparison
Strategy   | Split axis          | Communication                     | Latency                | Best for
Local      | None                | None                              | Lowest                 | Model fits on 1 node
Tensor RPC | Weight matrix width | Parallel → AllReduce per block    | Low (same-clock nodes) | Large models, fast LAN, homogeneous nodes
Pipeline   | Layer depth         | Sequential A→B→C                  | Medium (serial hops)   | Model too large for any single node
Hybrid     | Both                | Pipeline groups + TP inside each  | Configurable           | 70B+ on heterogeneous mesh
Pipeline Parallelism — Multi-Machine Inference
Models larger than any single machine's RAM run across N mesh nodes. Each node holds a contiguous range of transformer layers. Activations flow node-to-node over HTTP using a compact binary wire protocol.
Head node
Mid node(s)
Tail node
Activation tensor
Inference Request Path
Client
POST /pipeline/infer
HEAD
Head Node
embed tokens
layers 0–K
activation
tensor
MID
Mid Node
layers K–M
POST /pipeline/forward
activation
tensor
TAIL
Tail Node
layers M–N
logits → token
Response
token / text
Layer Assignment (LayerPlanner)
1. Read GGUF tensor index (offsets only, no weights)
2. Enumerate all nodes: local + live peers
3. Compute RAM fraction per node
4. Assign floor(ram_frac × total_layers) layers
5. Distribute remainder to highest-RAM nodes
6. Assign roles: head | mid | tail
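Steps 3–5 of the LayerPlanner can be sketched directly. This is an illustrative Python version of the RAM-proportional assignment, with hypothetical names, not the production planner:

```python
def assign_layers(node_ram_mb: list[int], total_layers: int) -> list[int]:
    """Give each node floor(ram_frac * total_layers) layers, then hand
    the remainder out one layer at a time to the highest-RAM nodes."""
    total_ram = sum(node_ram_mb)
    counts = [(r * total_layers) // total_ram for r in node_ram_mb]
    remainder = total_layers - sum(counts)
    # indices sorted by RAM, largest first, take as many as we have spare layers
    for i in sorted(range(len(node_ram_mb)),
                    key=lambda i: -node_ram_mb[i])[:remainder]:
        counts[i] += 1
    return counts

def roles(n_nodes: int) -> list[str]:
    """Step 6: first node is head, last is tail, everything between is mid."""
    if n_nodes == 1:
        return ["head"]
    return ["head"] + ["mid"] * (n_nodes - 2) + ["tail"]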
Tensor Wire Format
# Binary frame (little-endian)
4B    magic = 0x4C4E5350   # "LNSP"
4B    seq_id (uint32)
4B    dtype                # 0=f32 1=f16
4B    ndim (uint32)
N×4B  shape (uint32 each)
M×B   raw tensor data
Sent over HTTP POST /pipeline/forward.
Meta JSON prepended as [4B len][JSON][frame].
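The full /pipeline/forward body ([4B len][JSON][frame]) can be assembled like this. A sketch of the documented layout, not the production codec; note the byte order of the 4-byte magic on the wire is assumed here to be the ASCII sequence "LNSP":

```python
import json
import struct

LNSP_MAGIC = b"LNSP"  # 0x4C4E5350; on-wire byte order assumed, not verified

def encode_forward_body(meta: dict, seq_id: int, dtype: int,
                        shape: list[int], raw: bytes) -> bytes:
    """Build a /pipeline/forward body: [4B meta len][meta JSON][binary frame]."""
    frame = LNSP_MAGIC
    frame += struct.pack("<III", seq_id, dtype, len(shape))
    frame += struct.pack(f"<{len(shape)}I", *shape)
    frame += raw
    meta_bytes = json.dumps(meta).encode()
    return struct.pack("<I", len(meta_bytes)) + meta_bytes + frame
```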
GGML Quantization Support
Type | Block    | Bytes/block
F32  | 1 elem   | 4
F16  | 1 elem   | 2
Q4_0 | 32 elem  | 18
Q8_0 | 32 elem  | 34
Q4_K | 256 elem | 144
Q6_K | 256 elem | 210
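The table translates directly into a size calculation, which is what a planner needs to know whether a layer's tensors fit in a node's RAM. The helper name is illustrative:

```python
# (elements per block, bytes per block) from the GGML quantization table above.
GGML_BLOCKS = {
    "F32": (1, 4), "F16": (1, 2),
    "Q4_0": (32, 18), "Q8_0": (32, 34),
    "Q4_K": (256, 144), "Q6_K": (256, 210),
}

def tensor_bytes(n_elements: int, qtype: str) -> int:
    """Size of a quantized tensor: ceil(n / block_elems) * bytes_per_block."""
    block_elems, block_bytes = GGML_BLOCKS[qtype]
    n_blocks = -(-n_elements // block_elems)  # ceiling division
    return n_blocks * block_bytes
```

A 4096×4096 weight matrix in Q4_0, for example, occupies 524,288 blocks of 18 bytes, roughly 9 MiB instead of 64 MiB at F32.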
KV Cache & Autoregressive Decode
Prefill phase
process all prompt tokens
fill KV cache on each node
Decode loop
1 new token per step
KV cache append only
KV Cache
per-node
per-request
cleared via /pipeline/clear
Token stream
returned to client
after each decode step
Pipeline API Endpoints
Method | Path              | Description
POST   | /pipeline/plan    | Configure plan from mesh peers + GGUF metadata
POST   | /pipeline/infer   | Head-node inference through full pipeline
POST   | /pipeline/forward | Node-to-node activation forwarding (binary body)
POST   | /pipeline/clear   | Clear KV cache for a request ID
GET    | /pipeline/plan    | Current plan status + mesh RAM snapshot
Internet Node Payment Gate
LAN and loopback callers are always free. Internet callers (public IP) must have an inference credit balance. New nodes receive 5 free welcome units. 1 unit = 1 inference call = up to 1,000 output tokens.
Inference request
POST /infer
IP check
_is_lan_address(caller_ip)
✓ LAN / loopback
192.168.x.x · 10.x.x.x
172.16–31.x.x · 127.x · ::1
169.254.x.x · fe80::
✓ Allowed
reason: lan_free
no deduction
✓ Balance ≥ 1
deduct 1 unit
reason: ok
✗ Balance = 0
402 Payment Required
reason: insufficient_balance:0
Account Lifecycle
First contact — account created with balance = WELCOME_UNITS (5)
Each inference — balance deducted by 1 unit
Provider credit — internet providers earn 1 unit per served request
Top-up — admin adds units via POST /billing/topup
Persistence — SQLite payment_ledger.db, survives restarts
Constants
INFERENCE_UNIT_COST = 1      # units per call
TOKENS_PER_UNIT     = 1_000  # output tokens
WELCOME_UNITS       = 5      # on new account
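The gate logic can be sketched as follows. This is an in-memory illustration of the documented behaviour (the real PaymentLedger is SQLite-backed and its check-and-deduct is atomic); the LAN ranges match the list above:

```python
import ipaddress

WELCOME_UNITS = 5
INFERENCE_UNIT_COST = 1

# RFC 1918 private, loopback, and link-local ranges from the diagram above.
LAN_NETS = [ipaddress.ip_network(n) for n in (
    "10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16",
    "127.0.0.0/8", "169.254.0.0/16", "::1/128", "fe80::/10",
)]

def is_lan_address(ip: str) -> bool:
    """True for LAN / loopback / link-local callers — always free."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in LAN_NETS)

def check_and_deduct(balances: dict[str, int], node_id: str, caller_ip: str):
    """Return (allowed, reason); deduct one unit for internet callers."""
    if is_lan_address(caller_ip):
        return True, "lan_free"
    # first contact: account created with the welcome balance
    balance = balances.setdefault(node_id, WELCOME_UNITS)
    if balance >= INFERENCE_UNIT_COST:
        balances[node_id] = balance - INFERENCE_UNIT_COST
        return True, "ok"
    return False, f"insufficient_balance:{balance}"
```

A new internet node therefore gets exactly five free calls before the server starts answering 402 Payment Required.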
Billing API
Method | Path           | Action
GET    | /billing       | All accounts
POST   | /billing/topup | Add units

# Top up a node
curl -X POST /billing/topup \
  -d '{
    "node_id": "abc123",
    "address": "203.0.113.5",
    "units": 50
  }'
Integration Points
Component                        | Role                                    | File
PaymentLedger.__init__           | Open/create SQLite DB, load cache       | blockchain/__init__.py
check_and_deduct()               | Atomic check + deduct                   | blockchain/__init__.py
_is_lan_address()                | RFC 1918 + loopback + link-local check  | blockchain/__init__.py
Payment gate in handle_request() | Intercepts POST /infer before routing   | main.py:915–944
_api_billing()                   | GET /billing handler                    | main.py
_api_billing_topup_async()       | POST /billing/topup handler             | main.py
Blockchain audit                 | Records every top-up event              | main.py → TransparencyLedger.record()
Mesh Discovery & Auto Model Management
Peers discover each other via UDP beacons. Each node probes newly discovered peers over HTTP to collect richer status. MeshModelManager evaluates the collective mesh every 30 s and automatically selects and loads the best model.
Node A starts
broadcasts UDP beacon
port 9481
Node B hears beacon
adds A to peer table
HTTP probe
GET /status → RAM, GPU,
models, accepting_inference
PeerInfo stored
node_id · address · ram_mb
gpu_backend · model_names
MeshModelManager
eval every 30 s
Collect peers
filter: RAM ≥ 3 GB
accepting_inference = true
Choose model
largest model that fits
collective RAM
(headroom 20%)
Single node fit
load model locally
Needs pipeline
configure PipelineOrchestrator
assign layer ranges
Hub scoring
compute_scores()
master_hub · hub · spoke · edge
Auto-push model
push best model to peers
with no models (≤ 8 GB limit)
Blockchain audit
every decision recorded
in TransparencyLedger
Hub Role Scoring
Score | Role       | Criteria
≥ 80  | master_hub | High RAM + GPU
≥ 60  | hub        | Good RAM or GPU
≥ 40  | spoke      | Medium capacity
< 40  | edge       | Limited RAM/GPU
Constants
MIN_PEER_RAM_MB   = 3_072  # 3 GB min
EVAL_INTERVAL_S   = 30     # re-eval rate
MAX_AUTO_PUSH     = 8 GB   # model push limit
RAM_HEADROOM_FRAC = 0.20   # 20% reserved
Launch Shell — In-App Terminal
The Launch tab embeds a browser-based terminal directly in the control panel. Commands run on the node hosting the LINUS-AI server. No separate terminal window required.
Launch tab
linus_ai_control_panel.html
shellRun() → fetch()
POST /shell/exec
JSON body:
{command, timeout_s?}
Safety filter
Block dangerous patterns
sh -c command
asyncio subprocess
stdout + stderr captured
JSON response
{ok, stdout, stderr,
exit_code}
Output rendered
appended to
#shellOutput div
Blocked Patterns (Safety Filter)
"rm -rf /"     # wipe root
"rm -rf ~"     # wipe home
"> /dev/sda"   # overwrite disk
"mkfs."        # format partition
":(){ :|:&};"  # fork bomb
"dd if="       # raw disk copy
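A minimal sketch of the filter, assuming plain substring matching against the blocklist above (the real implementation exists in both Python and Rust and may be stricter):

```python
# Blocklist from the table above; substring match is assumed here.
BLOCKED_PATTERNS = [
    "rm -rf /", "rm -rf ~", "> /dev/sda",
    "mkfs.", ":(){ :|:&};", "dd if=",
]

def is_command_safe(command: str) -> bool:
    """Reject any command containing a blocked pattern."""
    return not any(pattern in command for pattern in BLOCKED_PATTERNS)
```

Substring matching is deliberately blunt: "sudo rm -rf /tmp/x" is also rejected because it contains "rm -rf /", trading false positives for safety.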
Quick-Launch Buttons
📂 ls ~/.linus_ai/models/
💾 df -h
🧠 free -h || vm_stat
nvidia-smi
🔍 ps aux | grep llama
📡 curl -s localhost:9480/status
JS Features
↑ / ↓ arrow key command history
Enter key runs command
Clear button resets output
Timeout up to 120 s
stdout + stderr in separate colours
Dual implementation: Python + Rust
API Contract
# Request
POST /shell/exec
Content-Type: application/json
{
  "command": "ls -la models/",
  "timeout_s": 30
}

# Response
{
  "ok": true,
  "stdout": "total 12\n...",
  "stderr": "",
  "exit_code": 0
}
Blockchain Transparency Ledger
Every LINUS-AI decision — inference, job submit, peer discovery, billing top-up — is recorded in an append-only SHA-256 hash chain backed by SQLite. Blocks are committed every 30 s or when 100 records accumulate. Merkle proofs enable selective verification.
Event occurs
inference · job · peer
billing · hive
LedgerRecord
record_id (uuid)
data dict → SHA-256
Pending buffer
accumulate until
30 s or 100 records
Block sealing
Merkle root of record hashes
+ prev_block_hash
→ SHA-256 block_hash
SQLite persist
table: blocks
verify_chain() on read
Block Structure
block_number  INTEGER  PK
block_hash    TEXT     SHA-256
prev_hash     TEXT     chain link
merkle_root   TEXT     record summary
timestamp     REAL
node_id       TEXT
records_json  TEXT     JSON array
Merkle Proof
Given a record_hash, get_merkle_proof() returns the sibling-hash path from the record leaf to the block's Merkle root — enabling independent verification of any record without reading the entire chain.

SHA-256 over pairs: H(left || right)
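The pairwise H(left || right) construction can be sketched as follows. The handling of an odd node at a level (carried up unchanged) is an assumption here; the real ledger may duplicate it instead:

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves: list[bytes]) -> bytes:
    """Fold record hashes pairwise with SHA-256 over H(left || right)
    until a single root remains. Odd-node policy is assumed."""
    level = leaves[:]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(_h(level[i] + level[i + 1]))
        if len(level) % 2:        # odd leaf carried up unchanged (assumption)
            nxt.append(level[-1])
        level = nxt
    return level[0]
```

A Merkle proof for a leaf is then just the list of siblings met on the way up; a verifier re-hashes the leaf against each sibling and compares the result to the block's stored root.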
Audit Events Recorded
linus_ai_started
inference_completed
job_submitted
hive_training_requested
billing_topup
Rust Ledger (linus-ai-blockchain)
The Rust binary has its own Ledger in linus-ai-blockchain/src/lib.rs — same design, native Rust implementation with rusqlite. All significant runtime events are recorded there.
Layer  | Implementation                             | DB file
Python | blockchain/__init__.py TransparencyLedger  | ledger.db
Python | blockchain/__init__.py PaymentLedger       | payment_ledger.db
Rust   | linus-ai-blockchain/src/lib.rs Ledger      | linus_ai.db
Build Pipeline
./build.sh [target] [--llama-cpp2 | --candle-only] — single entry point for all targets and engines. The control panel HTML is embedded at compile time via include_str!(). Run ./build.sh list to see built targets and binary sizes.
Source tree
linus-ai-rs/
11 Rust crates
build.sh
target + engine
feature flags
cargo build --release
--features llama-bundled
or candle-only
include_str!()
control panel HTML
embedded in binary
linus_ai binary
single static executable
zero runtime deps
dist/
dist/<target>/linus_ai
dist/<target>-candle/
linus_ai
Inference Engine Selection
--llama-cpp2 (default)
Statically links llama.cpp (C++). Best performance.
GPU added automatically: Metal on macOS, CUDA when nvcc is on PATH.
Output: dist/<target>/linus_ai
⚠ Cannot cross-compile from macOS → Linux/Windows.
Use natively on each Linux/Windows machine, or switch to --candle-only.
--candle-only
Pure-Rust HuggingFace candle. Cross-compile from any host.
GPU: Metal on macOS, CUDA on Linux (auto-detected). ~20-30% slower than llama-cpp2.
Output: dist/<target>-candle/linus_ai
Requires tokenizer.json alongside each .gguf at runtime.
Safe for all CI cross-compilation pipelines.
Host × Target Compatibility
Target ↓ / Host → | macOS ARM64  | macOS x86_64 | Linux x86_64 | Linux ARM64  | Windows
macos-arm64       | native ✓     | llama/candle | candle only  | candle only  | candle only
macos-x86_64      | llama/candle | native ✓     | candle only  | candle only  | candle only
macos-universal   | llama/candle | llama/candle | candle only  | candle only  | —
linux-x86_64      | candle only  | candle only  | native ✓     | llama/candle | candle only
linux-arm64       | candle only  | candle only  | llama/candle | native ✓     | candle only
windows-x86_64    | candle only  | candle only  | candle only  | candle only  | native ✓
android-arm64     | NDK req.     | NDK req.     | NDK req.     | NDK req.     | —
ios-arm64         | Xcode req.   | Xcode req.   | —            | —            | —
ios-sim           | Xcode req.   | Xcode req.   | —            | —            | —
Crate Graph
linus_ai (bin)
├─ linus-ai-http         routes + handlers
├─ linus-ai-inference    engine + backends
│  ├─ bundled.rs         [llama-bundled]
│  ├─ candle_backend.rs  [candle-only]
│  └─ backend.rs         [subprocess fallback]
├─ linus-ai-net          mesh + peer
├─ linus-ai-blockchain   ledger
├─ linus-ai-thermal      governor
├─ linus-ai-task         scheduler
├─ linus-ai-vault        secret store
├─ linus-ai-guardian     auth
├─ linus-ai-launcher     tray app
└─ linus-ai-core         config + types
Build Commands
# Native (auto-detects machine)
./build.sh

# Specific target, default engine
./build.sh macos-arm64

# Slim build (pure Rust)
./build.sh macos-arm64 --candle-only

# Cross-compile Linux from macOS (candle required)
./build.sh linux-x86_64 --candle-only

# Fat binary (arm64 + x86_64)
./build.sh macos-universal

# All targets, cross-safe
./build.sh all --candle-only

# List targets + built sizes
./build.sh list

# Clean a target
./build.sh clean linux-x86_64
Pre-release Backup
# Timestamped source backup
bash backup.sh [label]

# Output: ~/Desktop/linus-ai-backups/
#   linus-ai-2026-03-14_label/
#     linus-ai-rs/
#     nomad/
# (excludes: target/ *.gguf .env)
Keeps the 20 most recent backups. Run ./build.sh list after building to verify output.
Test Suite
340 automated tests across 9 Python test modules and 2 Rust test blocks. Run from the project root with bash run_tests.sh.
Python tests: 286 (stdlib unittest via pytest_runner.py, stdlib shadow workaround)
Rust tests: 54 (cargo test: linus-ai-http 29 + linus-ai-inference 25)
Total: 340 (0 failures · 0 errors)
Suite                | File                                       | Tests | Coverage
Blockchain & Payment | test_blockchain_payment.py                 | 37    | _is_lan_address, NodeAccount, PaymentLedger, TransparencyLedger
Pipeline Codec       | test_pipeline_codec.py                     | 16    | TensorFrameCodec encode/decode, Dequantizer F32/F16/Q4_0/Q8_0
Pipeline Planner     | test_pipeline_planner.py                   | 18    | LayerPlanner 1/2/3 nodes, edge cases, to_dict()
Mesh Model Manager   | test_mesh_model_manager.py                 | 16    | _is_internet_ip, _compute_layer_ranges, get_status()
Shell & Payment Gate | test_api_shell.py                          | 24    | Shell exec via asyncio, all blocked patterns, payment gate
HTML Launch Tab      | test_html_launch_tab.py                    | 20    | Nav item, tab div, JS functions, CSS, existing tabs intact
HTML Parallelism UI  | test_html_parallelism_ui.py                | 84    | Inference Mode switcher, Tensor Parallelism card, Pipeline Parallelism card, JS functions, API endpoints
Flowchart Structure  | test_flowchart.py                          | 71    | Runtime Flow tab, Tensor Parallel tab, pane content, show() IDs, pre-existing panes intact
Rust HTTP            | linus-ai-http/src/lib.rs                   | 29    | Shell safety, timeout clamp, hub scoring, layer split, tensor plan/status/infer JSON shapes
Rust Tensor Parallel | linus-ai-inference/src/tensor_parallel.rs  | 25    | TpFrame encode/decode/round-trip, TensorParallelPlan methods (is_coordinator, local_fraction, rpc_peer_arg, tensor_split_arg, head_range), TpBackend enum, TNSR magic
Exec Instructions
# Install deps (once)
pip3 install --break-system-packages pytest numpy

# All tests
bash run_tests.sh

# Python only / Rust only
bash run_tests.sh python
bash run_tests.sh rust

# Single file
python3 tests/pytest_runner.py tests/test_blockchain_payment.py -v

# Keyword filter
bash run_tests.sh python -k payment

# Single test
python3 tests/pytest_runner.py \
  tests/test_pipeline_planner.py::TestLayerPlanner::test_three_nodes_roles -v
Test Infrastructure Notes
pytest_runner.py — stdlib shadow workaround
linus-ai ships a platform/ sub-package that shadows Python's stdlib platform module. pytest imports platform during its own startup — before conftest.py runs.

pytest_runner.py strips linus-ai paths from sys.path, pre-loads stdlib platform into sys.modules, then restores the path before handing off to pytest.main().
conftest.py — module stubs
_build_stubs() populates sys.modules with lightweight stubs for all linus_ai.* sub-packages that aren't being tested. Real modules (blockchain, ai.pipeline, ai.mesh_model_manager) are loaded by absolute file path via load_module().

This avoids the full Python runtime startup (config files, network sockets, etc.) while still testing real production code paths.