LINUS-AI is a dual-runtime local AI system: a Rust binary for production and a Python runtime for development. Both serve identical HTTP APIs and the same embedded control panel HTML.
End-to-end lifecycle of a user request — from browser input through the Inference Mode switcher to the correct backend. The mode is persisted in localStorage and restored on every page load.
1 — User Entry
Browser / Tray
http://localhost:9480/app
macOS · Windows · Linux · iOS · Android (remote)
→
Control Panel HTML
linus_ai_control_panel.html embedded in binary via include_str!()
→
setInferMode()
restored from localStorage key linus_ai_infer_mode
local · pipeline · tensor_rpc · tensor_native
→
Chat tab
user types a prompt, then clicks Send or presses Enter
2 — Inference Mode Routing (JS → API)
Local
Single-node inference. JS calls: POST /infer
or POST /infer/stream (SSE). Agentic: POST /agent/stream. Source: sendChat() in control panel
Pipeline
Layers distributed across N nodes. Setup: Apply Plan in Setup tab → POST /pipeline/plan. JS calls: POST /pipeline/infer. Intra-pipeline: POST /pipeline/forward. Source: applyPipelinePlan(), ai/pipeline.py
Tensor RPC
Weight matrices split via llama.cpp RPC. Setup: Apply Plan → POST /tensor/plan,
Start RPC Worker → POST /tensor/rpc/start. JS calls: POST /tensor/infer. Source: applyTensorPlan(), tensor_parallel.rs
Tensor Native (Phase 2)
AllReduce over HTTP, no external tools. Setup: Apply Plan → POST /tensor/plan (backend=Native). JS calls: POST /tensor/infer. AllReduce: POST /tensor/allreduce (TNSR frame). Source: native_allreduce_infer(), tensor_parallel.rs
3 — Server Dispatch (Rust · Python)
axum
HTTP Router
linus-ai-http/src/lib.rs routes all /infer* /tensor* /pipeline*
rpc_infer() → llama-server --rpc or native_allreduce_infer()
→
Response
JSON · SSE stream returned to browser
4 — Local Backend Priority (InferenceEngine)
InferenceEngine
engine.rs dispatch logic
→
1st
bundled llama.cpp
Metal / CUDA · feature: llama-bundled
2nd
candle (pure Rust)
Metal / CUDA · feature: candle-only
3rd
llama-server subprocess
external binary on PATH
4th
ollama subprocess
ollama serve on PATH
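The four-level fallback above can be sketched as a simple priority chain. This is an illustrative Python sketch, not the real engine.rs dispatch: the probe names and the lambda stand-ins are assumptions, with each probe returning None when its backend is unavailable.

```python
# Hypothetical sketch of the InferenceEngine fallback chain.
# Each probe returns a backend handle, or None if unavailable.
def pick_backend(probes):
    """Return the first available backend in priority order."""
    for name, probe in probes:
        backend = probe()
        if backend is not None:
            return name, backend
    raise RuntimeError("no inference backend available")

# Assumed probe results for a build without bundled engines:
probes = [
    ("bundled llama.cpp",    lambda: None),       # feature llama-bundled absent
    ("candle",               lambda: None),       # feature candle-only absent
    ("llama-server subproc", lambda: "subproc"),  # external binary on PATH
    ("ollama subproc",       lambda: "subproc"),
]

name, backend = pick_backend(probes)  # first probe that succeeds wins
```

With both compiled-in engines missing, dispatch falls through to the external llama-server subprocess, matching the 3rd priority level above.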
5 — Setup Tab → Mode Config (GUI Source Map)
Setup Card
Action
API endpoint
JS function
Source
Inference Mode
Click mode button
— (localStorage only)
setInferMode()
control panel JS
Tensor Parallelism
Apply Plan
POST /tensor/plan
applyTensorPlan()
tensor_parallel.rs
Tensor Parallelism
▶ Start RPC Worker
POST /tensor/rpc/start
tensorRpcStart()
TpRpcWorker::spawn()
Tensor Parallelism
⟳ Status
GET /tensor/status
loadTensorStatus()
tensor_status() handler
Pipeline Parallelism
Apply Plan
POST /pipeline/plan
applyPipelinePlan()
ai/pipeline.py PipelineOrchestrator
Pipeline Parallelism
⟳ Status
GET /pipeline/plan
loadPipelineStatus()
_handle_pipeline_plan()
Runtime Controls
Save
POST /settings
saveRuntimeSettings()
settings_save() handler
Tensor Parallelism — Weight Matrix Splitting
Every weight matrix in every transformer layer is split column-wise across N nodes. All nodes receive the same token and compute a 1/N partial result simultaneously. Results are summed (AllReduce) after each block. Rank 0 = coordinator — the only node that accepts client inference requests.
Coordinator (rank 0)
Worker (rank 1…N-1)
Reduced result
Partial activations
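The "1/N partial result, then sum" behaviour can be checked numerically. A minimal numpy sketch, assuming the split runs along the shared inner dimension of the matmul (the case where partials sum rather than concatenate); the real per-tensor layout in tensor_parallel.rs may differ:

```python
import numpy as np

# Minimal sketch of a tensor-parallel matmul with an AllReduce sum.
# The shared inner dimension of x @ W is split across N nodes: each
# node holds a slice of W (and the matching slice of x), computes a
# 1/N partial result, and the partials are summed.
rng = np.random.default_rng(0)
N = 4                            # number of nodes
x = rng.standard_normal((1, 8))
W = rng.standard_normal((8, 8))

# Each "node" computes a partial product on its slice of the inner dim.
partials = [x[:, i::N] @ W[i::N, :] for i in range(N)]

# AllReduce = elementwise sum of all partials, visible on every node.
reduced = sum(partials)
assert np.allclose(reduced, x @ W)  # matches the single-node result
```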
RPC Mode — llama.cpp --rpc (Production)
Client
POST /tensor/infer {prompt, max_tokens}
→
rank 0
Coordinator
rpc_infer() tensor_parallel.rs:341
↓
llama-server
--rpc w1:9099,w2:9099 --tensor-split (by RAM) --n-gpu-layers auto
build_plan_from_peers() — auto-builds the plan from mesh peers, ranked by composite_score descending.
Source: linus-ai-inference/src/tensor_parallel.rs
Tensor Parallelism API
Method
Path
Description
Handler
GET
/tensor/plan
Return active plan or {"plan":null}
tensor_plan_get()
POST
/tensor/plan
Set TensorParallelPlan
tensor_plan_set()
GET
/tensor/status
Plan summary + RPC worker health
tensor_status()
POST
/tensor/infer
Run TP inference (coordinator only)
tensor_infer()
POST
/tensor/allreduce
Submit TNSR partial frame; returns reduced
tensor_allreduce()
POST
/tensor/rpc/start
Spawn llama-rpc-server subprocess
tensor_rpc_start()
POST
/tensor/rpc/stop
Kill llama-rpc-server
tensor_rpc_stop()
Parallelism Strategy Comparison
Strategy
Split axis
Communication
Latency
Best for
Local
None
None
Lowest
Model fits on 1 node
Tensor RPC
Weight matrix width
Parallel → AllReduce per block
Low (same-clock nodes)
Large models, fast LAN, homogeneous nodes
Pipeline
Layer depth
Sequential A→B→C
Medium (serial hops)
Model too large for any single node
Hybrid
Both
Pipeline groups + TP inside each
Configurable
70B+ on heterogeneous mesh
Pipeline Parallelism — Multi-Machine Inference
Models larger than any single machine's RAM run across N mesh nodes. Each node holds a contiguous range of transformer layers. Activations flow node-to-node over HTTP using a compact binary wire protocol.
Head node
Mid node(s)
Tail node
Activation tensor
Inference Request Path
Client
POST /pipeline/infer
→
HEAD
Head Node
embed tokens layers 0–K
activation tensor→
MID
Mid Node
layers K–M POST /pipeline/forward
activation tensor→
TAIL
Tail Node
layers M–N logits → token
→
Response
token / text
Layer Assignment (LayerPlanner)
1. Read GGUF tensor index (offsets only, no weights)
2. Enumerate all nodes: local + live peers
3. Compute RAM fraction per node
4. Assign floor(ram_frac × total_layers) layers
5. Distribute remainder to highest-RAM nodes
6. Assign roles: head | mid | tail
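Steps 3–6 above can be sketched as follows. plan_layers() and its return shape are hypothetical, not the real LayerPlanner API; only the proportional-by-RAM assignment with remainder to the highest-RAM nodes is taken from the list above.

```python
import math

# Illustrative sketch of the LayerPlanner assignment steps.
def plan_layers(node_ram_gb, total_layers):
    total_ram = sum(node_ram_gb.values())
    # Step 4: floor(ram_frac x total_layers) per node.
    counts = {n: math.floor(r / total_ram * total_layers)
              for n, r in node_ram_gb.items()}
    # Step 5: distribute the remainder to the highest-RAM nodes.
    remainder = total_layers - sum(counts.values())
    for n in sorted(node_ram_gb, key=node_ram_gb.get, reverse=True)[:remainder]:
        counts[n] += 1
    # Step 6: contiguous layer ranges and head | mid | tail roles.
    plan, start = [], 0
    nodes = list(counts)
    for i, n in enumerate(nodes):
        role = "head" if i == 0 else "tail" if i == len(nodes) - 1 else "mid"
        plan.append((n, role, start, start + counts[n]))
        start += counts[n]
    return plan

plan = plan_layers({"a": 32, "b": 16, "c": 16}, 33)
# 'a' holds half the RAM, so it gets the extra remainder layer.
```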
Activation frames are sent over HTTP POST /pipeline/forward; the metadata JSON is prepended as [4B len][JSON][frame].
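A sketch of that [4B len][JSON][frame] layout using the stdlib struct module. The big-endian length prefix and the metadata field names are assumptions; the actual wire protocol may differ.

```python
import json
import struct

# Encode/decode sketch of the [4B len][JSON][frame] wire layout.
def encode_frame(meta: dict, payload: bytes) -> bytes:
    meta_bytes = json.dumps(meta).encode()
    return struct.pack(">I", len(meta_bytes)) + meta_bytes + payload

def decode_frame(buf: bytes):
    (meta_len,) = struct.unpack(">I", buf[:4])   # 4-byte length prefix
    meta = json.loads(buf[4:4 + meta_len])       # metadata JSON
    return meta, buf[4 + meta_len:]              # remaining binary frame

frame = encode_frame({"request_id": "r1", "layer": 12}, b"\x00\x01\x02")
meta, payload = decode_frame(frame)
assert meta["layer"] == 12 and payload == b"\x00\x01\x02"
```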
GGML Quantization Support
Type
Block
Bytes/block
F32
1 elem
4
F16
1 elem
2
Q4_0
32 elem
18
Q8_0
32 elem
34
Q4_K
256 elem
144
Q6_K
256 elem
210
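The effective storage cost per weight follows directly from the table: bytes per block divided by elements per block. A quick check in Python:

```python
# Effective bits per element for each GGML type in the table above:
# bytes_per_block / elements_per_block * 8.
types = {
    "F32":  (1, 4),      # (elements per block, bytes per block)
    "F16":  (1, 2),
    "Q4_0": (32, 18),
    "Q8_0": (32, 34),
    "Q4_K": (256, 144),
    "Q6_K": (256, 210),
}
bits = {name: nbytes / elems * 8 for name, (elems, nbytes) in types.items()}
# Q4_0 comes out at 4.5 bits/element: 4-bit values plus per-block
# scale metadata, which is why it is larger than a bare 4 bits.
```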
KV Cache & Autoregressive Decode
Prefill phase
process all prompt tokens fill KV cache on each node
↓
Decode loop
1 new token per step KV cache append only
→
KV Cache
per-node per-request cleared via /pipeline/clear
→
Token stream
returned to client after each decode step
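The prefill/decode split above can be sketched with a toy cache; fake_forward() stands in for a real transformer step, and the dict layout is illustrative, not the shipped cache structure.

```python
# Toy sketch of prefill vs. decode: prefill fills the per-request KV
# cache with every prompt token at once; the decode loop then appends
# exactly one entry per generated token.
kv_cache = {}  # request_id -> list of cached (k, v) entries

def fake_forward(token):
    return (f"k{token}", f"v{token}")    # stand-in for real K/V tensors

def prefill(request_id, prompt_tokens):
    kv_cache[request_id] = [fake_forward(t) for t in prompt_tokens]

def decode_step(request_id, new_token):
    kv_cache[request_id].append(fake_forward(new_token))  # append-only
    return new_token + 1                 # stand-in for sampling

prefill("r1", [10, 11, 12])              # prefill: 3 prompt tokens
tok = 100
for _ in range(3):                       # decode: 1 token per step
    tok = decode_step("r1", tok)

assert len(kv_cache["r1"]) == 6          # 3 prompt + 3 generated entries
kv_cache.pop("r1")                       # /pipeline/clear equivalent
```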
Pipeline API Endpoints
Method
Path
Description
POST
/pipeline/plan
Configure plan from mesh peers + GGUF metadata
POST
/pipeline/infer
Head-node inference through full pipeline
POST
/pipeline/forward
Node-to-node activation forwarding (binary body)
POST
/pipeline/clear
Clear KV cache for a request ID
GET
/pipeline/plan
Current plan status + mesh RAM snapshot
Internet Node Payment Gate
LAN and loopback callers are always free. Internet callers (public IP) must have an inference credit balance. New nodes receive 5 free welcome units. 1 unit = 1 inference call = up to 1,000 output tokens.
First contact — account created with balance = WELCOME_UNITS (5)
Each inference — balance deducted by 1 unit
Provider credit — internet providers earn 1 unit per served request
Top-up — admin adds units via POST /billing/topup
Persistence — SQLite payment_ledger.db, survives restarts
Constants
INFERENCE_UNIT_COST = 1   # units per call
TOKENS_PER_UNIT = 1_000   # output tokens
WELCOME_UNITS = 5         # on new account
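check_and_deduct() appears in the Integration Points table below; this is a hedged reimplementation of its account-lifecycle behaviour against an in-memory SQLite DB, not the shipped code in blockchain/__init__.py.

```python
import sqlite3

INFERENCE_UNIT_COST = 1
WELCOME_UNITS = 5

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (node_id TEXT PRIMARY KEY, balance INTEGER)")

def check_and_deduct(node_id: str) -> bool:
    """Atomically check balance and deduct one unit; create on first contact."""
    with db:  # one transaction per call
        row = db.execute("SELECT balance FROM accounts WHERE node_id=?",
                         (node_id,)).fetchone()
        if row is None:  # first contact: seed with welcome units
            db.execute("INSERT INTO accounts VALUES (?, ?)",
                       (node_id, WELCOME_UNITS))
            row = (WELCOME_UNITS,)
        if row[0] < INFERENCE_UNIT_COST:
            return False  # out of credit: reject the call
        db.execute("UPDATE accounts SET balance = balance - ? WHERE node_id=?",
                   (INFERENCE_UNIT_COST, node_id))
        return True

assert all(check_and_deduct("abc123") for _ in range(5))  # 5 welcome units
assert not check_and_deduct("abc123")                     # 6th call refused
```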
Billing API
Method
Path
Action
GET
/billing
All accounts
POST
/billing/topup
Add units
# Top up a node
curl -X POST http://localhost:9480/billing/topup \
-d '{
"node_id": "abc123",
"address": "203.0.113.5",
"units": 50
}'
Integration Points
Component
Role
File
PaymentLedger.__init__
Open/create SQLite DB, load cache
blockchain/__init__.py
check_and_deduct()
Atomic check + deduct
blockchain/__init__.py
_is_lan_address()
RFC1918 + loopback + link-local check
blockchain/__init__.py
Payment gate in handle_request()
Intercepts POST /infer before routing
main.py:915–944
_api_billing()
GET /billing handler
main.py
_api_billing_topup_async()
POST /billing/topup handler
main.py
Blockchain audit
Records every top-up event
main.py → TransparencyLedger.record()
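The _is_lan_address() check (RFC 1918 + loopback + link-local) maps directly onto the stdlib ipaddress module; a minimal sketch, not the implementation in blockchain/__init__.py:

```python
import ipaddress

# Free-tier check: LAN, loopback, and link-local callers bypass billing.
def is_lan_address(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return ip.is_private or ip.is_loopback or ip.is_link_local

assert is_lan_address("192.168.1.10")     # RFC 1918 -> free
assert is_lan_address("127.0.0.1")        # loopback -> free
assert is_lan_address("169.254.0.5")      # link-local -> free
assert not is_lan_address("8.8.8.8")      # public internet -> billed
```

Note that Python's is_private also covers 10/8 and 172.16/12, so all three RFC 1918 blocks are caught by the first test alone.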
Mesh Discovery & Auto Model Management
Peers discover each other via UDP beacons. Each node probes newly discovered peers over HTTP to collect richer status. MeshModelManager evaluates the collective mesh every 30 s and automatically selects and loads the best model.
Node A starts
broadcasts UDP beacon port 9481
↓
Node B hears beacon
adds A to peer table
↓
HTTP probe
GET /status → RAM, GPU, models, accepting_inference
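The beacon handshake above, sketched with stdlib sockets. A loopback datagram on an ephemeral port stands in for the real broadcast on port 9481, and the beacon field names are illustrative:

```python
import json
import socket

# Illustrative beacon payload (field names are assumptions).
BEACON = {"node_id": "node-a", "http_port": 9480}

# "Node B" listens where a beacon will arrive.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))        # ephemeral port for the demo
port = listener.getsockname()[1]

# "Node A" broadcasts its beacon (here: a plain loopback datagram).
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(json.dumps(BEACON).encode(), ("127.0.0.1", port))

# "Node B" hears the beacon and adds A to its peer table.
data, addr = listener.recvfrom(4096)
peer = json.loads(data)
peers = {peer["node_id"]: addr[0]}
# Next step: HTTP probe GET /status on peer["http_port"].
listener.close(); sender.close()
```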
The Launch tab embeds a browser-based terminal directly in the control panel. Commands run on the node hosting the LINUS-AI server. No separate terminal window required.
Launch tab
linus_ai_control_panel.html shellRun() → fetch()
→
POST /shell/exec
JSON body: {command, timeout_s?}
→
Safety filter
Block dangerous patterns
↓
sh -c command
asyncio subprocess stdout + stderr captured
→
JSON response
{ok, stdout, stderr, exit_code}
→
Output rendered
appended to #shellOutput div
Blocked Patterns (Safety Filter)
"rm -rf /"    # wipe root
"rm -rf ~"    # wipe home
"> /dev/sda"  # overwrite disk
"mkfs."       # format partition
":(){ :|:&};" # fork bomb
"dd if="      # raw disk copy
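A plausible shape for the filter is a substring denylist checked before the command reaches sh -c; a sketch, assuming the patterns above are matched as plain substrings:

```python
# Safety-filter sketch: reject a command if it contains any blocked
# substring (pattern list copied from above).
BLOCKED = ["rm -rf /", "rm -rf ~", "> /dev/sda", "mkfs.",
           ":(){ :|:&};", "dd if="]

def is_blocked(command: str) -> bool:
    return any(pat in command for pat in BLOCKED)

assert is_blocked("sudo rm -rf / --no-preserve-root")
assert is_blocked("dd if=/dev/zero of=/dev/sda")
assert not is_blocked("ls -la /tmp")
```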
↑ / ↓ arrow key command history
Enter key runs command
Clear button resets output
Timeout up to 120 s
stdout + stderr in separate colours
Dual implementation: Python + Rust
Every LINUS-AI decision — inference, job submit, peer discovery, billing top-up — is recorded in an append-only SHA-256 hash chain backed by SQLite. Blocks are committed every 30 s or when 100 records accumulate. Merkle proofs enable selective verification.
Event occurs
inference · job · peer billing · hive
→
LedgerRecord
record_id (uuid) data dict → SHA-256
→
Pending buffer
accumulate until 30 s or 100 records
→
Block sealing
Merkle root of record hashes + prev_block_hash → SHA-256 block_hash
→
SQLite persist
table: blocks verify_chain() on read
Block Structure
block_number INTEGER PK
block_hash TEXT SHA-256
prev_hash TEXT chain link
merkle_root TEXT record summary
timestamp REAL
node_id TEXT
records_json TEXT JSON array
Merkle Proof
Given a record_hash, get_merkle_proof() returns the sibling-hash path from the record leaf to the block's Merkle root — enabling independent verification of any record without reading the entire chain.
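The verification step can be sketched end to end. The pairing rule (duplicate the last leaf on odd levels) and the (sibling_hash, is_left) proof shape are assumptions; get_merkle_proof() in the real ledger may pair leaves differently.

```python
import hashlib

def h(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

# Build a Merkle root over record hashes.
def merkle_root(leaves):
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:           # duplicate last leaf on odd levels
            level.append(level[-1])
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

# Walk the sibling-hash path from a record leaf up to the root.
def verify(leaf, proof, root):
    """proof: list of (sibling_hash, is_left) from leaf to root."""
    acc = leaf
    for sibling, is_left in proof:
        acc = h(((sibling + acc) if is_left else (acc + sibling)).encode())
    return acc == root

leaves = [h(f"record-{i}".encode()) for i in range(4)]
root = merkle_root(leaves)
# Proof for leaf 2: sibling leaf 3 (right), then hash of (0,1) (left).
proof = [(leaves[3], False), (h((leaves[0] + leaves[1]).encode()), True)]
assert verify(leaves[2], proof, root)   # record verified without full chain
```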
The Rust binary has its own Ledger in linus-ai-blockchain/src/lib.rs — same design, native Rust implementation with rusqlite. All significant runtime events are recorded there.
Layer
Implementation
DB file
Python
blockchain/__init__.py TransparencyLedger
ledger.db
Python
blockchain/__init__.py PaymentLedger
payment_ledger.db
Rust
linus-ai-blockchain/src/lib.rs Ledger
linus_ai.db
Build Pipeline
./build.sh [target] [--llama-cpp2 | --candle-only] — single entry point for all targets and engines. The control panel HTML is embedded at compile time via include_str!(). Run ./build.sh list to see built targets and binary sizes.
--llama-cpp2
Statically links llama.cpp (C++). Best performance.
GPU added automatically: Metal on macOS, CUDA when nvcc is on PATH.
Output: dist/<target>/linus_ai ⚠ Cannot cross-compile from macOS → Linux/Windows.
Use natively on each Linux/Windows machine, or switch to --candle-only.
--candle-only
Pure-Rust HuggingFace candle. Cross-compile from any host.
GPU: Metal on macOS, CUDA on Linux (auto-detected). ~20-30% slower than llama-cpp2.
Output: dist/<target>-candle/linus_ai
Requires tokenizer.json alongside each .gguf at runtime.
Safe for all CI cross-compilation pipelines.
# Install deps (once)
pip3 install --break-system-packages pytest numpy
# All tests
bash run_tests.sh
# Python only / Rust only
bash run_tests.sh python
bash run_tests.sh rust
# Single file
python3 tests/pytest_runner.py tests/test_blockchain_payment.py -v
# Keyword filter
bash run_tests.sh python -k payment
# Single test
python3 tests/pytest_runner.py \
tests/test_pipeline_planner.py::TestLayerPlanner::test_three_nodes_roles -v
Test Infrastructure Notes
pytest_runner.py — stdlib shadow workaround
linus-ai ships a platform/ sub-package that shadows Python's stdlib platform module. pytest imports platform during its own startup — before conftest.py runs.
pytest_runner.py strips linus-ai paths from sys.path, pre-loads stdlib platform into sys.modules, then restores the path before handing off to pytest.main().
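The strip/pre-load/restore sequence looks roughly like this; a sketch of the technique, with the "linus-ai" path substring as an illustrative filter rather than the runner's exact logic:

```python
import sys

# Make the stdlib platform module win the import race, then restore
# sys.path before handing control to pytest.main().
saved = sys.path[:]
sys.path = [p for p in sys.path if "linus-ai" not in p]  # strip shadow paths
sys.modules.pop("platform", None)    # drop any already-shadowed import
import platform                      # now resolves to the stdlib module
assert hasattr(platform, "system")   # sanity check: real stdlib API
sys.path = saved                     # restore before pytest starts
```

Because sys.modules caches the stdlib version, later imports of platform inside pytest resolve to it even after the linus-ai paths return.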
conftest.py — module stubs
_build_stubs() populates sys.modules with lightweight stubs for all linus_ai.* sub-packages that aren't being tested. Real modules (blockchain, ai.pipeline, ai.mesh_model_manager) are loaded by absolute file path via load_module().
This avoids the full Python runtime startup (config files, network sockets, etc.) while still testing real production code paths.