LINUS-AI is a dual-runtime local AI system: a Rust binary for production and a Python runtime for development. Both serve identical HTTP APIs and the same embedded control panel HTML.
End-to-end lifecycle of a user request — from browser input through the Inference Mode switcher to the correct backend. The mode is persisted in localStorage and restored on every page load.
1 — User Entry
Browser / Tray
http://localhost:9480/app
macOS · Windows · Linux · iOS · Android (remote)
→
Control Panel HTML
linus_ai_control_panel.html embedded in binary via include_str!()
→
setInferMode()
restored from localStorage key linus_ai_infer_mode
local · pipeline · tensor_rpc · tensor_native
→
Chat tab
user types a prompt, then clicks Send or presses Enter
2 — Inference Mode Routing (JS → API)
Local
Single-node inference. JS calls: POST /infer
or POST /infer/stream (SSE). Agentic: POST /agent/stream. Source: sendChat() in control panel
Pipeline
Layers distributed across N nodes. Setup: Apply Plan in Setup tab → POST /pipeline/plan. JS calls: POST /pipeline/infer. Intra-pipeline: POST /pipeline/forward. Source: applyPipelinePlan(), ai/pipeline.py
Tensor RPC
Weight matrices split via llama.cpp RPC. Setup: Apply Plan → POST /tensor/plan,
Start RPC Worker → POST /tensor/rpc/start. JS calls: POST /tensor/infer. Source: applyTensorPlan(), tensor_parallel.rs
Tensor Native (Phase 2)
AllReduce over HTTP, no external tools. Setup: Apply Plan → POST /tensor/plan (backend=Native). JS calls: POST /tensor/infer. AllReduce: POST /tensor/allreduce (TNSR frame). Source: native_allreduce_infer(), tensor_parallel.rs
3 — Server Dispatch (Rust · Python)
axum
HTTP Router
linus-ai-http/src/lib.rs routes all /infer* /tensor* /pipeline*
rpc_infer() → llama-server --rpc or native_allreduce_infer()
→
Response
JSON · SSE stream returned to browser
4 — Local Backend Priority (InferenceEngine)
InferenceEngine
engine.rs dispatch logic
→
1st
bundled llama.cpp
Metal / CUDA · feature: llama-bundled
2nd
candle (pure Rust)
Metal / CUDA · feature: candle-only
3rd
llama-server subprocess
external binary on PATH
4th
ollama subprocess
ollama serve on PATH
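The four-level fallback above can be sketched as a simple priority chain. This is an illustrative Python sketch, not the real engine.rs dispatch: the probe names and the lambda stand-ins are assumptions, with each probe returning None when its backend is unavailable.

```python
# Hypothetical sketch of the InferenceEngine fallback chain.
# Each probe returns a backend handle, or None if unavailable.
def pick_backend(probes):
    """Return the first available backend in priority order."""
    for name, probe in probes:
        backend = probe()
        if backend is not None:
            return name, backend
    raise RuntimeError("no inference backend available")

# Assumed probe results for a build without bundled engines:
probes = [
    ("bundled llama.cpp",    lambda: None),       # feature llama-bundled absent
    ("candle",               lambda: None),       # feature candle-only absent
    ("llama-server subproc", lambda: "subproc"),  # external binary on PATH
    ("ollama subproc",       lambda: "subproc"),
]

name, backend = pick_backend(probes)  # first probe that succeeds wins
```

With both compiled-in engines missing, dispatch falls through to the external llama-server subprocess, matching the 3rd priority level above.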
5 — Setup Tab → Mode Config (GUI Source Map)
Setup Card
Action
API endpoint
JS function
Source
Inference Mode
Click mode button
— (localStorage only)
setInferMode()
control panel JS
Tensor Parallelism
Apply Plan
POST /tensor/plan
applyTensorPlan()
tensor_parallel.rs
Tensor Parallelism
▶ Start RPC Worker
POST /tensor/rpc/start
tensorRpcStart()
TpRpcWorker::spawn()
Tensor Parallelism
⟳ Status
GET /tensor/status
loadTensorStatus()
tensor_status() handler
Pipeline Parallelism
Apply Plan
POST /pipeline/plan
applyPipelinePlan()
ai/pipeline.py PipelineOrchestrator
Pipeline Parallelism
⟳ Status
GET /pipeline/plan
loadPipelineStatus()
_handle_pipeline_plan()
Runtime Controls
Save
POST /settings
saveRuntimeSettings()
settings_save() handler
Tensor Parallelism — Weight Matrix Splitting
Every weight matrix in every transformer layer is split column-wise across N nodes. All nodes receive the same token and compute a 1/N partial result simultaneously. Results are summed (AllReduce) after each block. Rank 0 = coordinator — the only node that accepts client inference requests.
Coordinator (rank 0)
Worker (rank 1…N-1)
Reduced result
Partial activations
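The "1/N partial result, then sum" behaviour can be checked numerically. A minimal numpy sketch, assuming the split runs along the shared inner dimension of the matmul (the case where partials sum rather than concatenate); the real per-tensor layout in tensor_parallel.rs may differ:

```python
import numpy as np

# Minimal sketch of a tensor-parallel matmul with an AllReduce sum.
# The shared inner dimension of x @ W is split across N nodes: each
# node holds a slice of W (and the matching slice of x), computes a
# 1/N partial result, and the partials are summed.
rng = np.random.default_rng(0)
N = 4                            # number of nodes
x = rng.standard_normal((1, 8))
W = rng.standard_normal((8, 8))

# Each "node" computes a partial product on its slice of the inner dim.
partials = [x[:, i::N] @ W[i::N, :] for i in range(N)]

# AllReduce = elementwise sum of all partials, visible on every node.
reduced = sum(partials)
assert np.allclose(reduced, x @ W)  # matches the single-node result
```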
RPC Mode — llama.cpp --rpc (Production)
Client
POST /tensor/infer {prompt, max_tokens}
→
rank 0
Coordinator
rpc_infer() tensor_parallel.rs:341
↓
llama-server
--rpc w1:9099,w2:9099 --tensor-split (by RAM) --n-gpu-layers auto
build_plan_from_peers() — auto-builds the plan from mesh peers, ranked by composite_score descending.
Source: linus-ai-inference/src/tensor_parallel.rs
Tensor Parallelism API
Method
Path
Description
Handler
GET
/tensor/plan
Return active plan or {"plan":null}
tensor_plan_get()
POST
/tensor/plan
Set TensorParallelPlan
tensor_plan_set()
GET
/tensor/status
Plan summary + RPC worker health
tensor_status()
POST
/tensor/infer
Run TP inference (coordinator only)
tensor_infer()
POST
/tensor/allreduce
Submit TNSR partial frame; returns reduced
tensor_allreduce()
POST
/tensor/rpc/start
Spawn llama-rpc-server subprocess
tensor_rpc_start()
POST
/tensor/rpc/stop
Kill llama-rpc-server
tensor_rpc_stop()
Parallelism Strategy Comparison
Strategy
Split axis
Communication
Latency
Best for
Local
None
None
Lowest
Model fits on 1 node
Tensor RPC
Weight matrix width
Parallel → AllReduce per block
Low (same-clock nodes)
Large models, fast LAN, homogeneous nodes
Pipeline
Layer depth
Sequential A→B→C
Medium (serial hops)
Model too large for any single node
Hybrid
Both
Pipeline groups + TP inside each
Configurable
70B+ on heterogeneous mesh
Pipeline Parallelism — Multi-Machine Inference
Models larger than any single machine's RAM run across N mesh nodes. Each node holds a contiguous range of transformer layers. Activations flow node-to-node over HTTP using a compact binary wire protocol.
Head node
Mid node(s)
Tail node
Activation tensor
Inference Request Path
Client
POST /pipeline/infer
→
HEAD
Head Node
embed tokens layers 0–K
activation tensor→
MID
Mid Node
layers K–M POST /pipeline/forward
activation tensor→
TAIL
Tail Node
layers M–N logits → token
→
Response
token / text
Layer Assignment (LayerPlanner)
1. Read GGUF tensor index (offsets only, no weights)
2. Enumerate all nodes: local + live peers
3. Compute RAM fraction per node
4. Assign floor(ram_frac × total_layers) layers
5. Distribute remainder to highest-RAM nodes
6. Assign roles: head | mid | tail
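Steps 3–6 above can be sketched as follows. plan_layers() and its return shape are hypothetical, not the real LayerPlanner API; only the proportional-by-RAM assignment with remainder to the highest-RAM nodes is taken from the list above.

```python
import math

# Illustrative sketch of the LayerPlanner assignment steps.
def plan_layers(node_ram_gb, total_layers):
    total_ram = sum(node_ram_gb.values())
    # Step 4: floor(ram_frac x total_layers) per node.
    counts = {n: math.floor(r / total_ram * total_layers)
              for n, r in node_ram_gb.items()}
    # Step 5: distribute the remainder to the highest-RAM nodes.
    remainder = total_layers - sum(counts.values())
    for n in sorted(node_ram_gb, key=node_ram_gb.get, reverse=True)[:remainder]:
        counts[n] += 1
    # Step 6: contiguous layer ranges and head | mid | tail roles.
    plan, start = [], 0
    nodes = list(counts)
    for i, n in enumerate(nodes):
        role = "head" if i == 0 else "tail" if i == len(nodes) - 1 else "mid"
        plan.append((n, role, start, start + counts[n]))
        start += counts[n]
    return plan

plan = plan_layers({"a": 32, "b": 16, "c": 16}, 33)
# 'a' holds half the RAM, so it gets the extra remainder layer.
```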
Activation frames are sent over HTTP POST /pipeline/forward; the metadata JSON is prepended as [4B len][JSON][frame].
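A sketch of that [4B len][JSON][frame] layout using the stdlib struct module. The big-endian length prefix and the metadata field names are assumptions; the actual wire protocol may differ.

```python
import json
import struct

# Encode/decode sketch of the [4B len][JSON][frame] wire layout.
def encode_frame(meta: dict, payload: bytes) -> bytes:
    meta_bytes = json.dumps(meta).encode()
    return struct.pack(">I", len(meta_bytes)) + meta_bytes + payload

def decode_frame(buf: bytes):
    (meta_len,) = struct.unpack(">I", buf[:4])   # 4-byte length prefix
    meta = json.loads(buf[4:4 + meta_len])       # metadata JSON
    return meta, buf[4 + meta_len:]              # remaining binary frame

frame = encode_frame({"request_id": "r1", "layer": 12}, b"\x00\x01\x02")
meta, payload = decode_frame(frame)
assert meta["layer"] == 12 and payload == b"\x00\x01\x02"
```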
GGML Quantization Support
Type
Block
Bytes/block
F32
1 elem
4
F16
1 elem
2
Q4_0
32 elem
18
Q8_0
32 elem
34
Q4_K
256 elem
144
Q6_K
256 elem
210
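The effective storage cost per weight follows directly from the table: bytes per block divided by elements per block. A quick check in Python:

```python
# Effective bits per element for each GGML type in the table above:
# bytes_per_block / elements_per_block * 8.
types = {
    "F32":  (1, 4),      # (elements per block, bytes per block)
    "F16":  (1, 2),
    "Q4_0": (32, 18),
    "Q8_0": (32, 34),
    "Q4_K": (256, 144),
    "Q6_K": (256, 210),
}
bits = {name: nbytes / elems * 8 for name, (elems, nbytes) in types.items()}
# Q4_0 comes out at 4.5 bits/element: 4-bit values plus per-block
# scale metadata, which is why it is larger than a bare 4 bits.
```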
KV Cache & Autoregressive Decode
Prefill phase
process all prompt tokens fill KV cache on each node
↓
Decode loop
1 new token per step KV cache append only
→
KV Cache
per-node per-request cleared via /pipeline/clear
→
Token stream
returned to client after each decode step
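The prefill/decode split above can be sketched with a toy cache; fake_forward() stands in for a real transformer step, and the dict layout is illustrative, not the shipped cache structure.

```python
# Toy sketch of prefill vs. decode: prefill fills the per-request KV
# cache with every prompt token at once; the decode loop then appends
# exactly one entry per generated token.
kv_cache = {}  # request_id -> list of cached (k, v) entries

def fake_forward(token):
    return (f"k{token}", f"v{token}")    # stand-in for real K/V tensors

def prefill(request_id, prompt_tokens):
    kv_cache[request_id] = [fake_forward(t) for t in prompt_tokens]

def decode_step(request_id, new_token):
    kv_cache[request_id].append(fake_forward(new_token))  # append-only
    return new_token + 1                 # stand-in for sampling

prefill("r1", [10, 11, 12])              # prefill: 3 prompt tokens
tok = 100
for _ in range(3):                       # decode: 1 token per step
    tok = decode_step("r1", tok)

assert len(kv_cache["r1"]) == 6          # 3 prompt + 3 generated entries
kv_cache.pop("r1")                       # /pipeline/clear equivalent
```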
Pipeline API Endpoints
Method
Path
Description
POST
/pipeline/plan
Configure plan from mesh peers + GGUF metadata
POST
/pipeline/infer
Head-node inference through full pipeline
POST
/pipeline/forward
Node-to-node activation forwarding (binary body)
POST
/pipeline/clear
Clear KV cache for a request ID
GET
/pipeline/plan
Current plan status + mesh RAM snapshot
Internet Node Payment Gate
LAN and loopback callers are always free. Internet callers (public IP) must have an inference credit balance. New nodes receive 5 free welcome units. 1 unit = 1 inference call = up to 1,000 output tokens.
First contact — account created with balance = WELCOME_UNITS (5)
Each inference — balance deducted by 1 unit
Provider credit — internet providers earn 1 unit per served request
Top-up — admin adds units via POST /billing/topup
Persistence — SQLite payment_ledger.db, survives restarts
Constants
INFERENCE_UNIT_COST = 1   # units per call
TOKENS_PER_UNIT = 1_000   # output tokens
WELCOME_UNITS = 5         # on new account
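check_and_deduct() appears in the Integration Points table below; this is a hedged reimplementation of its account-lifecycle behaviour against an in-memory SQLite DB, not the shipped code in blockchain/__init__.py.

```python
import sqlite3

INFERENCE_UNIT_COST = 1
WELCOME_UNITS = 5

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (node_id TEXT PRIMARY KEY, balance INTEGER)")

def check_and_deduct(node_id: str) -> bool:
    """Atomically check balance and deduct one unit; create on first contact."""
    with db:  # one transaction per call
        row = db.execute("SELECT balance FROM accounts WHERE node_id=?",
                         (node_id,)).fetchone()
        if row is None:  # first contact: seed with welcome units
            db.execute("INSERT INTO accounts VALUES (?, ?)",
                       (node_id, WELCOME_UNITS))
            row = (WELCOME_UNITS,)
        if row[0] < INFERENCE_UNIT_COST:
            return False  # out of credit: reject the call
        db.execute("UPDATE accounts SET balance = balance - ? WHERE node_id=?",
                   (INFERENCE_UNIT_COST, node_id))
        return True

assert all(check_and_deduct("abc123") for _ in range(5))  # 5 welcome units
assert not check_and_deduct("abc123")                     # 6th call refused
```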
Billing API
Method
Path
Action
GET
/billing
All accounts
POST
/billing/topup
Add units
# Top up a node
curl -X POST http://localhost:9480/billing/topup \
-d '{
"node_id": "abc123",
"address": "203.0.113.5",
"units": 50
}'
Integration Points
Component
Role
File
PaymentLedger.__init__
Open/create SQLite DB, load cache
blockchain/__init__.py
check_and_deduct()
Atomic check + deduct
blockchain/__init__.py
_is_lan_address()
RFC1918 + loopback + link-local check
blockchain/__init__.py
Payment gate in handle_request()
Intercepts POST /infer before routing
main.py:915–944
_api_billing()
GET /billing handler
main.py
_api_billing_topup_async()
POST /billing/topup handler
main.py
Blockchain audit
Records every top-up event
main.py → TransparencyLedger.record()
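The _is_lan_address() check (RFC 1918 + loopback + link-local) maps directly onto the stdlib ipaddress module; a minimal sketch, not the implementation in blockchain/__init__.py:

```python
import ipaddress

# Free-tier check: LAN, loopback, and link-local callers bypass billing.
def is_lan_address(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return ip.is_private or ip.is_loopback or ip.is_link_local

assert is_lan_address("192.168.1.10")     # RFC 1918 -> free
assert is_lan_address("127.0.0.1")        # loopback -> free
assert is_lan_address("169.254.0.5")      # link-local -> free
assert not is_lan_address("8.8.8.8")      # public internet -> billed
```

Note that Python's is_private also covers 10/8 and 172.16/12, so all three RFC 1918 blocks are caught by the first test alone.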
Mesh Discovery & Auto Model Management
Peers discover each other via UDP beacons. Each node probes newly discovered peers over HTTP to collect richer status. MeshModelManager evaluates the collective mesh every 30 s and automatically selects and loads the best model.
Node A starts
broadcasts UDP beacon port 9481
↓
Node B hears beacon
adds A to peer table
↓
HTTP probe
GET /status → RAM, GPU, models, accepting_inference
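The beacon handshake above, sketched with stdlib sockets. A loopback datagram on an ephemeral port stands in for the real broadcast on port 9481, and the beacon field names are illustrative:

```python
import json
import socket

# Illustrative beacon payload (field names are assumptions).
BEACON = {"node_id": "node-a", "http_port": 9480}

# "Node B" listens where a beacon will arrive.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.bind(("127.0.0.1", 0))        # ephemeral port for the demo
port = listener.getsockname()[1]

# "Node A" broadcasts its beacon (here: a plain loopback datagram).
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(json.dumps(BEACON).encode(), ("127.0.0.1", port))

# "Node B" hears the beacon and adds A to its peer table.
data, addr = listener.recvfrom(4096)
peer = json.loads(data)
peers = {peer["node_id"]: addr[0]}
# Next step: HTTP probe GET /status on peer["http_port"].
listener.close(); sender.close()
```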
The Launch tab embeds a browser-based terminal directly in the control panel. Commands run on the node hosting the LINUS-AI server. No separate terminal window required.
Launch tab
linus_ai_control_panel.html shellRun() → fetch()
→
POST /shell/exec
JSON body: {command, timeout_s?}
→
Safety filter
Block dangerous patterns
↓
sh -c command
asyncio subprocess stdout + stderr captured
→
JSON response
{ok, stdout, stderr, exit_code}
→
Output rendered
appended to #shellOutput div
Blocked Patterns (Safety Filter)
"rm -rf /"    # wipe root
"rm -rf ~"    # wipe home
"> /dev/sda"  # overwrite disk
"mkfs."       # format partition
":(){ :|:&};" # fork bomb
"dd if="      # raw disk copy
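A plausible shape for the filter is a substring denylist checked before the command reaches sh -c; a sketch, assuming the patterns above are matched as plain substrings:

```python
# Safety-filter sketch: reject a command if it contains any blocked
# substring (pattern list copied from above).
BLOCKED = ["rm -rf /", "rm -rf ~", "> /dev/sda", "mkfs.",
           ":(){ :|:&};", "dd if="]

def is_blocked(command: str) -> bool:
    return any(pat in command for pat in BLOCKED)

assert is_blocked("sudo rm -rf / --no-preserve-root")
assert is_blocked("dd if=/dev/zero of=/dev/sda")
assert not is_blocked("ls -la /tmp")
```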
↑ / ↓ arrow key command history
Enter key runs command
Clear button resets output
Timeout up to 120 s
stdout + stderr in separate colours
Dual implementation: Python + Rust
Every LINUS-AI decision — inference, job submit, peer discovery, billing top-up — is recorded in an append-only SHA-256 hash chain backed by SQLite. Blocks are committed every 30 s or when 100 records accumulate. Merkle proofs enable selective verification.
Event occurs
inference · job · peer billing · hive
→
LedgerRecord
record_id (uuid) data dict → SHA-256
→
Pending buffer
accumulate until 30 s or 100 records
→
Block sealing
Merkle root of record hashes + prev_block_hash → SHA-256 block_hash
→
SQLite persist
table: blocks verify_chain() on read
Block Structure
block_number INTEGER PK
block_hash TEXT SHA-256
prev_hash TEXT chain link
merkle_root TEXT record summary
timestamp REAL
node_id TEXT
records_json TEXT JSON array
Merkle Proof
Given a record_hash, get_merkle_proof() returns the sibling-hash path from the record leaf to the block's Merkle root — enabling independent verification of any record without reading the entire chain.
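The verification step can be sketched end to end. The pairing rule (duplicate the last leaf on odd levels) and the (sibling_hash, is_left) proof shape are assumptions; get_merkle_proof() in the real ledger may pair leaves differently.

```python
import hashlib

def h(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

# Build a Merkle root over record hashes.
def merkle_root(leaves):
    level = leaves[:]
    while len(level) > 1:
        if len(level) % 2:           # duplicate last leaf on odd levels
            level.append(level[-1])
        level = [h((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

# Walk the sibling-hash path from a record leaf up to the root.
def verify(leaf, proof, root):
    """proof: list of (sibling_hash, is_left) from leaf to root."""
    acc = leaf
    for sibling, is_left in proof:
        acc = h(((sibling + acc) if is_left else (acc + sibling)).encode())
    return acc == root

leaves = [h(f"record-{i}".encode()) for i in range(4)]
root = merkle_root(leaves)
# Proof for leaf 2: sibling leaf 3 (right), then hash of (0,1) (left).
proof = [(leaves[3], False), (h((leaves[0] + leaves[1]).encode()), True)]
assert verify(leaves[2], proof, root)   # record verified without full chain
```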
The Rust binary has its own Ledger in linus-ai-blockchain/src/lib.rs — same design, native Rust implementation with rusqlite. All significant runtime events are recorded there.
Layer
Implementation
DB file
Python
blockchain/__init__.py TransparencyLedger
ledger.db
Python
blockchain/__init__.py PaymentLedger
payment_ledger.db
Rust
linus-ai-blockchain/src/lib.rs Ledger
linus_ai.db
Build Pipeline
./build.sh [target] [--llama-cpp2 | --candle-only] — single entry point for all targets and engines. The control panel HTML is embedded at compile time via include_str!(). Run ./build.sh list to see built targets and binary sizes.
--llama-cpp2
Statically links llama.cpp (C++). Best performance.
GPU added automatically: Metal on macOS, CUDA when nvcc is on PATH.
Output: dist/<target>/linus_ai ⚠ Cannot cross-compile from macOS → Linux/Windows.
Use natively on each Linux/Windows machine, or switch to --candle-only.
--candle-only
Pure-Rust HuggingFace candle. Cross-compile from any host.
GPU: Metal on macOS, CUDA on Linux (auto-detected). ~20-30% slower than llama-cpp2.
Output: dist/<target>-candle/linus_ai
Requires tokenizer.json alongside each .gguf at runtime.
Safe for all CI cross-compilation pipelines.
# Install deps (once)
pip3 install --break-system-packages pytest numpy
# All tests
bash run_tests.sh
# Python only / Rust only
bash run_tests.sh python
bash run_tests.sh rust
# Single file
python3 tests/pytest_runner.py tests/test_blockchain_payment.py -v
# Keyword filter
bash run_tests.sh python -k payment
# Single test
python3 tests/pytest_runner.py \
tests/test_pipeline_planner.py::TestLayerPlanner::test_three_nodes_roles -v
Test Infrastructure Notes
pytest_runner.py — stdlib shadow workaround
linus-ai ships a platform/ sub-package that shadows Python's stdlib platform module. pytest imports platform during its own startup — before conftest.py runs.
pytest_runner.py strips linus-ai paths from sys.path, pre-loads stdlib platform into sys.modules, then restores the path before handing off to pytest.main().
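The strip/pre-load/restore sequence looks roughly like this; a sketch of the technique, with the "linus-ai" path substring as an illustrative filter rather than the runner's exact logic:

```python
import sys

# Make the stdlib platform module win the import race, then restore
# sys.path before handing control to pytest.main().
saved = sys.path[:]
sys.path = [p for p in sys.path if "linus-ai" not in p]  # strip shadow paths
sys.modules.pop("platform", None)    # drop any already-shadowed import
import platform                      # now resolves to the stdlib module
assert hasattr(platform, "system")   # sanity check: real stdlib API
sys.path = saved                     # restore before pytest starts
```

Because sys.modules caches the stdlib version, later imports of platform inside pytest resolve to it even after the linus-ai paths return.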
conftest.py — module stubs
_build_stubs() populates sys.modules with lightweight stubs for all linus_ai.* sub-packages that aren't being tested. Real modules (blockchain, ai.pipeline, ai.mesh_model_manager) are loaded by absolute file path via load_module().
This avoids the full Python runtime startup (config files, network sockets, etc.) while still testing real production code paths.