Local Inference
depends on: zero-infra, privacy, performance, safety
The most private inference is inference that never leaves the device.
Principles
- Local by default, remote by choice
- User hardware is primary compute substrate
- Offline capability for degraded network conditions
- Zero round-trip for sensitive data paths
- Zero mandatory API key for baseline functionality
Local server inference (primary path)
Local inference servers run on the user's machine and expose an OpenAI-compatible API. This is the MUST-level requirement for Forever Agents — simple to set up, full GPU utilization, no browser constraints.
# Ollama
ollama serve
ollama run llama3.1
# llama.cpp
./llama-server -m model.gguf --port 8080
# LM Studio
# Launch GUI → start local server
All three expose the same OpenAI-compatible HTTP shape; only the default port differs (Ollama 11434, llama.cpp 8080 as launched above, LM Studio 1234). For example, against Ollama:
const response = await fetch('http://localhost:11434/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.1',
    messages: [{ role: 'user', content: 'Summarize this document.' }]
  })
});
const { choices } = await response.json();
console.log(choices[0].message.content);
Why this is the baseline: trivial installation, proper GPU utilization, full model selection, no browser memory limits, no WebGPU dependency. The network tab shows localhost traffic only — privacy is observable.
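Agents can check for a running local server before anything else. A minimal detection sketch, assuming the default ports above and an injectable `probe` function (both the endpoint list and helper names are illustrative, not a fixed API):

```javascript
// Common local inference endpoints, in preference order.
const DEFAULT_ENDPOINTS = [
  'http://localhost:11434/v1', // Ollama
  'http://localhost:8080/v1',  // llama.cpp llama-server (as launched above)
  'http://localhost:1234/v1',  // LM Studio
];

// Return the first base URL whose probe succeeds, or null if none respond.
// `probe` is injectable so detection logic can be tested without a network.
async function detectLocalServer(endpoints = DEFAULT_ENDPOINTS, probe = defaultProbe) {
  for (const base of endpoints) {
    if (await probe(base)) return base;
  }
  return null; // caller decides how to degrade
}

// Default probe: hit the OpenAI-compatible /models listing with a short timeout.
async function defaultProbe(base) {
  try {
    const res = await fetch(`${base}/models`, { signal: AbortSignal.timeout(1000) });
    return res.ok;
  } catch {
    return false;
  }
}
```

Because the probe is injectable, the fallback decision stays deterministic and testable even when no server is installed.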
In-browser inference (graceful downgrade)
For agents that want zero-server operation, WebLLM can load quantized models in-browser using WebGPU when available.
import { CreateMLCEngine } from '@mlc-ai/web-llm';

const engine = await CreateMLCEngine('Llama-3.1-8B-Instruct-q4f16_1');
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize this document.' }]
});
First run downloads model artifacts; subsequent runs can reuse browser cache.
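Since that first download can run to gigabytes, the agent should tell the user what to expect before starting it. A sketch of a disclosure helper; the function name, size estimate, and message wording are all illustrative (real sizes come from the model's own manifest):

```javascript
// Format a first-load disclosure from an estimated artifact size and a
// measured link speed. Pure formatting; no network access.
function firstLoadDisclosure(modelId, sizeBytes, bytesPerSecond) {
  const sizeGiB = (sizeBytes / 2 ** 30).toFixed(1);
  const etaMin = Math.ceil(sizeBytes / bytesPerSecond / 60);
  return `${modelId}: ~${sizeGiB} GiB download, about ${etaMin} min on your connection. Cached after first load.`;
}
```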
Use WebWorkers to keep UI responsive:
// main.js
const worker = new Worker('inference-worker.js', { type: 'module' });
worker.onmessage = ({ data }) => console.log(data); // receive completion
worker.postMessage({ prompt: 'Summarize this.' });
// inference-worker.js
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Create the engine once; re-creating it per message would reload the model.
const enginePromise = CreateMLCEngine('Llama-3.1-8B-Instruct-q4f16_1');

self.onmessage = async ({ data }) => {
  const engine = await enginePromise;
  const result = await engine.chat.completions.create({
    messages: [{ role: 'user', content: data.prompt }]
  });
  self.postMessage(result);
};
This is a MAY-level capability: powerful for maximum sovereignty, but not required to qualify as a Forever Agent.
In-browser embeddings (graceful downgrade)
import { pipeline } from '@xenova/transformers';
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embedding = await embedder('Document text here', { pooling: 'mean', normalize: true });
With browser storage/indexing, this supports local RAG pipelines without server round-trips. Like in-browser inference, this is a MAY-level graceful downgrade for zero-server deployments.
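The retrieval step of such a pipeline needs no server either. A minimal sketch, assuming embeddings are stored as plain number arrays like those produced above (`topK` and the chunk shape are illustrative, not a library API):

```javascript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored chunks against a query embedding; return the k best matches.
function topK(queryEmbedding, chunks, k = 3) {
  return chunks
    .map((c) => ({ ...c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

For small corpora a linear scan like this is fine; an index only becomes worthwhile at thousands of chunks.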
In-browser execution (graceful downgrade)
WebAssembly-based virtualization/emulation can provide Linux-like tool execution in-browser when needed. This is the most advanced graceful downgrade — a MAY-level capability for agents that need sandboxed execution without any local server process.
For agents
- Detect a local inference server first (Ollama, llama.cpp, LM Studio)
- Fall back to in-browser inference (WebLLM) if no local server is available
- Fall back to remote inference as a last resort, with explicit user consent
- Cache model assets and disclose first-load size/time for in-browser paths
- Run in-browser inference in workers, not main UI thread
- Use local embeddings/retrieval for sensitive document workflows
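The fallback order above can be sketched as a single decision function; the capability checks are injected so the policy itself stays testable (names here are illustrative):

```javascript
// Pick an inference backend in sovereignty order: local server first,
// in-browser WebGPU second, remote only with explicit user consent.
async function chooseBackend({ hasLocalServer, hasWebGPU, userConsentsToRemote }) {
  if (await hasLocalServer()) return 'local-server';
  if (await hasWebGPU()) return 'web-llm';
  if (await userConsentsToRemote()) return 'remote';
  throw new Error('No inference backend available without user consent');
}
```

Keeping the consent check inside the chain, rather than defaulting to remote, makes the "remote by choice" principle enforceable in code.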