Local Inference
depends on: zero-infra, privacy, performance, safety
The most private inference is inference that never leaves the device.
Principles
- Local by default, remote by choice
- User hardware is primary compute substrate
- Offline capability for degraded network conditions
- Zero round-trip for sensitive data paths
- Zero mandatory API key for baseline functionality
Local server inference (primary path)
Local inference servers run on the user's machine and expose an OpenAI-compatible API. This is the MUST-level requirement for Forever Agents — simple to set up, full GPU utilization, no browser constraints.
# Ollama
ollama serve
ollama run llama3.1
# llama.cpp
./llama-server -m model.gguf --port 8080
# LM Studio
# Launch GUI → start local server
All three expose the same OpenAI-compatible HTTP shape; only the default port differs (Ollama 11434, llama.cpp 8080 as launched above, LM Studio 1234). For example, against Ollama:
const response = await fetch('http://localhost:11434/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.1',
    messages: [{ role: 'user', content: 'Summarize this document.' }]
  })
});
const { choices } = await response.json();
console.log(choices[0].message.content);
Why this is the baseline: trivial installation, proper GPU utilization, full model selection, no browser memory limits, no WebGPU dependency. The network tab shows localhost traffic only — privacy is observable.
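Agents can check for a running local server before anything else. A minimal detection sketch, assuming the default ports above and an injectable `probe` function (both the endpoint list and helper names are illustrative, not a fixed API):

```javascript
// Common local inference endpoints, in preference order.
const DEFAULT_ENDPOINTS = [
  'http://localhost:11434/v1', // Ollama
  'http://localhost:8080/v1',  // llama.cpp llama-server (as launched above)
  'http://localhost:1234/v1',  // LM Studio
];

// Return the first base URL whose probe succeeds, or null if none respond.
// `probe` is injectable so detection logic can be tested without a network.
async function detectLocalServer(endpoints = DEFAULT_ENDPOINTS, probe = defaultProbe) {
  for (const base of endpoints) {
    if (await probe(base)) return base;
  }
  return null; // caller decides how to degrade
}

// Default probe: hit the OpenAI-compatible /models listing with a short timeout.
async function defaultProbe(base) {
  try {
    const res = await fetch(`${base}/models`, { signal: AbortSignal.timeout(1000) });
    return res.ok;
  } catch {
    return false;
  }
}
```

Because the probe is injectable, the fallback decision stays deterministic and testable even when no server is installed.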
In-browser inference (graceful downgrade)
For agents that want zero-server operation, WebLLM can load quantized models in-browser using WebGPU when available.
import { CreateMLCEngine } from '@mlc-ai/web-llm';

const engine = await CreateMLCEngine('Llama-3.1-8B-Instruct-q4f16_1');
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize this document.' }]
});
First run downloads model artifacts; subsequent runs can reuse browser cache.
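Since that first download can run to gigabytes, the agent should tell the user what to expect before starting it. A sketch of a disclosure helper; the function name, size estimate, and message wording are all illustrative (real sizes come from the model's own manifest):

```javascript
// Format a first-load disclosure from an estimated artifact size and a
// measured link speed. Pure formatting; no network access.
function firstLoadDisclosure(modelId, sizeBytes, bytesPerSecond) {
  const sizeGiB = (sizeBytes / 2 ** 30).toFixed(1);
  const etaMin = Math.ceil(sizeBytes / bytesPerSecond / 60);
  return `${modelId}: ~${sizeGiB} GiB download, about ${etaMin} min on your connection. Cached after first load.`;
}
```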
Use WebWorkers to keep UI responsive:
// main.js
const worker = new Worker('inference-worker.js', { type: 'module' });
worker.onmessage = ({ data }) => console.log(data); // receive completion
worker.postMessage({ prompt: 'Summarize this.' });
// inference-worker.js
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Create the engine once; re-creating it per message would reload the model.
const enginePromise = CreateMLCEngine('Llama-3.1-8B-Instruct-q4f16_1');

self.onmessage = async ({ data }) => {
  const engine = await enginePromise;
  const result = await engine.chat.completions.create({
    messages: [{ role: 'user', content: data.prompt }]
  });
  self.postMessage(result);
};
This is a MAY-level capability: powerful for maximum sovereignty, but not required to qualify as a Forever Agent.
In-browser embeddings (graceful downgrade)
import { pipeline } from '@xenova/transformers';
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embedding = await embedder('Document text here', { pooling: 'mean', normalize: true });
With browser storage/indexing, this supports local RAG pipelines without server round-trips. Like in-browser inference, this is a MAY-level graceful downgrade for zero-server deployments.
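The retrieval step of such a pipeline needs no server either. A minimal sketch, assuming embeddings are stored as plain number arrays like those produced above (`topK` and the chunk shape are illustrative, not a library API):

```javascript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank stored chunks against a query embedding; return the k best matches.
function topK(queryEmbedding, chunks, k = 3) {
  return chunks
    .map((c) => ({ ...c, score: cosineSimilarity(queryEmbedding, c.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}
```

For small corpora a linear scan like this is fine; an index only becomes worthwhile at thousands of chunks.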
In-browser execution (graceful downgrade)
WebAssembly-based virtualization/emulation can provide Linux-like tool execution in-browser when needed. This is the most advanced graceful downgrade — a MAY-level capability for agents that need sandboxed execution without any local server process.
For agents
- Detect a local inference server first (Ollama, llama.cpp, LM Studio)
- Fall back to in-browser inference (WebLLM) if no local server is available
- Fall back to remote inference as a last resort, with explicit user consent
- Cache model assets and disclose first-load size/time for in-browser paths
- Run in-browser inference in workers, not main UI thread
- Use local embeddings/retrieval for sensitive document workflows
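The fallback order above can be sketched as a single decision function; the capability checks are injected so the policy itself stays testable (names here are illustrative):

```javascript
// Pick an inference backend in sovereignty order: local server first,
// in-browser WebGPU second, remote only with explicit user consent.
async function chooseBackend({ hasLocalServer, hasWebGPU, userConsentsToRemote }) {
  if (await hasLocalServer()) return 'local-server';
  if (await hasWebGPU()) return 'web-llm';
  if (await userConsentsToRemote()) return 'remote';
  throw new Error('No inference backend available without user consent');
}
```

Keeping the consent check inside the chain, rather than defaulting to remote, makes the "remote by choice" principle enforceable in code.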