---
title: Local Inference
description: Local-first AI — local servers as primary, in-browser as graceful downgrade
tags: [inference, local-first, ollama, privacy]
dependencies: [zero-infra, privacy, performance, safety]
---

# Local Inference

The most private inference is inference that never leaves the device.

## Principles

- Local by default, remote by choice
- User hardware is primary compute substrate
- Offline capability for degraded network conditions
- Zero round-trip for sensitive data paths
- Zero mandatory API key for baseline functionality

## Local server inference (primary path)

Local inference servers run on the user's machine and expose an OpenAI-compatible API. This is the MUST-level requirement for Forever Agents — simple to set up, full GPU utilization, no browser constraints.

```sh
# Ollama
ollama serve
ollama run llama3.1

# llama.cpp
./llama-server -m model.gguf --port 8080

# LM Studio
# Launch GUI → start local server
```

All three expose an OpenAI-compatible HTTP endpoint; only the port differs (the example below uses Ollama's default, 11434):

```js
const response = await fetch('http://localhost:11434/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'llama3.1',
    messages: [{ role: 'user', content: 'Summarize this document.' }]
  })
});
```
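The response body follows the OpenAI chat-completions shape. A minimal sketch of extracting the assistant's reply — the sample payload below is illustrative, trimmed to the fields this sketch actually reads:

```js
// Illustrative payload in the OpenAI chat-completions shape.
const sample = {
  choices: [
    { message: { role: 'assistant', content: 'A short summary.' } }
  ]
};

// In real use: const data = await response.json();
function extractReply(data) {
  return data.choices?.[0]?.message?.content ?? '';
}

console.log(extractReply(sample)); // → 'A short summary.'
```

Defaulting to an empty string keeps the caller simple when a server returns no choices.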

Why this is the baseline: trivial installation, proper GPU utilization, full model selection, no browser memory limits, no WebGPU dependency. The network tab shows `localhost` traffic only — privacy is observable.

## In-browser inference (graceful downgrade)

For agents that want zero-server operation, WebLLM can load quantized models in-browser using WebGPU when available.

```js
import { CreateMLCEngine } from '@mlc-ai/web-llm';

const engine = await CreateMLCEngine('Llama-3.1-8B-Instruct-q4f16_1');
const reply = await engine.chat.completions.create({
  messages: [{ role: 'user', content: 'Summarize this document.' }]
});
```

First run downloads model artifacts; subsequent runs can reuse browser cache.

Run inference in a Web Worker to keep the UI responsive:

```js
// main.js
const worker = new Worker('inference-worker.js', { type: 'module' });
worker.postMessage({ prompt: 'Summarize this.' });

// inference-worker.js
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Create the engine once and reuse it across messages;
// rebuilding it per request would reload the model every time.
const enginePromise = CreateMLCEngine('Llama-3.1-8B-Instruct-q4f16_1');

self.onmessage = async ({ data }) => {
  const engine = await enginePromise;
  const result = await engine.chat.completions.create({
    messages: [{ role: 'user', content: data.prompt }]
  });
  self.postMessage(result);
};
```

This is a MAY-level capability: powerful for maximum sovereignty, but not required to qualify as a Forever Agent.

## In-browser embeddings (graceful downgrade)

```js
import { pipeline } from '@xenova/transformers';
const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
const embedding = await embedder('Document text here', { pooling: 'mean', normalize: true });
```

With browser storage/indexing, this supports local RAG pipelines without server round-trips. Like in-browser inference, this is a MAY-level graceful downgrade for zero-server deployments.
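With normalized embeddings, relevance scoring reduces to a dot product. A minimal sketch of the retrieval step — the 3-d vectors are toy stand-ins for real 384-d MiniLM output:

```js
// Dot product of two normalized vectors equals cosine similarity.
function cosine(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Rank documents by similarity to a query embedding, highest first.
function rank(queryVec, docs) {
  return docs
    .map(d => ({ ...d, score: cosine(queryVec, d.vec) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors standing in for embedder output.
const docs = [
  { id: 'a', vec: [1, 0, 0] },
  { id: 'b', vec: [0, 1, 0] }
];
console.log(rank([0.9, 0.1, 0], docs)[0].id); // → 'a'
```

Because the embedder is called with `normalize: true`, there is no need to divide by vector magnitudes here.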

## In-browser execution (graceful downgrade)

WebAssembly-based virtualization/emulation can provide Linux-like tool execution in-browser when needed. This is the most advanced graceful downgrade — a MAY-level capability for agents that need sandboxed execution without any local server process.

## For agents

1. Detect a local inference server first (Ollama, llama.cpp, LM Studio)
2. Fall back to in-browser inference (WebLLM) if no local server is available
3. Fall back to remote inference as a last resort, with explicit user consent
4. Cache model assets and disclose first-load size/time for in-browser paths
5. Run in-browser inference in workers, not main UI thread
6. Use local embeddings/retrieval for sensitive document workflows
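The fallback chain in steps 1–3 can be sketched as a probe over candidate endpoints. The candidate list and the injected `probe` function are illustrative assumptions, not a fixed API — real detection would also handle CORS and timeouts:

```js
// Candidate local servers in priority order. Ports are the common
// defaults (Ollama 11434, llama.cpp 8080, LM Studio 1234) but are
// user-configurable, so treat this list as a starting point.
const CANDIDATES = [
  { name: 'ollama', url: 'http://localhost:11434/v1/models' },
  { name: 'llama.cpp', url: 'http://localhost:8080/v1/models' },
  { name: 'lm-studio', url: 'http://localhost:1234/v1/models' }
];

// `probe` is injected (e.g. a fetch wrapper returning true on a 200)
// so the chain is testable without a live server.
async function detectBackend(probe, candidates = CANDIDATES) {
  for (const c of candidates) {
    if (await probe(c.url).catch(() => false)) return c.name;
  }
  return 'webllm'; // in-browser downgrade; remote only with explicit consent
}
```

Returning a backend name rather than a connection keeps the decision (which path to use) separate from the wiring (how to talk to it).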
