Browser-Only AI: WebGPU and WASM Inference
How UPAS runs AI inference entirely in the browser without any server-side processing.
The Server-Free Vision
Traditional AI applications send user queries to remote servers for processing. This creates dependencies:
- Network connectivity required
- API keys and authentication
- Data leaves the device
- Latency for each request
UPAS takes a different path: all inference happens in the browser.
WebGPU: GPU-Accelerated Inference
Modern browsers support WebGPU, a low-level graphics and compute API:
```typescript
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Load a model compiled for WebGPU (modelId names a prebuilt MLC model).
const engine = await CreateMLCEngine(modelId);

// OpenAI-style chat API; stream: true yields tokens as they are generated.
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: question }],
  stream: true,
});
```

WebGPU provides:
- GPU acceleration: Parallel compute on device GPU
- Streaming responses: Tokens appear as generated
- Large models: Can run 0.5B–3B parameter models
- No server: Everything runs locally
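To show what streaming looks like in practice, here is a minimal sketch of consuming the response from the snippet above, assuming web-llm's OpenAI-style chunk shape; `appendToken` is a hypothetical UI callback, not part of the library:

```typescript
// With stream: true, the response is an async iterable of chunks; each
// chunk carries the newly generated text in choices[0].delta.content.
let answer = '';
for await (const chunk of response) {
  const token = chunk.choices[0]?.delta?.content ?? '';
  answer += token;
  appendToken(token); // hypothetical hook: e.g. append to the chat view
}
```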
WASM Fallback
Not all devices support WebGPU, so UPAS falls back to WebAssembly (WASM):
```typescript
import { Wllama } from '@wllama/wllama';

// wasmAssets maps the library's WASM builds to the URLs they are served from.
const wllama = new Wllama(wasmAssets);

// Fetch a GGUF model, then run CPU-based completion.
await wllama.loadModelFromUrl(modelUrl);
const result = await wllama.createCompletion(prompt);
```

WASM provides:
- Universal compatibility: Works in any modern browser
- No GPU required: CPU-based inference
- Smaller models: Optimised for constrained devices
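For reference, the `wasmAssets` argument in the snippet above maps wllama's WASM binaries to their URLs. A sketch, assuming the path keys documented in the wllama README; the URLs here are placeholders for wherever the assets are bundled or self-hosted:

```typescript
// Illustrative only: keys follow the wllama README's asset-path convention,
// and the URLs are placeholders, not real deployment paths.
const wasmAssets = {
  'single-thread/wllama.wasm': '/assets/wllama/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/assets/wllama/multi-thread/wllama.wasm',
};
```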
Runtime Detection
UPAS automatically selects the best available runtime:
```typescript
// Prefer WebGPU when the browser exposes a usable adapter; otherwise use WASM.
async function selectRuntime() {
  if (navigator.gpu) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch {
      // requestAdapter can throw on some platforms; fall through to WASM.
    }
  }
  return 'wasm';
}
```

A badge in the UI indicates which runtime is active.
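Putting detection and loading together, a hedged sketch of how the two paths might be dispatched; `loadWebGpuEngine` and `loadWasmEngine` are hypothetical wrappers around the two snippets shown earlier:

```typescript
// Hypothetical glue: pick the runtime, load the matching engine, and return
// the choice so the UI badge can reflect which runtime is active.
async function initInference() {
  const runtime = await selectRuntime();
  const engine = runtime === 'webgpu'
    ? await loadWebGpuEngine()  // CreateMLCEngine path
    : await loadWasmEngine();   // Wllama path
  return { runtime, engine };
}
```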
Trade-offs
WebGPU
| Pro | Con |
|---|---|
| Fast inference | Requires modern browser |
| Streaming | Higher power consumption |
| Larger models | Device must have GPU |
WASM
| Pro | Con |
|---|---|
| Universal support | Slower inference |
| Lower power | Smaller models only |
| Works everywhere | No streaming |
Model Selection
Model choice affects both approaches:
| Model Size | WebGPU | WASM |
|---|---|---|
| 0.5B | Fast | Usable |
| 1B | Good | Slow |
| 3B | Moderate | Impractical |
For field deployments, 0.5B models often provide the best trade-off.
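One way this could look in code: a sketch of choosing a model from the active runtime and, where available, device memory. The model IDs are illustrative, and `navigator.deviceMemory` is a coarse, Chromium-only hint:

```typescript
// Illustrative model IDs; navigator.deviceMemory reports GiB in Chromium
// and is undefined elsewhere, so default conservatively.
function selectModel(runtime: 'webgpu' | 'wasm'): string {
  const memGiB = (navigator as any).deviceMemory ?? 4;
  if (runtime === 'wasm') return 'model-0.5b-q4';  // CPU inference: stay small
  if (memGiB >= 8) return 'model-3b-q4';           // roomier GPU devices
  if (memGiB >= 4) return 'model-1b-q4';
  return 'model-0.5b-q4';                          // field-deployment default
}
```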
Privacy Preserved
Because inference runs locally:
- Queries never leave the device
- No server logs of user input
- No API call traces
- Complete operational privacy
This matters enormously in humanitarian contexts that serve vulnerable populations.
Try It
UPAS is open source. See the documentation for setup instructions.
Wrap-up
Operational guidance shouldn't require constant connectivity. UPAS aims to work seamlessly — whether you're in a well-connected office or a remote field location.
If that sounds like the kind of tooling you want to explore — register your pilot interest or join the discussion on GitHub.