January 10, 2026

Browser-Only AI: WebGPU and WASM Inference

How UPAS runs AI inference entirely in the browser without any server-side processing.

UPAS Product Team

The Server-Free Vision

Traditional AI applications send user queries to remote servers for processing. This creates dependencies:

  • Network connectivity required
  • API keys and authentication
  • Data leaves the device
  • Latency for each request

UPAS takes a different path: all inference happens in the browser.

WebGPU: GPU-Accelerated Inference

Modern browsers support WebGPU, a low-level graphics and compute API:

// WebLLM downloads and compiles the model in the browser, then exposes an
// OpenAI-style chat completions API backed by WebGPU.
import { CreateMLCEngine } from '@mlc-ai/web-llm';

const engine = await CreateMLCEngine(modelId);
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: question }],
  stream: true, // yield tokens as they are generated
});
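
With stream: true, the call returns an async iterable of chunks. A minimal sketch of consuming it, assuming the response object from the snippet above and an output element whose id is purely illustrative:

// Append tokens to the page as they arrive. The 'answer' element id is
// an illustrative assumption, not part of the UPAS UI.
const output = document.getElementById('answer');
let text = '';
for await (const chunk of response) {
  text += chunk.choices[0]?.delta?.content ?? '';
  output.textContent = text;
}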

WebGPU provides:

  • GPU acceleration: Parallel compute on device GPU
  • Streaming responses: Tokens appear as generated
  • Large models: Can run 0.5B–3B parameter models
  • No server: Everything runs locally

WASM Fallback

Not all devices support WebGPU. UPAS falls back to WASM:

// wllama runs llama.cpp compiled to WebAssembly, so inference stays on the
// CPU and works without WebGPU.
import { Wllama } from '@wllama/wllama';

const wllama = new Wllama(wasmAssets);     // paths to the wllama WASM builds
await wllama.loadModelFromUrl(modelUrl);   // fetch a GGUF model into memory
const result = await wllama.createCompletion(prompt);
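
The same call also accepts options, which helps on constrained hardware. A minimal sketch based on wllama's completion options; the specific values are illustrative, not UPAS settings:

// Cap output length and use conservative sampling on CPU-only devices.
const capped = await wllama.createCompletion(prompt, {
  nPredict: 128,                                   // limit generated tokens
  sampling: { temp: 0.7, top_k: 40, top_p: 0.9 },  // sampling parameters
});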

WASM provides:

  • Universal compatibility: Works in any modern browser
  • No GPU required: CPU-based inference
  • Smaller models: Optimised for constrained devices

Runtime Detection

UPAS automatically selects the best available runtime:

async function selectRuntime() {
  // Prefer WebGPU when the browser exposes it and a GPU adapter is available.
  if (navigator.gpu) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch {
      // requestAdapter can fail on some platforms; fall through to WASM.
    }
  }
  // Otherwise use CPU-based WASM inference.
  return 'wasm';
}

A badge in the UI indicates which runtime is active.
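
Wiring the detection result to the matching engine is then a small piece of glue. The sketch below is illustrative rather than the UPAS implementation; modelId, wasmAssets, and modelUrl are assumed to come from application configuration:

import { CreateMLCEngine } from '@mlc-ai/web-llm';
import { Wllama } from '@wllama/wllama';

// Illustrative glue: initialise whichever engine matches the detected runtime.
// modelId, wasmAssets and modelUrl are assumed configuration values.
async function initEngine(modelId, wasmAssets, modelUrl) {
  const runtime = await selectRuntime();
  if (runtime === 'webgpu') {
    const engine = await CreateMLCEngine(modelId);
    return { runtime, engine };
  }
  const wllama = new Wllama(wasmAssets);
  await wllama.loadModelFromUrl(modelUrl);
  return { runtime, engine: wllama };
}

The returned runtime value is also what drives the badge in the UI.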

Trade-offs

WebGPU

Pro            | Con
Fast inference | Requires modern browser
Streaming      | Higher power consumption
Larger models  | Device must have GPU

WASM

Pro               | Con
Universal support | Slower inference
Lower power       | Smaller models only
Works everywhere  | No streaming

Model Selection

Model choice affects both approaches:

Model Size | WebGPU   | WASM
0.5B       | Fast     | Usable
1B         | Good     | Slow
3B         | Moderate | Impractical

For field deployments, 0.5B models often provide the best trade-off.
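
One way to encode that trade-off is a small catalogue keyed by runtime. The model identifier and URL below are placeholders, not the models UPAS ships:

// Placeholder model references, shown only to illustrate the mapping.
const MODELS = {
  webgpu: 'Qwen2.5-0.5B-Instruct-q4f16_1-MLC',                // ~0.5B MLC build for WebLLM
  wasm: 'https://example.org/models/model-0.5b-q4_k_m.gguf',  // ~0.5B GGUF build for wllama
};

// Default to the ~0.5B option on both paths for field deployments.
function pickModel(runtime) {
  return MODELS[runtime];
}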

Privacy Preserved

Because inference runs locally:

  • Queries never leave the device
  • No server logs of user input
  • No API call traces
  • Complete operational privacy

This matters enormously for humanitarian contexts with vulnerable populations.

Try It

UPAS is open source. See the documentation for setup instructions.

Wrap-up

Operational guidance shouldn't require constant connectivity. UPAS aims to work seamlessly — whether you're in a well-connected office or a remote field location.

If that sounds like the kind of tooling you want to explore, register your pilot interest or join the discussion on GitHub.