Browser-Only AI: WebGPU and WASM Inference
How UPAS runs AI inference entirely in the browser without any server-side processing.
The Server-Free Vision
Traditional AI applications send user queries to remote servers for processing. This creates dependencies:
- Network connectivity required
- API keys and authentication
- Data leaves the device
- Latency for each request
UPAS takes a different path: all inference happens in the browser.
WebGPU: GPU-Accelerated Inference
Modern browsers support WebGPU, a low-level graphics and compute API:
```typescript
import { CreateMLCEngine } from '@mlc-ai/web-llm';

// Load a model compiled for WebGPU (modelId names a prebuilt MLC model).
const engine = await CreateMLCEngine(modelId);

// OpenAI-style chat API; stream: true yields tokens as they are generated.
const response = await engine.chat.completions.create({
  messages: [{ role: "user", content: question }],
  stream: true,
});
```

WebGPU provides:
- GPU acceleration: Parallel compute on device GPU
- Streaming responses: Tokens appear as generated
- Large models: Can run 0.5B–3B parameter models
- No server: Everything runs locally
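To show what streaming looks like in practice, here is a minimal sketch of consuming the response from the snippet above, assuming web-llm's OpenAI-style chunk shape; `appendToken` is a hypothetical UI callback, not part of the library:

```typescript
// With stream: true, the response is an async iterable of chunks; each
// chunk carries the newly generated text in choices[0].delta.content.
let answer = '';
for await (const chunk of response) {
  const token = chunk.choices[0]?.delta?.content ?? '';
  answer += token;
  appendToken(token); // hypothetical hook: e.g. append to the chat view
}
```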
WASM Fallback
Not all devices support WebGPU, so UPAS falls back to WebAssembly (WASM):
```typescript
import { Wllama } from '@wllama/wllama';

// wasmAssets maps the library's WASM builds to the URLs they are served from.
const wllama = new Wllama(wasmAssets);

// Fetch a GGUF model, then run CPU-based completion.
await wllama.loadModelFromUrl(modelUrl);
const result = await wllama.createCompletion(prompt);
```

WASM provides:
- Universal compatibility: Works in any modern browser
- No GPU required: CPU-based inference
- Smaller models: Optimised for constrained devices
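For reference, the `wasmAssets` argument in the snippet above maps wllama's WASM binaries to their URLs. A sketch, assuming the path keys documented in the wllama README; the URLs here are placeholders for wherever the assets are bundled or self-hosted:

```typescript
// Illustrative only: keys follow the wllama README's asset-path convention,
// and the URLs are placeholders, not real deployment paths.
const wasmAssets = {
  'single-thread/wllama.wasm': '/assets/wllama/single-thread/wllama.wasm',
  'multi-thread/wllama.wasm': '/assets/wllama/multi-thread/wllama.wasm',
};
```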
Runtime Detection
UPAS automatically selects the best available runtime:
```typescript
// Prefer WebGPU when the browser exposes a usable adapter; otherwise use WASM.
async function selectRuntime() {
  if (navigator.gpu) {
    try {
      const adapter = await navigator.gpu.requestAdapter();
      if (adapter) return 'webgpu';
    } catch {
      // requestAdapter can throw on some platforms; fall through to WASM.
    }
  }
  return 'wasm';
}
```

A badge in the UI indicates which runtime is active.
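Putting detection and loading together, a hedged sketch of how the two paths might be dispatched; `loadWebGpuEngine` and `loadWasmEngine` are hypothetical wrappers around the two snippets shown earlier:

```typescript
// Hypothetical glue: pick the runtime, load the matching engine, and return
// the choice so the UI badge can reflect which runtime is active.
async function initInference() {
  const runtime = await selectRuntime();
  const engine = runtime === 'webgpu'
    ? await loadWebGpuEngine()  // CreateMLCEngine path
    : await loadWasmEngine();   // Wllama path
  return { runtime, engine };
}
```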
Trade-offs
WebGPU
| Pro | Con |
|---|---|
| Fast inference | Requires modern browser |
| Streaming | Higher power consumption |
| Larger models | Device must have GPU |
WASM
| Pro | Con |
|---|---|
| Universal support | Slower inference |
| Lower power | Smaller models only |
| Works everywhere | No streaming |
Model Selection
Model choice affects both approaches:
| Model Size | WebGPU | WASM |
|---|---|---|
| 0.5B | Fast | Usable |
| 1B | Good | Slow |
| 3B | Moderate | Impractical |
For field deployments, 0.5B models often provide the best trade-off.
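One way this could look in code: a sketch of choosing a model from the active runtime and, where available, device memory. The model IDs are illustrative, and `navigator.deviceMemory` is a coarse, Chromium-only hint:

```typescript
// Illustrative model IDs; navigator.deviceMemory reports GiB in Chromium
// and is undefined elsewhere, so default conservatively.
function selectModel(runtime: 'webgpu' | 'wasm'): string {
  const memGiB = (navigator as any).deviceMemory ?? 4;
  if (runtime === 'wasm') return 'model-0.5b-q4';  // CPU inference: stay small
  if (memGiB >= 8) return 'model-3b-q4';           // roomier GPU devices
  if (memGiB >= 4) return 'model-1b-q4';
  return 'model-0.5b-q4';                          // field-deployment default
}
```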
Privacy Preserved
Because inference runs locally:
- Queries never leave the device
- No server logs of user input
- No API call traces
- Complete operational privacy
This matters enormously in humanitarian contexts that serve vulnerable populations.
Try It
UPAS is open source. See the documentation for setup instructions.
Wrap-up
Operational guidance shouldn't require constant connectivity. UPAS aims to work seamlessly — whether you're in a well-connected office or a remote field location.
If that sounds like the kind of tooling you want to explore — register your pilot interest or join the discussion on GitHub.