# Performance

This guide covers performance optimisation strategies for UPAS deployments.
## Model Selection
Model choice significantly impacts performance:
| Model Size | Download Size | Memory (RAM) | Inference Speed |
|---|---|---|---|
| 0.5B params | ~300MB | ~1GB | Fast |
| 1B params | ~600MB | ~2GB | Moderate |
| 3B params | ~1.5GB | ~4GB | Slow |
| 7B+ params | ~4GB+ | ~8GB+ | Very slow |
For field deployments, 0.5B–1B parameter models typically provide the best balance of capability and performance on mobile devices.
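As a rough sketch of how the table above might drive model selection at startup (the `MODEL_TIERS` data and the `pickModelTier` helper are illustrative assumptions, not part of the UPAS API):

```javascript
// Approximate RAM needed per model tier, taken from the table above.
const MODEL_TIERS = [
  { params: '0.5B', memoryGB: 1 },
  { params: '1B',   memoryGB: 2 },
  { params: '3B',   memoryGB: 4 },
  { params: '7B',   memoryGB: 8 },
];

// Pick the largest tier that fits in the available device memory,
// leaving some headroom for the rest of the app.
function pickModelTier(deviceMemoryGB, headroomGB = 1) {
  const budget = deviceMemoryGB - headroomGB;
  const fitting = MODEL_TIERS.filter((t) => t.memoryGB <= budget);
  return fitting.length ? fitting[fitting.length - 1] : null;
}

// On a 4 GB phone this selects the 1B tier (2 GB model + headroom).
```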
## Quantisation
Model quantisation reduces size and memory usage:
- Q4: 4-bit quantisation, smallest, some quality loss
- Q8: 8-bit quantisation, larger, better quality
- F16: Half precision, largest, best quality
UPAS defaults to Q4 quantisation for WebLLM models.
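The size figures follow from a simple approximation: one weight per parameter at the quantisation's bits-per-weight, ignoring metadata overhead. A minimal sketch (the `approxSizeGB` helper is illustrative, not a UPAS function):

```javascript
// Approximate on-disk size of a quantised model:
// paramCount weights × bitsPerWeight bits each, converted to gigabytes.
function approxSizeGB(paramCount, bitsPerWeight) {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return bytes / 1e9;
}

approxSizeGB(1e9, 4);  // ~0.5 GB for a 1B-parameter model at Q4
approxSizeGB(1e9, 16); // ~2 GB at F16
```

Real downloads run slightly larger than this estimate because of tokenizer files and shard metadata.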
## First Load Optimisation
The first load is the slowest due to model download:
### Progressive Loading
Show useful content while models load:
```javascript
// Show UI immediately
renderAppShell();

// Load model in background
loadModel().then(() => {
  enableAIFeatures();
});
```

### Download Progress
Display clear progress indicators:
```javascript
const engine = await CreateMLCEngine(modelId, {
  initProgressCallback: (progress) => {
    updateProgressBar(progress.progress);
    updateStatusText(progress.text);
  },
});
```

### Chunked Downloads
WebLLM downloads models in shards. Ensure your CDN supports:
- Range requests
- Resumable downloads
- Parallel shard fetching
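As a sketch of how shard byte ranges for parallel Range requests could be computed (the `shardRanges` helper is illustrative; WebLLM manages its own sharding internally):

```javascript
// Split a file of totalBytes into { start, end } byte ranges of shardSize,
// suitable for HTTP Range requests ("bytes=start-end", inclusive ends).
function shardRanges(totalBytes, shardSize) {
  const ranges = [];
  for (let start = 0; start < totalBytes; start += shardSize) {
    const end = Math.min(start + shardSize, totalBytes) - 1;
    ranges.push({ start, end });
  }
  return ranges;
}

// Each range could then be fetched in parallel with:
//   fetch(url, { headers: { Range: `bytes=${start}-${end}` } })
```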
## Runtime Performance
### WebGPU Optimisation

For the WebGPU runtime:
- Prefer devices with dedicated GPUs
- Close other GPU-intensive applications
- Ensure browser is up to date
### WASM Optimisation

For the WASM fallback:
- Use smaller models (WASM is CPU-bound)
- Reduce context length
- Limit concurrent operations
```javascript
// Optimised WASM configuration
const wllama = new Wllama(wasmAssets);
await wllama.loadModelFromUrl(modelUrl, {
  n_ctx: 1024,  // Reduced context
  n_batch: 128, // Smaller batches
});
```

## Cache Performance
### Cache-First Strategy

Maximise cache hits by serving cached responses before touching the network:
```javascript
async function cacheFirst(request) {
  const cached = await caches.match(request);
  if (cached) {
    // Return cached immediately
    return cached;
  }

  // Fetch and cache
  const response = await fetch(request);
  const cache = await caches.open(CACHE_NAME);
  cache.put(request, response.clone());
  return response;
}
```

### Preloading
Preload critical resources:
```html
<link rel="preload" href="/app.js" as="script">
<link rel="preload" href="/styles.css" as="style">
```

## Memory Management
### Monitor Memory Usage
Track memory during inference:
```javascript
// Note: performance.memory is non-standard (Chromium-based browsers only)
if (performance.memory) {
  console.log('Heap used:', performance.memory.usedJSHeapSize);
  console.log('Heap limit:', performance.memory.jsHeapSizeLimit);
}
```

### Cleanup
Release resources when not needed:
```javascript
// Dispose of engine when done
engine.dispose();

// Clear unused caches
const cacheNames = await caches.keys();
for (const name of cacheNames) {
  if (isOutdated(name)) {
    await caches.delete(name);
  }
}
```

## Core Web Vitals
Target metrics for UPAS:
| Metric | Target | Notes |
|---|---|---|
| LCP | < 2.5s | After cache warm-up |
| FID | < 100ms | UI should remain responsive |
| CLS | < 0.1 | Avoid layout shifts during load |
| TTFB | < 600ms | For cached responses |
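The targets above can be checked programmatically in a monitoring hook; a minimal sketch (thresholds mirror the table; the `meetsTarget` helper is illustrative):

```javascript
// "Good" thresholds from the table above, in ms (CLS is unitless).
const TARGETS = { LCP: 2500, FID: 100, CLS: 0.1, TTFB: 600 };

// Returns true when a measured value meets the target for that metric.
function meetsTarget(metric, value) {
  return value < TARGETS[metric];
}

// In the browser, LCP values can be collected with a PerformanceObserver:
//   new PerformanceObserver((list) => { /* inspect list.getEntries() */ })
//     .observe({ type: 'largest-contentful-paint', buffered: true });
```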
## Measuring Performance
Use the Performance API:
```javascript
// Measure model load time
performance.mark('model-load-start');
await loadModel();
performance.mark('model-load-end');

performance.measure('model-load', 'model-load-start', 'model-load-end');
const measure = performance.getEntriesByName('model-load')[0];
console.log('Model load time:', measure.duration);
```

## Device-Specific Tuning
### Mobile Devices
- Use smaller models
- Reduce batch sizes
- Implement aggressive caching
- Show clear loading states
### Low-End Devices

- Prefer WASM with tiny models
- Consider making degraded mode the default
- Minimise concurrent operations
### Desktop/Laptop
- Can use larger models
- Enable WebGPU when available
- Allow longer context lengths
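The three profiles above could be selected at startup from coarse device signals; a hedged sketch (`chooseProfile`, the signal shape, and the profile values are assumptions, not UPAS APIs):

```javascript
// Map coarse capability signals to a tuning profile.
// In the browser: hasWebGPU ~ !!navigator.gpu, memoryGB ~ navigator.deviceMemory.
function chooseProfile({ hasWebGPU, memoryGB, isMobile }) {
  if (memoryGB <= 2) {
    // Low-end: tiny model, short context, degraded-friendly defaults
    return { runtime: 'wasm', model: '0.5B', contextLength: 1024 };
  }
  if (isMobile) {
    // Mobile: small model, WebGPU when available
    return { runtime: hasWebGPU ? 'webgpu' : 'wasm', model: '1B', contextLength: 2048 };
  }
  // Desktop/laptop: larger model and longer context
  return { runtime: hasWebGPU ? 'webgpu' : 'wasm', model: '3B', contextLength: 4096 };
}
```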
## Next Steps
- Deployment — Production deployment
- Configuration — Runtime settings