
Performance optimisation strategies for UPAS

This guide covers performance optimisation for UPAS deployments.

Model Selection

Model choice significantly impacts performance:

| Model Size | Download | Memory | Inference Speed |
| --- | --- | --- | --- |
| 0.5B params | ~300MB | ~1GB | Fast |
| 1B params | ~600MB | ~2GB | Moderate |
| 3B params | ~1.5GB | ~4GB | Slow |
| 7B+ params | ~4GB+ | ~8GB+ | Very slow |

For field deployments, 0.5B–1B parameter models typically provide the best balance of capability and performance on mobile devices.
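One way to apply this guidance at runtime is to pick a model tier from the device's reported memory. A minimal sketch, assuming the tier names above; `pickModelTier` is a hypothetical helper, and `navigator.deviceMemory` is a Chromium-only hint (reported in GB, capped at 8):

```javascript
// Hypothetical helper: map available device memory (GB) to a model tier.
// Thresholds mirror the memory column in the table above.
function pickModelTier(memoryGB) {
  if (memoryGB >= 8) return '3B';   // headroom for a ~4GB working set
  if (memoryGB >= 4) return '1B';   // ~2GB working set
  return '0.5B';                    // ~1GB working set for constrained devices
}

// Browser usage (sketch): default to a mid-range guess when the hint is absent.
// const tier = pickModelTier(navigator.deviceMemory ?? 4);
```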

Quantisation

Model quantisation reduces size and memory usage:

  • Q4: 4-bit quantisation, smallest, some quality loss
  • Q8: 8-bit quantisation, larger, better quality
  • F16: Half precision, largest, best quality

UPAS defaults to Q4 quantisation for WebLLM models.
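The download sizes above follow directly from the quantisation level: weight size is roughly parameters × bits-per-weight ÷ 8 bytes. A back-of-envelope sketch (the function name is illustrative; real model files run somewhat larger because of embeddings, metadata, and layers left unquantised):

```javascript
// Rough weight size: parameters × bits-per-weight ÷ 8 bytes.
function estimateWeightBytes(params, bits) {
  return (params * bits) / 8;
}

const gb = (bytes) => bytes / 1e9;
console.log(gb(estimateWeightBytes(0.5e9, 4)).toFixed(2)); // 0.25 GB for a 0.5B Q4 model
```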

First Load Optimisation

The first load is the slowest because the model must be downloaded before inference can start. The techniques below keep the app usable while that happens.

Progressive Loading

Show useful content while models load:

// Show UI immediately
renderAppShell();

// Load model in background
loadModel().then(() => {
  enableAIFeatures();
});

Download Progress

Display clear progress indicators:

const engine = await CreateMLCEngine(modelId, {
  initProgressCallback: (progress) => {
    updateProgressBar(progress.progress);
    updateStatusText(progress.text);
  },
});

Chunked Downloads

WebLLM downloads models in shards. Ensure your CDN supports:

  • Range requests
  • Resumable downloads
  • Parallel shard fetching
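You can verify range support before relying on it. A minimal sketch, assuming a `HEAD` request against a shard URL; `supportsRanges` is a hypothetical helper that checks the standard `Accept-Ranges` response header:

```javascript
// Hypothetical check that a CDN endpoint supports range-based, resumable
// shard downloads. Pass it the headers from a HEAD response (or any object
// with a .get(name) method, such as a Map keyed by lowercase header name).
function supportsRanges(headers) {
  const accept = (headers.get('accept-ranges') || '').toLowerCase();
  return accept === 'bytes';
}

// Browser usage (sketch):
// const res = await fetch(shardUrl, { method: 'HEAD' });
// if (!supportsRanges(res.headers)) console.warn('CDN lacks range support');
```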

Runtime Performance

WebGPU Optimisation

For WebGPU runtime:

  • Prefer devices with dedicated GPUs
  • Close other GPU-intensive applications
  • Ensure browser is up to date
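Runtime selection itself can be made explicit. A minimal sketch, assuming a simple WebGPU feature probe; `pickRuntime` is a hypothetical helper, not part of the UPAS API:

```javascript
// Sketch: choose a runtime label from WebGPU availability.
function pickRuntime(hasWebGPU) {
  return hasWebGPU ? 'webgpu' : 'wasm';
}

// Browser usage (sketch): requestAdapter() resolves to null when no
// suitable GPU adapter is available, so a null adapter means WASM fallback.
// const adapter = 'gpu' in navigator ? await navigator.gpu.requestAdapter() : null;
// const runtime = pickRuntime(Boolean(adapter));
```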

WASM Optimisation

For WASM fallback:

  • Use smaller models (WASM is CPU-bound)
  • Reduce context length
  • Limit concurrent operations

// Optimised WASM configuration
const wllama = new Wllama(wasmAssets);
await wllama.loadModelFromUrl(modelUrl, {
  n_ctx: 1024,  // Reduced context
  n_batch: 128, // Smaller batches
});

Cache Performance

Cache-First Strategy

Optimise cache hits:

async function cacheFirst(request) {
  const cached = await caches.match(request);
  if (cached) {
    // Serve from cache immediately
    return cached;
  }

  // Fetch from the network, caching successful responses only
  // (caching error responses would poison the cache)
  const response = await fetch(request);
  if (response.ok) {
    const cache = await caches.open(CACHE_NAME);
    cache.put(request, response.clone());
  }
  return response;
}

Preloading

Preload critical resources:

<link rel="preload" href="/app.js" as="script">
<link rel="preload" href="/styles.css" as="style">

Memory Management

Monitor Memory Usage

Track memory during inference:

// performance.memory is non-standard and Chromium-only, hence the guard
if (performance.memory) {
  console.log('Heap used:', performance.memory.usedJSHeapSize);
  console.log('Heap limit:', performance.memory.jsHeapSizeLimit);
}
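Those numbers are most useful as a trigger for cleanup. A minimal sketch, assuming a simple used/limit ratio; `underMemoryPressure` and the 0.85 threshold are illustrative, not UPAS constants:

```javascript
// Sketch: flag memory pressure from heap usage so callers can unload models
// or clear caches before the tab runs out of memory.
function underMemoryPressure(usedBytes, limitBytes, threshold = 0.85) {
  return usedBytes / limitBytes >= threshold;
}

// Browser usage (sketch, Chromium-only performance.memory):
// if (performance.memory && underMemoryPressure(
//     performance.memory.usedJSHeapSize, performance.memory.jsHeapSizeLimit)) {
//   releaseResources();
// }
```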

Cleanup

Release resources when not needed:

// Unload the engine when done to free model memory
await engine.unload();

// Clear unused caches
const cacheNames = await caches.keys();
for (const name of cacheNames) {
  if (isOutdated(name)) {
    await caches.delete(name);
  }
}
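The cleanup loop above leaves `isOutdated` to the application. A minimal sketch, assuming caches are named with a versioned prefix; the `upas-cache-` naming scheme and `CACHE_NAME` value here are illustrative:

```javascript
// Assumed naming scheme: one versioned cache per deployment.
const CACHE_NAME = 'upas-cache-v3';

// A cache is outdated if it belongs to this app but is not the current version.
// Leaving unrelated caches alone avoids clobbering other apps on the origin.
function isOutdated(name) {
  return name.startsWith('upas-cache-') && name !== CACHE_NAME;
}
```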

Core Web Vitals

Target metrics for UPAS:

| Metric | Target | Notes |
| --- | --- | --- |
| LCP | < 2.5s | After cache warm-up |
| FID | < 100ms | UI should remain responsive; FID has since been superseded by INP (target < 200ms) |
| CLS | < 0.1 | Avoid layout shifts during load |
| TTFB | < 600ms | For cached responses |
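These targets can be checked programmatically. A minimal sketch for LCP, using the standard Web Vitals thresholds (good ≤ 2500ms, needs-improvement ≤ 4000ms); `rateLCP` is a hypothetical helper:

```javascript
// Classify an LCP sample (ms) against the standard Web Vitals thresholds.
function rateLCP(ms) {
  if (ms <= 2500) return 'good';
  if (ms <= 4000) return 'needs-improvement';
  return 'poor';
}
```

In production you would typically collect the sample itself with the `web-vitals` package's `onLCP()` rather than hand-rolling a `PerformanceObserver`.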

Measuring Performance

Use the Performance API:

// Measure model load time
performance.mark('model-load-start');
await loadModel();
performance.mark('model-load-end');
performance.measure('model-load', 'model-load-start', 'model-load-end');

const measure = performance.getEntriesByName('model-load')[0];
console.log('Model load time:', measure.duration);

Device-Specific Tuning

Mobile Devices

  • Use smaller models
  • Reduce batch sizes
  • Implement aggressive caching
  • Show clear loading states

Low-End Devices

  • Prefer WASM with tiny models
  • Consider degraded mode as default
  • Minimise concurrent operations

Desktop/Laptop

  • Can use larger models
  • Enable WebGPU when available
  • Allow longer context lengths
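The device-specific advice above can be folded into a single chooser. A minimal sketch, assuming the model tiers from earlier in this guide; `chooseConfig`, the tier names, and the context lengths are illustrative, not UPAS constants:

```javascript
// Sketch: combine runtime, memory, and form-factor signals into one config.
function chooseConfig({ hasWebGPU, memoryGB, isMobile }) {
  if (!hasWebGPU && memoryGB < 4) {
    // Low-end device: WASM with a tiny model and short context
    return { runtime: 'wasm', model: '0.5B', n_ctx: 512 };
  }
  if (isMobile) {
    // Mobile: smaller model, modest context, aggressive caching assumed elsewhere
    return { runtime: hasWebGPU ? 'webgpu' : 'wasm', model: '1B', n_ctx: 1024 };
  }
  // Desktop/laptop: larger model and longer context
  return { runtime: hasWebGPU ? 'webgpu' : 'wasm', model: '3B', n_ctx: 4096 };
}
```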

Next Steps