# Performance

This guide covers performance optimisation strategies for UPAS deployments.
## Model Selection
Model choice significantly impacts performance:
| Model Size | Download Size | Memory (RAM) | Inference Speed |
|---|---|---|---|
| 0.5B params | ~300MB | ~1GB | Fast |
| 1B params | ~600MB | ~2GB | Moderate |
| 3B params | ~1.5GB | ~4GB | Slow |
| 7B+ params | ~4GB+ | ~8GB+ | Very slow |
For field deployments, 0.5B–1B parameter models typically provide the best balance of capability and performance on mobile devices.
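As a rough sketch of how the table above might drive model selection at startup (the `MODEL_TIERS` data and the `pickModelTier` helper are illustrative assumptions, not part of the UPAS API):

```javascript
// Approximate RAM needed per model tier, taken from the table above.
const MODEL_TIERS = [
  { params: '0.5B', memoryGB: 1 },
  { params: '1B',   memoryGB: 2 },
  { params: '3B',   memoryGB: 4 },
  { params: '7B',   memoryGB: 8 },
];

// Pick the largest tier that fits in the available device memory,
// leaving some headroom for the rest of the app.
function pickModelTier(deviceMemoryGB, headroomGB = 1) {
  const budget = deviceMemoryGB - headroomGB;
  const fitting = MODEL_TIERS.filter((t) => t.memoryGB <= budget);
  return fitting.length ? fitting[fitting.length - 1] : null;
}

// On a 4 GB phone this selects the 1B tier (2 GB model + headroom).
```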
## Quantisation
Model quantisation reduces size and memory usage:
- Q4: 4-bit quantisation, smallest, some quality loss
- Q8: 8-bit quantisation, larger, better quality
- F16: Half precision, largest, best quality
UPAS defaults to Q4 quantisation for WebLLM models.
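The size figures follow from a simple approximation: one weight per parameter at the quantisation's bits-per-weight, ignoring metadata overhead. A minimal sketch (the `approxSizeGB` helper is illustrative, not a UPAS function):

```javascript
// Approximate on-disk size of a quantised model:
// paramCount weights × bitsPerWeight bits each, converted to gigabytes.
function approxSizeGB(paramCount, bitsPerWeight) {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return bytes / 1e9;
}

approxSizeGB(1e9, 4);  // ~0.5 GB for a 1B-parameter model at Q4
approxSizeGB(1e9, 16); // ~2 GB at F16
```

Real downloads run slightly larger than this estimate because of tokenizer files and shard metadata.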
## First Load Optimisation
The first load is the slowest due to model download:
### Progressive Loading
Show useful content while models load:
```javascript
// Show UI immediately
renderAppShell();

// Load model in background
loadModel().then(() => {
  enableAIFeatures();
});
```

### Download Progress
Display clear progress indicators:
```javascript
const engine = await CreateMLCEngine(modelId, {
  initProgressCallback: (progress) => {
    updateProgressBar(progress.progress);
    updateStatusText(progress.text);
  },
});
```

### Chunked Downloads
WebLLM downloads models in shards. Ensure your CDN supports:
- Range requests
- Resumable downloads
- Parallel shard fetching
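As a sketch of how shard byte ranges for parallel Range requests could be computed (the `shardRanges` helper is illustrative; WebLLM manages its own sharding internally):

```javascript
// Split a file of totalBytes into { start, end } byte ranges of shardSize,
// suitable for HTTP Range requests ("bytes=start-end", inclusive ends).
function shardRanges(totalBytes, shardSize) {
  const ranges = [];
  for (let start = 0; start < totalBytes; start += shardSize) {
    const end = Math.min(start + shardSize, totalBytes) - 1;
    ranges.push({ start, end });
  }
  return ranges;
}

// Each range could then be fetched in parallel with:
//   fetch(url, { headers: { Range: `bytes=${start}-${end}` } })
```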
## Runtime Performance
### WebGPU Optimisation

For the WebGPU runtime:
- Prefer devices with dedicated GPUs
- Close other GPU-intensive applications
- Ensure browser is up to date
### WASM Optimisation

For the WASM fallback:
- Use smaller models (WASM is CPU-bound)
- Reduce context length
- Limit concurrent operations
```javascript
// Optimised WASM configuration
const wllama = new Wllama(wasmAssets);
await wllama.loadModelFromUrl(modelUrl, {
  n_ctx: 1024,  // Reduced context
  n_batch: 128, // Smaller batches
});
```

## Cache Performance
### Cache-First Strategy

Maximise cache hits by serving cached responses before touching the network:
```javascript
async function cacheFirst(request) {
  const cached = await caches.match(request);
  if (cached) {
    // Return cached immediately
    return cached;
  }

  // Fetch and cache
  const response = await fetch(request);
  const cache = await caches.open(CACHE_NAME);
  cache.put(request, response.clone());
  return response;
}
```

### Preloading
Preload critical resources:
```html
<link rel="preload" href="/app.js" as="script">
<link rel="preload" href="/styles.css" as="style">
```

## Memory Management
### Monitor Memory Usage
Track memory during inference:
```javascript
// Note: performance.memory is non-standard (Chromium-based browsers only)
if (performance.memory) {
  console.log('Heap used:', performance.memory.usedJSHeapSize);
  console.log('Heap limit:', performance.memory.jsHeapSizeLimit);
}
```

### Cleanup
Release resources when not needed:
```javascript
// Dispose of engine when done
engine.dispose();

// Clear unused caches
const cacheNames = await caches.keys();
for (const name of cacheNames) {
  if (isOutdated(name)) {
    await caches.delete(name);
  }
}
```

## Core Web Vitals
Target metrics for UPAS:
| Metric | Target | Notes |
|---|---|---|
| LCP | < 2.5s | After cache warm-up |
| FID | < 100ms | UI should remain responsive |
| CLS | < 0.1 | Avoid layout shifts during load |
| TTFB | < 600ms | For cached responses |
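The targets above can be checked programmatically in a monitoring hook; a minimal sketch (thresholds mirror the table; the `meetsTarget` helper is illustrative):

```javascript
// "Good" thresholds from the table above, in ms (CLS is unitless).
const TARGETS = { LCP: 2500, FID: 100, CLS: 0.1, TTFB: 600 };

// Returns true when a measured value meets the target for that metric.
function meetsTarget(metric, value) {
  return value < TARGETS[metric];
}

// In the browser, LCP values can be collected with a PerformanceObserver:
//   new PerformanceObserver((list) => { /* inspect list.getEntries() */ })
//     .observe({ type: 'largest-contentful-paint', buffered: true });
```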
## Measuring Performance
Use the Performance API:
```javascript
// Measure model load time
performance.mark('model-load-start');
await loadModel();
performance.mark('model-load-end');

performance.measure('model-load', 'model-load-start', 'model-load-end');
const measure = performance.getEntriesByName('model-load')[0];
console.log('Model load time:', measure.duration);
```

## Device-Specific Tuning
### Mobile Devices
- Use smaller models
- Reduce batch sizes
- Implement aggressive caching
- Show clear loading states
### Low-End Devices

- Prefer WASM with tiny models
- Consider making degraded mode the default
- Minimise concurrent operations
### Desktop/Laptop
- Can use larger models
- Enable WebGPU when available
- Allow longer context lengths
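The three profiles above could be selected at startup from coarse device signals; a hedged sketch (`chooseProfile`, the signal shape, and the profile values are assumptions, not UPAS APIs):

```javascript
// Map coarse capability signals to a tuning profile.
// In the browser: hasWebGPU ~ !!navigator.gpu, memoryGB ~ navigator.deviceMemory.
function chooseProfile({ hasWebGPU, memoryGB, isMobile }) {
  if (memoryGB <= 2) {
    // Low-end: tiny model, short context, degraded-friendly defaults
    return { runtime: 'wasm', model: '0.5B', contextLength: 1024 };
  }
  if (isMobile) {
    // Mobile: small model, WebGPU when available
    return { runtime: hasWebGPU ? 'webgpu' : 'wasm', model: '1B', contextLength: 2048 };
  }
  // Desktop/laptop: larger model and longer context
  return { runtime: hasWebGPU ? 'webgpu' : 'wasm', model: '3B', contextLength: 4096 };
}
```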
## Next Steps
- Deployment — Production deployment
- Configuration — Runtime settings