Mozilla engineers have accelerated the Firefox AI Runtime by replacing its WebAssembly-based backend with a native C++ ONNX Runtime implementation.
This architectural shift has yielded performance gains of 2-10 times for on-device machine learning features, eliminating WASM warm-up overhead and using hardware-specific CPU instructions for faster model execution.
Mozilla explains the WASM bottleneck it faced for Firefox
The original architecture Mozilla used for Firefox AI features like Smart Tab Grouping and PDF.js alt-text was powered by Transformers.js, which uses onnxruntime-web—a WebAssembly (WASM) build of ONNX Runtime.
While functional, this approach presented several performance challenges:
- JS/WASM boundary overhead: A typical inference cycle involved multiple crossings between the JavaScript and WASM layers for pre-processing, model execution, and post-processing, introducing latency even with warm caches.
- Generic SIMD limitations: The key hotspot, matrix multiplication, was implemented in WASM using generic SIMD. This could not compete with the performance of hardware-specific intrinsics like NEON on Apple Silicon or AVX-512 on modern Intel chips (a simplified comparison follows this list).
- Warm-up tax: Each cold start of a feature incurred a JS/WASM warm-up penalty.
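To make the intrinsics point concrete, here is a hedged, simplified comparison in C++: a scalar dot-product kernel of the kind generic code reduces to, next to an AVX2/FMA version that processes eight floats per instruction. This is an illustration only; it is not Mozilla's code, and ONNX Runtime's real matrix-multiplication kernels (in its MLAS library) are considerably more elaborate.

```cpp
#include <immintrin.h>  // AVX2/FMA intrinsics (x86-64)
#include <cstddef>

// Portable scalar fallback, roughly what generic code reduces to when no
// target-specific instructions are available.
float dot_scalar(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

#if defined(__AVX2__) && defined(__FMA__)
// Native build: eight floats per iteration with fused multiply-add.
float dot_avx2(const float* a, const float* b, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        acc = _mm256_fmadd_ps(_mm256_loadu_ps(a + i),
                              _mm256_loadu_ps(b + i), acc);
    // Horizontal reduction of the eight lanes.
    __m128 lo  = _mm256_castps256_ps128(acc);
    __m128 hi  = _mm256_extractf128_ps(acc, 1);
    __m128 sum = _mm_add_ps(lo, hi);
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    float result = _mm_cvtss_f32(sum);
    for (; i < n; ++i) result += a[i] * b[i];  // scalar tail
    return result;
}
#endif
```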
The team had previously seen success with native code in Firefox Translations, which uses WASM built-ins to call into C++. However, porting the huge number of ONNX operators one by one in the same way was deemed “unmaintainable.”
Native C++ integration strategy
Mozilla opted for a full backend replacement for Firefox AI, made feasible by the “tiny surface” through which Transformers.js interacts with the ONNX Runtime.
The migration involved three main steps:
- Vendor ONNX Runtime C++ directly into the Firefox tree.
- Expose the C++ library to JavaScript via a thin WebIDL layer.
- Wire Transformers.js to the new native backend.
This approach ensured the change was completely transparent to the feature-level code, which still simply calls await pipeline(…).
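For a sense of what the native backend wraps, the following is a minimal sketch of direct ONNX Runtime C++ API usage: create a session from a model file and run one inference on a float tensor. The model path, tensor names, shapes, and thread count are invented for illustration, and the real integration sits behind Firefox's WebIDL layer and Transformers.js rather than being called like this from feature code.

```cpp
#include <onnxruntime_cxx_api.h>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Minimal sketch: load an ONNX model and run a single inference on a float
// tensor. Tensor names and shapes are hypothetical; the path signature shown
// is the non-Windows one (Windows builds take a wide-char path).
std::vector<float> run_once(const std::string& model_path,
                            std::vector<float>& input,
                            const std::vector<int64_t>& shape) {
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "sketch");
    Ort::SessionOptions opts;
    opts.SetIntraOpNumThreads(4);  // arbitrary thread count for the example
    Ort::Session session(env, model_path.c_str(), opts);

    Ort::MemoryInfo mem =
        Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    Ort::Value tensor = Ort::Value::CreateTensor<float>(
        mem, input.data(), input.size(), shape.data(), shape.size());

    const char* in_names[]  = {"input"};   // hypothetical tensor names
    const char* out_names[] = {"output"};
    auto results = session.Run(Ort::RunOptions{nullptr},
                               in_names, &tensor, 1, out_names, 1);

    const float* out = results[0].GetTensorData<float>();
    std::size_t count =
        results[0].GetTensorTypeAndShapeInfo().GetElementCount();
    return std::vector<float>(out, out + count);
}
```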
Despite the “vendoring” step, the ONNX Runtime source was not checked directly into the main repository, which would have bloated it and slowed down builds. Instead, a configuration flag allows a pre-compiled version of the library, built on Mozilla’s Taskcluster CI system, to be downloaded at build time.
Getting ONNX Runtime to build within Firefox’s constraints required some upstream patches to support compiling without exceptions and RTTI, and the build configuration was set to MinSizeRel with LTO to balance binary size and speed.
Quantifiable Firefox AI Runtime performance gains
Mozilla says the switch to native C++ yielded immediate benefits for Firefox AI features:
- PDF.js Alt-Text: The image-to-text model saw its latency fall from 3.5 seconds to just 350 ms on the same hardware.
- Smart Tab Grouping: For the topic model, cold start latency dropped from 1920.9 ms (WASM) to 532.2 ms (ONNX native). Warm inference time was reduced from 31.4 ms to 19.2 ms.

This new backend is being gradually rolled out, starting with Smart Tab Grouping in Firefox 142.
Mozilla’s future Firefox AI Runtime optimisation roadmap
With the C++ API now directly accessible, Mozilla is planning several further optimisations (illustrative sketches follow the list):
- Multi-threading DequantizeLinear: A patch has been developed to parallelise this frequently single-threaded operation across multiple cores, resulting in “an almost linear speedup.”
- Optimising matrix transposition: Naive nested for-loops are being replaced with a “multi-threaded cache-aware tiled transposition scheme” that uses SIMD, speeding up the operation by a supra-linear factor.
- Caching the compiled graph: For large models, compiling the model graph can take up to five seconds on every launch. Mozilla plans to cache the compiled graph to eliminate this start-up cost.
- GPU acceleration: The next major step is to integrate GPU-accelerated ONNX backends. This is a huge undertaking, as it “demands additional sandboxing to safely and securely interact with the underlying hardware.”
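The article does not include Mozilla’s DequantizeLinear patch, but the idea behind it can be sketched. Below is a hedged illustration of splitting the element range of a linear dequantisation, out[i] = (in[i] - zero_point) * scale, across worker threads; the real ONNX Runtime kernel uses its internal thread pool and also handles per-axis scales and zero points.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Sketch of multi-threading DequantizeLinear: the element range is divided
// into contiguous chunks, one per worker thread. Illustration only.
void dequantize_parallel(const uint8_t* in, float* out, std::size_t n,
                         float scale, uint8_t zero_point,
                         unsigned num_threads) {
    auto worker = [&](std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i < end; ++i)
            out[i] = static_cast<float>(static_cast<int>(in[i]) - zero_point) * scale;
    };
    std::vector<std::thread> pool;
    std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(n, begin + chunk);
        if (begin >= end) break;
        pool.emplace_back(worker, begin, end);
    }
    for (auto& th : pool) th.join();  // wait for all chunks to finish
}
```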
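Likewise, the tiled transposition item comes down to a classic cache-blocking idea. The sketch below shows a single-threaded tiled transpose of a row-major float matrix; the scheme Mozilla describes layers SIMD and multi-threading on top of the same blocking, and the tile size here is an arbitrary choice.

```cpp
#include <algorithm>
#include <cstddef>

// Cache-aware tiled transpose of a rows x cols row-major matrix. Working on
// small TILE x TILE blocks keeps both the reads and the strided writes within
// cache, which is where the win over naive nested loops comes from.
constexpr std::size_t TILE = 32;  // illustrative tile size

void transpose_tiled(const float* src, float* dst,
                     std::size_t rows, std::size_t cols) {
    for (std::size_t i0 = 0; i0 < rows; i0 += TILE) {
        for (std::size_t j0 = 0; j0 < cols; j0 += TILE) {
            std::size_t i_end = std::min(rows, i0 + TILE);
            std::size_t j_end = std::min(cols, j0 + TILE);
            for (std::size_t i = i0; i < i_end; ++i)
                for (std::size_t j = j0; j < j_end; ++j)
                    dst[j * rows + i] = src[i * cols + j];
        }
    }
}
```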
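On graph caching, ONNX Runtime already offers one mechanism along these lines: a session can write its optimised graph to disk so that later sessions load it instead of repeating the optimisation passes. Whether Mozilla’s planned cache will use this exact API or a custom approach is not stated; the paths below are placeholders and the string literals assume a non-Windows build.

```cpp
#include <onnxruntime_cxx_api.h>

// Illustration of ONNX Runtime's built-in way to persist an optimised graph:
// the first launch writes "model.optimized.onnx"; later launches could create
// their session from that file and skip most of the optimisation work.
Ort::Session create_session_writing_cache(Ort::Env& env) {
    Ort::SessionOptions opts;
    opts.SetGraphOptimizationLevel(GraphOptimizationLevel::ORT_ENABLE_ALL);
    opts.SetOptimizedModelFilePath("model.optimized.onnx");  // placeholder path
    return Ort::Session(env, "model.onnx", opts);            // placeholder path
}
```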
What is most notable about this migration is that the Mozilla team delivered such a large performance improvement gradually, and without touching the feature code itself. This architectural success not only makes current ML-based features more responsive and accessible to a wider audience, but also establishes a solid foundation for the more ambitious optimisations planned for the future.
(Photo by Rubaitul Azad)