
Efficient Edge AI on Arm CPUs and NPUs: Understanding ExecuTorch through Practical Labs

Arm's edge AI initiative gains momentum with ExecuTorch, a PyTorch extension for local inference on constrained devices. The framework leverages Arm CPUs and NPUs to accelerate AI workloads, delivering significant performance gains on edge devices. A set of practical labs developed by Arm provides a hands-on introduction to ExecuTorch's capabilities and its potential applications in IoT and industrial automation.

ExecuTorch, a PyTorch extension for running AI models locally on edge devices, now has a set of hands-on Jupyter labs created by Arm. The labs walk through deploying models on Arm CPUs and Ethos-U NPUs, covering both the practical steps and the underlying reasoning.

Overview

ExecuTorch extends the PyTorch ecosystem to deliver local AI inference on constrained edge devices. It takes a PyTorch model, exports it into a lightweight .pte file containing both the model weights and a static computation graph, and runs it through a runtime built specifically for edge inference. This removes the need for Python at runtime and avoids dynamic execution overhead that is unnecessary for inference.

Arm has created a collection of Jupyter labs that complement the official ExecuTorch documentation. The labs explain both the how and the why of each step, covering CPU and NPU inference across Cortex-A and Cortex-M + Ethos-U platforms. They also showcase the use of Model Explorer adapters, developed by Arm, to gain visibility into model deployment with ExecuTorch.

What ExecuTorch Does

Under the hood, ExecuTorch exports the PyTorch model into a minimal .pte artefact holding the weights and a static computation graph, removing the need for a Python interpreter at runtime and avoiding the dynamic-execution overhead that is unnecessary for inference. The export step is followed by lowering, where the model graph is transformed into a backend-compatible form. This is where hardware-aware optimization begins.
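
As a rough illustration, the basic export flow looks like the sketch below. It is a minimal example using the public ExecuTorch Python APIs; exact module paths and method names may differ between releases.

```python
import torch
from executorch.exir import to_edge

# Stand-in model; any torch.nn.Module that torch.export can capture works.
class TinyModel(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x)

model = TinyModel().eval()
example_inputs = (torch.randn(1, 16),)

# 1. Capture a static graph with torch.export.
exported_program = torch.export.export(model, example_inputs)

# 2. Lower to the ExecuTorch edge dialect.
edge_program = to_edge(exported_program)

# 3. Package weights and graph into a .pte artefact for the edge runtime.
executorch_program = edge_program.to_executorch()
with open("tiny_model.pte", "wb") as f:
    f.write(executorch_program.buffer)
```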

The resulting artefact is lightweight, portable, predictable in execution, and suitable for deployment on constrained systems.

CPU Inference: Raspberry Pi 5

Even on devices like the Raspberry Pi 5, which can run PyTorch models without ExecuTorch, there are performance gains to be had, because performance depends heavily on how the model is executed. ExecuTorch improves it by delegating parts of the model to optimized backends. On Arm CPUs, this is typically done with the XNNPACK backend: when enabled, supported operators such as convolutions and matrix multiplications are delegated to highly optimized implementations. On Arm platforms, these implementations leverage KleidiAI microkernels, which make efficient use of architectural features such as Neon.
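
The sketch below shows what enabling XNNPACK delegation can look like at export time, assuming the partitioner import path currently shown in the ExecuTorch documentation:

```python
import torch
from executorch.exir import to_edge_transform_and_lower
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner

# Small stand-in model with operators XNNPACK typically supports.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 8, kernel_size=3),
    torch.nn.ReLU(),
).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

exported_program = torch.export.export(model, example_inputs)

# Delegate supported operators (convolutions, matmuls, ...) to XNNPACK;
# anything unsupported falls back to the portable CPU operators.
edge_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[XnnpackPartitioner()],
)

with open("model_xnnpack.pte", "wb") as f:
    f.write(edge_program.to_executorch().buffer)
```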

In the labs, Arm compares inference of an OPT-125M transformer model on a Raspberry Pi 5. The results show a significant latency reduction when using ExecuTorch with XNNPACK. It's important to note that backend delegation doesn't occur by default. Running ExecuTorch without XNNPACK will often result in higher latency compared to PyTorch (which has its own KleidiAI optimizations), though you still benefit from a reduced runtime footprint and improved portability.

NPU Inference: Ethos-U and TOSA

To go further, the labs target hardware acceleration using Arm Ethos-U NPUs, typically paired with Cortex-A or Cortex-M CPUs. Execution becomes heterogeneous. Rather than running the entire model on one processor, ExecuTorch partitions the graph: supported subgraphs are delegated to the NPU, and unsupported operators fall back to the CPU.

Ethos-U operates on quantized integer models (typically INT8), so models must be quantized before delegation. The first step is to create a backend-specific quantizer using EthosUQuantizer, together with a compile_spec that matches the target Ethos-U configuration. In the labs, the target is an Ethos-U85 with 256 multiply-accumulate (MAC) units.
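
Quantization here follows PyTorch's PT2E (prepare/convert) flow. The sketch below is illustrative only: EthosUQuantizer and the compile_spec come from the labs, while the import paths, the ArmCompileSpecBuilder helper, the "ethos-u85-256" target string, and get_symmetric_quantization_config are assumptions based on the ExecuTorch Arm backend and may differ between releases.

```python
import torch
from torch.ao.quantization.quantize_pt2e import convert_pt2e, prepare_pt2e

# Assumed import paths; they have moved between ExecuTorch releases.
from executorch.backends.arm.arm_backend import ArmCompileSpecBuilder
from executorch.backends.arm.quantizer.arm_quantizer import (
    EthosUQuantizer,
    get_symmetric_quantization_config,
)

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).eval()
example_inputs = (torch.randn(1, 3, 32, 32),)

# Compile spec describing the target NPU: an Ethos-U85 with 256 MACs.
# Target string and builder API are assumptions; a real target may also need
# system-config and memory-mode options.
compile_spec = (
    ArmCompileSpecBuilder()
    .ethosu_compile_spec("ethos-u85-256")
    .build()
)

# Backend-specific quantizer configured for symmetric INT8 quantization.
quantizer = EthosUQuantizer(compile_spec)
quantizer.set_global(get_symmetric_quantization_config())

# Standard PT2E flow: capture, insert observers, calibrate, convert.
graph_module = torch.export.export_for_training(model, example_inputs).module()
prepared = prepare_pt2e(graph_module, quantizer)
prepared(*example_inputs)  # calibration pass with representative data
quantized_module = convert_pt2e(prepared)
```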

The next step involves lowering the model into TOSA (Tensor Operator Set Architecture), an intermediate representation designed to bridge high-level frameworks and hardware backends. TOSA provides a stable, hardware-agnostic operator set. Instead of requiring each hardware vendor to support every framework-specific operator, models are lowered into TOSA, and hardware backends implement this smaller, standardized set.

This step uses the to_edge_transform_and_lower API with the EthosUPartitioner. For Ethos-U, this triggers the backend path that serializes the delegated subgraphs to TOSA and runs Vela to produce an optimized command stream for execution on the NPU. Finally, .to_executorch(...) packages the result into a .pte file.
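
Continuing from the quantized module above, here is a hedged sketch of the lowering and packaging step; the EthosUPartitioner import path and constructor signature are assumptions and may differ between releases.

```python
import torch
from executorch.exir import to_edge_transform_and_lower

# Assumed import path for the Arm Ethos-U partitioner.
from executorch.backends.arm.ethosu_partitioner import EthosUPartitioner

# `quantized_module`, `example_inputs`, and `compile_spec` come from the
# quantization step above.
exported_program = torch.export.export(quantized_module, example_inputs)

# Lower to the edge dialect and delegate supported subgraphs to Ethos-U.
# Internally the backend serializes those subgraphs to TOSA and runs Vela
# to produce the NPU command stream; unsupported operators stay on the CPU.
edge_program = to_edge_transform_and_lower(
    exported_program,
    partitioner=[EthosUPartitioner(compile_spec)],
)

# Package everything into a deployable .pte file.
executorch_program = edge_program.to_executorch()
with open("model_ethos_u85.pte", "wb") as f:
    f.write(executorch_program.buffer)
```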

Visualizing Model Deployment

To make the partitioning visible, the labs use Google's Model Explorer along with adapters developed by Arm. These tools let you inspect the ExecuTorch graph (.pte), visualize how it is partitioned across backends, and examine the TOSA representation (.tosa).

For example, the labs compare two .pte files targeting the same Ethos-U configuration but generated from slightly different models. A regular MobileNetV2 contains only supported operators, so the entire compute region is delegated as a single, continuous Ethos-U subgraph. A MobileNetV2 variant with an additional LRN layer behaves differently: the LRN operator is decomposed into lower-level operations during lowering, not all of which can be delegated, so the graph is partitioned into multiple segments. Supported regions run on the NPU, while the unsupported portion falls back to the CPU.

This level of visibility helps explain performance behavior and can guide optimization decisions.

Practical Next Steps

The Jupyter labs are designed so you can run and modify the code on your own hardware. The collection includes contributions from Professor Marcelo Rovai (UNIFEI University and a member of the Edge AI Foundation Academia-Industry Partnership) and academic reviewers at IIIT Bangalore.

Building models is only half the story—getting them running efficiently at the edge is what matters. ExecuTorch makes that possible, and these labs show you how to get started quickly while understanding the underlying concepts.

A novel multi-agent system leveraging heterogeneous computing on AMD's MI300X GPU accelerators is poised to revolutionize CNC manufacturability by integrating real-time machine learning, computer vision, and process control. By harnessing the MI300X's 2,048 AMD Zen 4 CPU cores and 1,536 AMD RDNA 3 GPU cores, the system achieves unprecedented throughput and precision in complex part fabrication. This breakthrough has significant implications for high-speed, high-precision manufacturing.