What's the deal with NPUs?

Introduction

From large organizations utilizing tens of thousands of graphics cards working in tandem to produce new pre-trained models, to hobbyists at home using generative AI with off-the-shelf hardware, the modern graphics card has emerged as the engine powering the surge of AI tools and applications. Although GPUs are well-suited for these tasks for a multitude of reasons, they are not a “one size fits all” solution to the exploding variety of workloads built upon neural networks. In particular, due to their energy and cooling requirements, powerful GPUs are not feasible for lower-power devices such as smartphones, tablets, and notebooks. On the other hand, while CPUs are undoubtedly capable of performing the calculations required for machine learning workloads, their performance is typically inadequate for all but the smallest of models.

As a result, demand has grown for specialized hardware to execute these workloads, leading to the inclusion of AI accelerators in a variety of computing devices. The industry seems to have settled on the term “Neural Processing Unit” or “NPU” to describe these accelerators. Put simply, an NPU is dedicated hardware designed to quickly and efficiently perform the kinds of calculations at the heart of today’s AI workloads, most notably the multiply-accumulate (MAC) operation, which multiplies two values and adds the product to a running total.
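To make the MAC operation concrete, here is a minimal Python sketch of a dot product built from repeated multiply-accumulate steps, which is the basic building block of a neural network layer. The function names `mac` and `dot` are illustrative only and do not come from any particular NPU toolkit; real hardware performs many of these steps in parallel on low-precision integers.

```python
def mac(acc, a, b):
    """One multiply-accumulate step: add the product a * b to the accumulator."""
    return acc + a * b


def dot(weights, inputs):
    """Dot product of two equal-length sequences, computed as a chain of MACs."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    acc = 0
    for w, x in zip(weights, inputs):
        acc = mac(acc, w, x)
    return acc


# Example: 2*4 + (-1)*5 + 3*(-6) = 8 - 5 - 18
print(dot([2, -1, 3], [4, 5, -6]))  # prints -15
```

A single neuron's output is essentially one such dot product, so a model with millions of weights requires millions of MACs per inference, which is why hardware built around this one operation pays off.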

NPUs, the hot “new” thing?

Despite the recent buzz around NPUs, hardware manufacturers have been integrating these types of accelerators into their products for years. Qualcomm’s “Hexagon NPU” was iterated upon and rebranded from their “Hexagon DSP” (Digital Signal Processor), which has been part of the Snapdragon architecture since it was announced back in 2006. Most examples of NPUs and similar hardware are found within the mobile ecosystem, where there is a strong incentive to offload applicable workloads to specialized co-processors to preserve limited battery life, but there are exceptions. Intel introduced their “Gaussian & Neural Accelerator” or “GNA” with the launch of their mobile 10th-generation CPUs, and since the 11th generation, both their mobile and desktop “Core” CPU lines have retained this feature.

Yet for devices that do not rely on a battery, such as a desktop workstation, the incentive to include an NPU is greatly diminished since most of these devices have other hardware (such as GPUs) that can do the job. On top of that, most consumer applications that are candidates for acceleration by an NPU have been developed for mobile devices rather than desktops or laptops. This has led to a “chicken and egg” problem for manufacturers because, without robust software support or potential gains in battery life, there hasn’t been much incentive to devote resources to improving support for AI workloads outside of smartphones and tablets.

Microsoft to the rescue?

In May 2024, Microsoft announced the “Copilot+ PC” initiative to support the introduction of locally-run AI tools they are currently implementing within Windows. Not to be confused with Microsoft Copilot itself, the cloud-based AI assistant introduced as a replacement for Cortana, “Copilot+ PC” is a new category of PCs defined by Microsoft. The main requirement for a system to qualify as a “Copilot+ PC” is that it contain an NPU that achieves at least 40 TOPS (trillion operations per second, measured at INT8 precision). In addition to defining this new category, Microsoft is also introducing several new features to Windows that take advantage of this added compute capability. As a result of these efforts, we are seeing a wave of hardware being released that aims to meet this new 40 TOPS standard.
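For a sense of where a TOPS figure comes from, the sketch below estimates a theoretical peak from the number of MAC units and the clock speed. It assumes the common convention of counting each MAC as two operations (one multiply and one add). The unit count and clock speed in the example are hypothetical and do not describe any real NPU, and the function names are illustrative.

```python
COPILOT_PLUS_MIN_TOPS = 40  # threshold cited in the article


def peak_tops(mac_units, clock_hz, ops_per_mac=2):
    """Theoretical peak in trillions of operations per second.

    Assumes every MAC unit completes one MAC per clock cycle and that
    each MAC counts as `ops_per_mac` operations (multiply + add).
    """
    return mac_units * ops_per_mac * clock_hz / 1e12


def meets_copilot_plus(tops, threshold=COPILOT_PLUS_MIN_TOPS):
    """True if a TOPS figure reaches the 40 TOPS bar."""
    return tops >= threshold


# Hypothetical NPU: 16,384 INT8 MAC units running at 1.4 GHz.
tops = peak_tops(mac_units=16_384, clock_hz=1.4e9)
print(f"{tops:.1f} TOPS, qualifies: {meets_copilot_plus(tops)}")
```

Keep in mind that this is a theoretical ceiling. Sustained throughput in real applications depends on memory bandwidth, data types, and how well a given model maps onto the hardware.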

Despite this push from Microsoft, manufacturers still don’t appear to have much appetite to beef up the NPUs in CPUs intended for desktop workstations. For example, although all of Intel’s Lunar Lake (mobile) processors are advertised as achieving a minimum of 40 TOPS, their upcoming Arrow Lake (desktop) processors retain the same NPU from the older Meteor Lake line, with a modest 13 TOPS. AMD has released some mobile processors with NPUs advertised to reach up to 50 TOPS, even going so far as to brand them as “AMD Ryzen AI” processors. However, like Intel, AMD offers only a handful of desktop CPUs featuring NPUs, and they provide a similar level of performance to the Intel desktop offerings, with a maximum of 16 TOPS. For now, this means that Intel and AMD don’t appear to be aiming for the “Copilot+ PC” certification with their desktop CPU offerings.

That said, it does appear that software support for NPUs is gaining traction, and we may soon see rapid expansion of NPU support by developers looking to attract “Copilot+ PC” owners. For example, Blackmagic Design’s “Neural Engine” tools found in DaVinci Resolve 19 now support the NPU in Qualcomm’s new “Snapdragon X” series processors. Adobe is also actively working on expanding support for these processors, but that work targets the ARM architecture in general rather than NPU support specifically.

[Figure: Chart showing that all SKUs of Intel Core Ultra 200S Series desktop CPUs feature 13 TOPS NPUs]

Why not include NPUs in desktop systems?

First, although energy efficiency is a worthwhile goal for any computer hardware, it is simply not valued as highly on systems that don’t rely on battery power, particularly on powerful workstations where performance supersedes all or most other concerns. Second, it’s much harder to justify spending valuable chip real estate on an NPU in a workstation CPU when most users will already have access to a GPU that provides far more than 40 TOPS of performance. For example, NVIDIA’s documentation states that the GeForce RTX 4090 can reach a peak of 660.6 INT8 TOPS, and even the now six-year-old NVIDIA RTX 2080 Ti offers nearly 230 TOPS!
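To put those numbers side by side, this short sketch compares the TOPS figures quoted in this article against the 40 TOPS “Copilot+ PC” bar. The values are the advertised peak figures cited above (the RTX 2080 Ti entry is rounded to 230), so treat the ratios as a rough illustration rather than a benchmark.

```python
COPILOT_PLUS_MIN_TOPS = 40

# Advertised peak TOPS figures as quoted in this article.
quoted_tops = {
    "Intel Arrow Lake desktop NPU": 13,
    "AMD desktop NPU (maximum)": 16,
    "AMD Ryzen AI mobile NPU (up to)": 50,
    "NVIDIA RTX 2080 Ti (approx.)": 230,
    "NVIDIA GeForce RTX 4090": 660.6,
}


def ratio_to_threshold(tops, threshold=COPILOT_PLUS_MIN_TOPS):
    """How many times the 40 TOPS requirement a given figure represents."""
    return tops / threshold


for name, tops in quoted_tops.items():
    print(f"{name}: {tops} TOPS ({ratio_to_threshold(tops):.2f}x the 40 TOPS bar)")
```

By these figures, the desktop NPUs deliver well under half of the requirement, while the RTX 4090 exceeds it more than sixteen times over.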

Conclusion

In the near future, we should not expect NPUs to drive much change in the desktop workstation ecosystem. The battery-life benefits of more efficient computation simply do not translate to desktops, and dedicated GPUs offer many times the raw performance of integrated NPUs. As time passes and more applications are developed that can take advantage of NPU hardware, we will continue to see NPUs integrated into more devices, most notably laptops in the near term.