NPUs vs. TPUs: How On-Device AI Is Supercharging Your Gadgets in 2025

August 9, 2025

In a nutshell: Your smartphone, camera, and even your car are getting AI brains built-in – no cloud required. Special chips called NPUs (Neural Processing Units) and TPUs (Tensor Processing Units) are transforming everyday devices into intelligent assistants capable of face recognition, voice commands, real-time translation, autonomous driving features and more. This on-device AI revolution promises lightning-fast responses, better privacy, and new features we once thought only possible with supercomputers. In this report, we’ll demystify NPUs and TPUs, see how they differ from CPUs/GPUs, and explore why tech giants like Apple, Google, Qualcomm, and Intel are racing to put these “AI brains” into everything from phones to cars. We’ll also highlight the latest 2024–2025 breakthroughs, expert insights, industry standards, and what the future holds for on-device AI.

What Are NPUs and TPUs? (Meet Your Device’s AI Brain)

Neural Processing Units (NPUs) are specialized processors designed to accelerate artificial neural networks – the algorithms that power modern AI tasks like image recognition, speech processing, and more. Unlike general-purpose CPUs, NPUs are application-specific integrated circuits (ASICs) tuned for matrix math and the heavy parallel workloads of neural networks techtarget.com. An NPU “mimics the neural networks of a human brain to accelerate AI tasks,” essentially acting as a silicon brain inside your device techtarget.com. NPUs excel at running inference (making predictions) for AI models efficiently on-device, often using lower numerical precision (e.g. 8-bit integers) to save power while still delivering high performance backblaze.com. The term “NPU” is sometimes used broadly for any AI accelerator, but it more commonly refers to those in mobile and edge devices backblaze.com. For instance, Apple’s “Neural Engine” in iPhones and Samsung’s mobile AI engine are NPUs integrated into their system-on-chip (SoC) designs.

Tensor Processing Units (TPUs), on the other hand, were created by Google as custom chips for accelerating machine learning, especially for the TensorFlow framework. A TPU is a type of ASIC optimized for the tensor operations (matrix multiplications, etc.) at the heart of neural network training and inference backblaze.com. Google first deployed TPUs in its data centers in 2015 to speed up neural network computations, and later made them available via Google Cloud backblaze.com. TPUs use a distinct architecture called a systolic array, which links many small processing units in a grid that pumps data through a chain of matrix multiply units backblaze.com. This design achieves extreme throughput on deep learning tasks. Google’s TPUs deliberately trade off some precision (using 8-bit or 16-bit math instead of 32-bit floats) for massive speed and efficiency gains backblaze.com, since many AI tasks don’t require high precision to get accurate results. While “TPU” technically refers to Google’s chips, the term is sometimes used more generically for any “tensor” accelerator. Notably, Google also produces Edge TPU co-processors for on-device AI in products like the Coral Dev Board, delivering 4 trillion operations per second while drawing only a few watts coral.ai.

In short: NPUs and TPUs are both silicon accelerators for AI, but NPUs are commonly built into mobile/edge devices for efficient on-device inference, whereas TPUs (in the strict sense) have been high-performance chips (and now modules) primarily from Google, originally for cloud/datacenter training and inference tasks. Both depart from traditional CPU/GPU designs to prioritize parallel math operations for neural networks. As one tech editor put it, “TPUs take specialization further, focusing on tensor operations to achieve higher speeds and energy efficiencies… NPUs are prevalent in AI-enabled devices like smartphones and IoT gadgets” backblaze.com.

How Are NPUs and TPUs Different from CPUs and GPUs?

Traditional CPUs (central processing units) are the “brains” of general computing – optimized for flexibility to handle all sorts of tasks, from running your operating system to browsing the web. They have a few powerful cores that excel at sequential logic and varied instructions, but they are not great at the highly parallel math crunching needed for deep learning techtarget.com. When a CPU is asked to process a large neural network, it often becomes a bottleneck, trying to execute millions of multiplications and additions in sequence or limited parallel batches. This leads to high latency and power consumption (the so-called Von Neumann bottleneck of shuttling lots of data between CPU and memory) backblaze.com. CPUs can do some AI work (especially simpler or smaller models, or control logic for AI programs techtarget.com), but as a rule, they struggle to scale efficiently to modern AI’s demands for massively parallel linear algebra.

GPUs (graphics processing units) brought parallel computing to the forefront. Originally made for rendering images by performing many simple operations in parallel on pixels and vertices, GPUs turned out to be well-suited for training neural networks, which also involve applying the same math operations (dot products, etc.) across lots of data simultaneously techtarget.com. A GPU contains hundreds or thousands of small cores that can perform math in parallel. This makes GPUs excellent for large-scale AI, and throughout the 2010s GPUs (especially NVIDIA’s with CUDA software) became the workhorse of deep learning research. However, GPUs are still somewhat general – they have to handle various graphics tasks and maintain flexibility, so they aren’t 100% optimized for neural nets. They also consume a lot of power and require careful programming to fully utilize (they dislike complex branching code and thrive on streamlined, data-parallel tasks) techtarget.com.

NPUs and TPUs take the specialization even further. They are purpose-built for just the neural net workload. This means their architecture can strip out anything not needed for AI math and devote more silicon to things like matrix multiply units, accumulation adders, and on-chip memory for rapidly shuttling data in and out of those math units. A Google Cloud TPU, for example, is essentially a giant 2D array of MAC (multiply-accumulate) units with a clever dataflow architecture (the systolic array) feeding them with operands at high speed backblaze.com. It doesn’t bother with caches, speculative execution, or other CPU features – it’s streamlined for matrix math. NPUs in mobile chips similarly integrate dedicated neural engine cores alongside the CPU/GPU. These cores often use low-precision arithmetic (e.g. 8-bit integers like TPUs) and run highly parallel “layer by layer” computations for things like convolutional neural networks. An NPU might use a “fused” architecture combining scalar, vector, and tensor units (Qualcomm’s Hexagon NPU does this) to handle different neural network operations efficiently futurumgroup.com.
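
To make this dataflow idea concrete, here is a purely illustrative Python/NumPy sketch of the arithmetic a grid of MAC cells performs: 8-bit operands are streamed in step by step and accumulated into 32-bit registers. Real TPUs/NPUs do this in fixed-function silicon rather than software, and the function name and array sizes below are invented for the example.

```python
import numpy as np

# Toy model of a MAC (multiply-accumulate) array: 8-bit operands, 32-bit
# accumulators. A real TPU/NPU wires thousands of MAC cells into a grid and
# streams operands through them in hardware; this loop only shows the
# arithmetic each cell performs at every step.

def mac_array_matmul(a_int8: np.ndarray, b_int8: np.ndarray) -> np.ndarray:
    m, k = a_int8.shape
    k2, n = b_int8.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((m, n), dtype=np.int32)   # on-chip accumulators
    for step in range(k):                    # operands stream in step by step
        # every (i, j) cell does one multiply-accumulate per step:
        #   acc[i, j] += a[i, step] * b[step, j]
        acc += a_int8[:, step, None].astype(np.int32) * b_int8[None, step, :].astype(np.int32)
    return acc

rng = np.random.default_rng(0)
a = rng.integers(-128, 128, size=(4, 8), dtype=np.int8)
b = rng.integers(-128, 128, size=(8, 3), dtype=np.int8)

# Matches an ordinary matrix multiply done at higher precision.
assert np.array_equal(mac_array_matmul(a, b), a.astype(np.int32) @ b.astype(np.int32))
```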

The key differences come down to:

  • Instruction set and flexibility: CPUs have a broad, general instruction set (they can run almost any kind of code, just not masses of it in parallel). GPUs have a more limited but still flexible instruction set optimized for throughput on math. NPUs/TPUs have a very narrow instruction set – essentially just the operations needed for neural nets (matrix multiplication, convolution, activation functions), often implemented as fixed pipelines or arrays fuse.wikichip.org. For example, Tesla’s self-driving NPU has only 8 instructions in its ISA, focused on DMA reads/writes and dot products fuse.wikichip.org.
  • Parallelism and cores: CPUs = a few powerful cores; GPUs = thousands of simple cores; NPUs/TPUs = in a sense, tens of thousands of very simple ALUs (the MAC units) arranged in a grid. A single NPU chip might perform tens of trillions of operations per second – Tesla’s car NPU runs at 2 GHz with 9,216 MACs, achieving ~37 tera-operations per second (TOPS) per core (9,216 MACs × 2 ops per multiply-accumulate × 2 GHz ≈ 37 trillion ops/sec), and each FSD chip has two NPUs for ~74 TOPS fuse.wikichip.org, ts2.tech. By contrast, a high-end CPU might reach only a few hundred billion ops/sec on AI tasks, and a GPU maybe a few TOPS if it isn’t using special tensor cores.
  • Memory architecture: NPUs/TPUs rely on fast on-chip memory and streaming of data. TPUs avoid the classic memory bottleneck by using systolic dataflow – each small unit passes data to the next in lockstep, minimizing reads/writes to main memory backblaze.com. Many NPUs include a large chunk of on-chip SRAM for weights/activations (e.g. Tesla’s NPU cores have 32 MB SRAM each to hold neural network data locally) semianalysis.com. This contrasts with GPUs/CPUs, which use external DRAM heavily.
  • Precision: CPUs/GPUs usually use 32-bit or 64-bit floating point for compute. AI accelerators often use 16-bit or 8-bit integers (and some now explore 4-bit or even 2-bit) because neural nets tolerate lower precision. Google’s TPU designers explicitly noted you don’t need full float precision for inference, analogous to “you don’t need to know exactly how many raindrops are falling to know it’s raining heavily” backblaze.com. This allows NPUs/TPUs to do more operations in parallel and use less energy per operation. (A minimal quantization sketch follows this list.)
  • Use cases: GPUs are still widely used for training large models and for flexible computing (and they are common in data centers and high-end PCs). TPUs (cloud) aim at large-scale training and inference in Google’s ecosystem. NPUs are more often found in edge devices – smartphones, cameras, appliances – doing inference on already-trained models. They shine in tasks like applying a vision model to a camera frame in real time, or running a voice assistant’s wake-word detection continuously at low power. As TechTarget noted: “GPUs are chosen for availability and cost-effectiveness in many ML projects; TPUs are usually faster and less precise, used by businesses on Google Cloud; NPUs are commonly found in edge/mobile devices for significantly faster local processing” techtarget.com.
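
As a concrete illustration of the precision bullet above, here is a minimal Python/NumPy sketch of symmetric 8-bit quantization: each tensor is mapped to int8 with a single scale, the multiply-accumulates run entirely in integer arithmetic (the work an NPU’s MAC units do), and one floating-point rescale recovers an approximation of the full-precision result. Production toolchains are more sophisticated (per-channel scales, calibration, quantization-aware training); this only shows why 8-bit math is usually “close enough.”

```python
import numpy as np

# Symmetric per-tensor int8 quantization, illustrative only.
def quantize_int8(x: np.ndarray):
    scale = np.max(np.abs(x)) / 127.0                     # one scale per tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)    # a layer's weights
activations = rng.normal(size=(64,)).astype(np.float32)   # one input vector

qw, w_scale = quantize_int8(weights)
qa, a_scale = quantize_int8(activations)

# Integer multiply-accumulate (the work the accelerator's MAC array does)...
int_result = qw.astype(np.int32) @ qa.astype(np.int32)
# ...then a single floating-point rescale to get back real-valued outputs.
approx = int_result * (w_scale * a_scale)
exact = weights @ activations

print(np.max(np.abs(approx - exact)))   # small compared with the outputs themselves
```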

In summary, CPUs = versatile organizers, GPUs = parallel workhorses, TPUs/NPUs = neural net specialists. All can cooperate – in fact, in a modern AI-enabled device, the CPU often coordinates tasks and offloads the math-heavy parts to the NPU/GPU as needed techtarget.com. This specialization trend exists because one size no longer fits all in computing: as one editor quipped, “adding millions more transistors for every need wasn’t good for efficiency… designers embraced purpose-built processors” techtarget.com. Purpose-built NPUs and TPUs drastically speed up AI computations while keeping power low – a critical balance for battery-powered devices and high-density servers alike.

Why On-Device AI? (Edge vs. Cloud)

Why bother running AI on your phone or car at all – why not just send everything to the cloud where giant servers (with GPUs/TPUs) can do the heavy lifting? There are several compelling reasons driving the shift to on-device AI, and they boil down to speed, privacy, cost, and reliability nimbleedge.com:

  • Instant Response (Low Latency): An on-device NPU can process data in real time without the round-trip delay of sending data to a cloud server. This is crucial for interactive or safety-critical AI tasks. For example, a car’s autonomous driving system using onboard NPUs can identify a pedestrian and hit the brakes immediately, within milliseconds, rather than waiting for cloud computation. A smart camera with an NPU can detect an intruder the moment they appear in frame. On your phone, on-device AI means your voice assistant can respond faster and more naturally because it’s not constantly “phoning home.” Reduced latency enables true real-time decision-making and a smoother user experience nimbleedge.com.
  • Privacy and Data Security: On-device AI keeps your data local. Instead of streaming your microphone audio or camera feed to the cloud for analysis, the processing happens within the device. This greatly reduces exposure of sensitive data. For instance, modern smartphones perform face recognition (Face ID, etc.) entirely on-device – your face’s biometric map never leaves the phone’s secure enclave. Similarly, an AI hearing aid or health wearable can analyze biometric data without uploading it to any server, preserving privacy. Given growing user concerns and regulations about data sovereignty, this is a major advantage. As one edge AI blog put it, on-device processing means “user data does not need to be transmitted to the cloud,” providing a baseline privacy benefit nimbleedge.com. (Of course, privacy isn’t automatic – developers still must handle stored data carefully – but it’s easier to trust devices that aren’t constantly sending your info out.) Tech CEOs often emphasize this angle. Qualcomm’s CEO Cristiano Amon noted that combining cloud and on-device intelligence can enhance personalization while keeping data secure on the device – he calls it a “hybrid future” where on-device AI collaborates with cloud AI for the best of both moomoo.com.
  • Offline Availability & Reliability: Devices with NPUs/TPUs don’t depend on connectivity. They can work in a subway tunnel, on an airplane, in remote rural areas, or during network outages. This is huge for reliability. A voice dictation feature on-device will still work with no signal. A drone with on-board vision AI can avoid obstacles even off-grid. This independence is also critical for mission-critical systems: e.g. disaster recovery robots or medical devices that cannot assume a live internet connection. “Offline functionality” is a core advantage of on-device AI nimbleedge.com – it ensures the AI feature is available whenever and wherever needed.
  • Cost Efficiency at Scale: Constantly sending raw data to the cloud for AI processing can be very costly (cloud compute isn’t free) and bandwidth-intensive. As AI features proliferate, companies would have to foot enormous cloud processing bills if every little task hit a server. By doing more on the edge, they reduce cloud server loads and network usage. It’s often more efficient to spend a few extra dollars on a better chip in the device than to pay for gigabytes of cloud computing over the device’s lifetime. A Futurum industry analysis noted that on-device processing helps address generative AI’s scaling and cost issues – it “spreads out” the load so data centers aren’t overwhelmed (and users/developers aren’t paying through the nose for cloud GPU time) futurumgroup.com.
  • Personalization & Context: An emerging reason: on-device AI can learn from and adapt to local context in a way cloud AI might not. Your smartphone can maintain a tiny local model that learns your texting style for better autocorrect, without sharing that personal language model with the cloud. Devices can fuse data from multiple sensors in real time (something easier to do locally than streaming a bunch of sensor feeds to the cloud). This can enable a more personalized and context-aware experience. Techniques like federated learning even allow devices to improve AI models collaboratively without uploading raw data, only sending back small weight updates (see the brief sketch after this list).
  • Regulatory and Data Sovereignty: Laws like Europe’s GDPR and various data localization requirements increasingly mandate that certain data (especially personal or sensitive data) not be sent offshore or to third parties without consent. On-device AI offers a way to comply by processing data at the source. For example, medical imaging AI tools can run on hospital hardware (edge servers with NPUs) so patient data never leaves the premises, addressing privacy regulations. NimbleEdge’s 2025 report points out that governments are pushing for more local inference for sovereignty and compliance reasons nimbleedge.com.
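
To illustrate the federated-learning idea mentioned in the personalization bullet above, here is a minimal Python/NumPy sketch of federated averaging: each simulated device takes one gradient step on its own private data and sends back only the resulting weight update, which a server averages into the shared model. The tiny linear model, the synthetic data, and the single-step update rule are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
true_weights = np.array([1.0, -2.0, 0.5])   # the pattern hidden in every device's data
global_weights = np.zeros(3)                # shared model, starts untrained

def local_update(weights, x, y, lr=0.1):
    """One gradient step on this device's data; only the weight delta leaves the device."""
    grad = 2 * x.T @ (x @ weights - y) / len(y)
    return -lr * grad

for round_num in range(20):                 # communication rounds
    deltas = []
    for device in range(10):                # 10 simulated devices
        x = rng.normal(size=(20, 3))        # private local data, never uploaded
        y = x @ true_weights + 0.1 * rng.normal(size=20)
        deltas.append(local_update(global_weights, x, y))
    global_weights += np.mean(deltas, axis=0)   # server averages only the updates

print(global_weights)   # converges toward [1.0, -2.0, 0.5] without raw data leaving any device
```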

All these factors are driving a paradigm shift: instead of thinking “cloud-first” for AI, companies now design AI features “device-first” when possible. As Qualcomm’s AI VP, Durga Malladi, summed up: “To effectively scale generative AI into the mainstream, AI will need to run on both the cloud and devices at the edge… such as smartphones, laptops, vehicles, and IoT devices” iconnect007.com. We’re moving to a hybrid AI world where heavy training and big models might live in the cloud, but many inference tasks and personal AI experiences run locally on the NPUs/TPUs in your hands and homes. In fact, Amon calls it a “turning point in AI” – on-device inference with no latency, where “the future of AI is personal” because it runs right where you are x.com.

On-Device AI in Action: From Smartphones to Self-Driving Cars

Specialized AI chips are already embedded in a wide range of devices around you, often invisibly making them smarter. Here are some major arenas where NPUs and edge TPUs are deployed:

  • Smartphones & Tablets: Nearly all modern flagship phones (and even many mid-range ones) now include an NPU or dedicated AI engine. Apple kicked off the trend in 2017 with the Apple Neural Engine in the iPhone’s A11 chip, enabling on-device Face ID and Animoji by performing up to 600 billion ops/sec apple.fandom.com. Today, Apple’s A17 Pro chip (2023) packs a 16-core Neural Engine capable of 35 trillion operations per second apple.fandom.com. This powers features like advanced camera scene detection, photo styles, Siri voice commands processed offline, autocorrect, live transcription, and even running transformer models for translation on-device. Google’s Pixel phones likewise have custom silicon (“Google Tensor” SoCs) with NPUs: the latest Tensor G3 in Pixel 8 was “custom-designed to run Google’s AI models”, upgrading every part of the chip (CPU, GPU, ISP) to pave the way for on-device generative AI blog.google. Pixel 8 can run Google’s cutting-edge text-to-speech and translation models locally, the same ones previously confined to data centers blog.google. It also performs complex camera tricks like the “Best Take” group photo merger and Audio Magic Eraser using a suite of AI models on-device blog.google. Samsung and other Android OEMs use Qualcomm’s Snapdragon chipsets, whose latest NPUs (Hexagon AI engine) can even run large language models on the phone – Qualcomm demonstrated running a 10 billion-parameter LLM and even Stable Diffusion image generation on a phone with Snapdragon 8 Gen 3 futurumgroup.com. This chip’s AI engine is 98% faster than the last generation and supports INT4 precision for efficiency futurumgroup.com. Practical upshot: your 2024 phone can do things like summarizing articles, answering questions, or editing photos with AI without needing the cloud. Even accessibility features benefit: e.g. Pixel phones now have on-device voice typing, live captions, and an upcoming feature to describe images to blind users using a local model.
  • Smart Cameras & Security Systems: AI-enabled cameras use on-board NPUs to detect people, faces, animals, or suspicious behavior instantly. For example, EnGenius’ latest security cameras include a built-in NPU that handles object detection and converts video to metadata right on the camera, eliminating the need for a separate video recorder and boosting security (since video can be analyzed and stored locally) engeniustech.com. This means your security cam can decide “person present” or “package delivered” and send just that alert, instead of streaming hours of footage to a cloud service. Similarly, consumer devices like the Google Nest Cam IQ had an on-device vision chip (Google Edge TPU) to recognize familiar faces and differentiate humans vs. pets in its field of view. DSLR and mirrorless cameras are also adding AI processors for things like subject tracking, eye autofocus, and scene optimization in real time. In drones, onboard AI chips help with obstacle avoidance and visual navigation without requiring remote control. Notably, Google’s Edge TPU (a tiny ASIC module) has become a popular add-on for DIY and industrial IoT cameras – it provides 4 TOPS of vision processing power for tasks like detecting people or reading license plates, while using only ~2 watts coral.ai.
  • Smart Home & IoT Devices: Beyond phones, many smart home gadgets have mini NPUs. Voice-activated speakers (Amazon Echo, Google Nest Hub, etc.) now often include local speech recognition chips. Amazon developed the AZ1 Neural Edge processor for the Echo devices to speed up Alexa’s wake word detection and responses on the device, cutting the latency in half embedl.com. The AZ1 (built with MediaTek) runs a neural network that recognizes “Alexa” and processes simple commands without reaching the cloud embedl.com. This not only makes Alexa feel faster but also keeps more voice data private. Likewise, many new TVs, appliances, and even toys have some AI at the edge – e.g. a smart refrigerator’s camera can identify foods and expiration dates locally. Wearables deserve mention too: the Apple Watch’s S9 chip added a 4-core Neural Engine to better handle health AI algorithms and Siri requests on-watch apple.fandom.com. And on the industrial side, IoT sensors with NPUs can perform anomaly detection on equipment data right at the edge, only flagging the relevant events upstream (saving bandwidth and responding faster to issues).
  • Automobiles (ADAS and Autonomy): Cars have become AI hubs on wheels. Advanced driver-assistance systems (ADAS) and self-driving features rely on a suite of onboard AI accelerators to interpret camera, LiDAR, and radar feeds and make driving decisions in a split second. Tesla famously designed its own FSD (Full Self-Driving) Computer with dual NPU chips. Tesla’s FSD computer (HW3, introduced 2019) provided 144 TOPS across its two chips (each chip’s pair of NPUs delivering roughly 72 TOPS); the newer HW4 (2023) ups that to roughly 200–250 TOPS total (two 7nm NPUs around 100+ TOPS each) ts2.tech. This enables the car to process full-resolution video from 8 cameras, sonar, etc., simultaneously through neural networks for perception and even run some language models for voice commands – all locally inside the car’s module. Competing platforms like NVIDIA Drive and Qualcomm Snapdragon Ride also integrate NPUs. NVIDIA’s latest car supercomputer chip, Drive Thor, slated for 2025 cars, boasts up to 1,000 TOPS on a single chip (and 2,000 TOPS when two are paired) to support Level 4 autonomy ts2.tech. It combines a GPU, CPU, and dedicated deep learning accelerators so it can handle everything from road sign recognition to driver monitoring AI on the chip ts2.tech. These NPUs are literally life-saving: an autonomous car can’t wait for cloud servers if a child runs into the street. Onboard AI must see and react within tens of milliseconds. Outside of passenger cars, you also find heavy use of edge AI in autonomous drones, delivery robots, and industrial vehicles, which navigate and make decisions with on-board NPUs/TPUs (for example, Nuro’s delivery robots and many self-driving trucking systems use NVIDIA or Huawei AI chips on-device).
  • Edge Computing & Industry: In factories and enterprise settings, on-device AI often takes the form of edge servers or gateways with AI accelerators. Instead of sending camera feeds or sensor data to a central cloud, companies install edge boxes (sometimes GPU-based, sometimes NPU/FPGA-based) on premises. These handle tasks like real-time video analytics for quality control on a production line, detecting defects using AI vision in microseconds. Healthcare devices are another example: a portable ultrasound or MRI might have an NPU to do AI image analysis on-device, so doctors get instant diagnostic help without needing an internet connection (which is also better for patient data privacy). Retail and cities deploy AI at the edge too – e.g. smart traffic cameras with NPUs to analyze congestion and adjust lights, or retail shelf cameras that track inventory. Many of these use specialized NPUs like the Intel Movidius Myriad chips or Google’s Edge TPU or new entrants like Hailo-8 (an Israeli NPU that delivers 26 TOPS in a few watts for cameras). The common thread is these accelerators allow analysis to happen locally, achieving real-time results and keeping only high-level insights (rather than raw data) flowing over networks.

The versatility of NPUs/TPUs across device types is impressive. One moment they’re enabling your phone to blur the background in a photo with AI and the next they’re guiding a drone or scanning medical images. Smartphone cameras now use NPUs for features like Night Mode (aggregating multiple frames intelligently), Portrait mode bokeh, scene recognition (your phone knows you’re shooting a “sunset” and optimizes colors via AI), and even for fun AR effects (Animoji mapping your face, or Snapchat filters tracking your movements – all thanks to on-device neural nets). Biometrics use NPUs: fingerprint scanners enhanced with AI for liveness detection, face unlock with depth sensors plus AI. Audio uses them too: noise cancellation in earbuds and phones is now often AI-driven, with NPUs separating voice from background noise in real time.

A concrete example of 2024 innovation: Oppo (the smartphone maker), in partnership with MediaTek, announced it implemented a Mixture-of-Experts (MoE) AI model directly on-device in late 2024 – reportedly the first to do so in a phone grandviewresearch.com. This advanced neural network architecture (MoE) can boost performance by activating only relevant “expert” subnetworks per task, and doing this on-device means Oppo phones can achieve faster AI processing and better energy efficiency for complex tasks, without needing cloud assistance grandviewresearch.com. It underscores how even cutting-edge AI research is quickly making its way into our handheld devices through improved NPUs.

Inside the 2025 AI Chips: Latest Developments from Apple, Google, Qualcomm, and More

The race to build better on-device AI hardware has heated up rapidly. Here’s a look at what major companies have rolled out recently (2024–2025) in terms of NPUs/TPUs and AI silicon:

  • Apple: Apple’s custom silicon strategy has long emphasized on-device machine learning. Every year, Apple’s Neural Engine has grown in power. In the 2023 iPhone 15 Pro, the A17 Pro chip’s Neural Engine hit 35 TOPS (trillion operations per second) with its 16 cores apple.fandom.com. This was double the raw throughput of the A16’s NPU, and Apple used that to enable things like on-device speech recognition for Siri (finally processing many Siri requests without internet) and new camera capabilities (like automatically captured Portrait mode and live translation of text via the camera). Apple’s newer chips continued the trend: the M3 family for Macs (late 2023) got an updated Neural Engine (interestingly tuned to 18 TOPS on the base M3 chip, with a greater focus on efficiency) apple.fandom.com. In 2024, Apple introduced the M4 chip (for high-end iPads/Macs, mid-2024), which reportedly raised the Neural Engine to 38 TOPS on a refined 3nm process apple.fandom.com. Beyond just numbers, Apple has been using that NPU: features like Personal Voice (which creates a clone of a user’s voice after 15 minutes of training) run privately on the Neural Engine in iPhones, and Live Voicemail transcriptions happen locally. Apple has also integrated NPUs into all its device classes – even AirPods Pro have a tiny neural chip for Adaptive Audio. Apple’s execs often highlight the privacy angle: “machine learning on your device” means your data stays with you. By 2025, we expect Apple’s Neural Engine to possibly expand further or become available to third-party apps in new ways (already Core ML lets devs use it, but Apple could open more neural API access). There are also rumors of Apple designing a standalone AI accelerator for future glasses or cars, but current products show they prefer integrated NPUs in their A-series and M-series SoCs.
  • Google: Google not only pioneered the cloud TPU but also doubled down on on-device AI for Pixel phones and consumer devices. The Google Tensor SoC (first introduced 2021 in Pixel 6) was unique in that Google, famous for cloud, made a phone chip to run AI on the handset. By the Tensor G3 (in 2023’s Pixel 8), Google highlighted upgrades enabling generative AI on-device. Google explicitly said Pixel 8’s chip brings “Google AI research directly to our newest phones” blog.google. The Tensor G3’s next-gen TPU (Google still calls the AI core a “TPU” internally) allows the Pixel to run advanced models like PaLM 2 or Gemini Nano (slimmed versions of Google’s large language models) on the device for features like summarizing websites or voice typing improvements reddit.com. One headline feature: Pixel 8 can run Google’s best text-to-speech model (the one used in its data centers) locally, which lets the phone read webpages aloud in natural voices and even translate them in real time, all offline blog.google. Google also uses the TPU in Pixel for photography (“HDR+” multi-frame imaging, Magic Eraser object removal using AI inpainting blog.google), for security (on-device face unlock via AI now deemed strong enough for payments blog.google), and for speech (the Assistant that doesn’t mind you saying “umm”). Beyond phones, Google offers the Coral Dev Board and USB stick for hobbyists and enterprises to add Edge TPUs to their projects, each containing Google’s Edge TPU that provides 4 TOPS for vision tasks at very low power coral.ai. It’s used in some of Google’s own products like the Nest Hub Max for gesture recognition. For Google, integrating TPUs at the edge is part of a broader strategy: Sundar Pichai (Google’s CEO) has said the future of AI is about augmenting every experience, and clearly, Google sees that “to bring the transformative power of AI to everyday life, you need to access it from the device you use every day” blog.google – hence Tensor chips. We might anticipate a Tensor G4 in late 2024 Pixel phones, possibly built on Samsung or TSMC’s newer process, further improving AI performance and efficiency, maybe even enabling on-device multimodal AI (combining vision+language models).
  • Qualcomm: The leading mobile chip vendor for Android phones has aggressively pushed its AI Engine in the Snapdragon series. The Snapdragon 8 Gen 2 (late 2022) introduced dedicated INT4 support and showed off real-time stable diffusion image generation on a phone. The Snapdragon 8 Gen 3 (announced late 2023, in 2024’s flagship phones) is a major leap: Qualcomm says its Hexagon NPU is 98% faster than the Gen 2’s and 40% more power-efficient futurumgroup.com. This chip can run large language models with up to 10 billion parameters entirely on-device, processing about 20 tokens per second – enough for simple conversations with an AI assistant without the cloud futurumgroup.com. It also achieved the “world’s fastest Stable Diffusion” image generation on a mobile device in demos futurumgroup.com. Qualcomm has been vocal that on-device generative AI is a key selling point for new phones. For instance, they partnered with Meta to optimize the open-source Llama 2 LLM for Snapdragon, aiming to let you run a chatbot AI on your phone by 2024 iconnect007.com. (One Qualcomm exec said: “we applaud Meta’s open approach… to scale generative AI, it must run on both cloud and edge”, reinforcing the edge AI philosophy iconnect007.com.) Beyond phones, Qualcomm is putting NPUs in laptop chips (the Snapdragon compute platforms for Windows on ARM) – and their automotive platform Snapdragon Ride uses the same AI cores to offer up to 30 TOPS for ADAS, with a roadmap into hundreds of TOPS. In 2025, Qualcomm even announced a new Snapdragon X Elite CPU for PCs that includes a beefy NPU, signaling an aim to challenge Apple and Intel on AI performance in personal computers. With the rise of on-device AI, Qualcomm is actually branding some phones “AI phones.” They project that many apps (from photography to messaging to productivity) will leverage the NPU. On the software side, Qualcomm released the Qualcomm AI Stack to unify support for popular frameworks (TensorFlow Lite, PyTorch, ONNX) on their NPUs iconnect007.com – trying to make it easier for developers to use the AI hardware without deep chip knowledge.
  • MediaTek: The #2 mobile chip maker (known for Dimensity series) has also upgraded its NPUs. MediaTek calls its AI engines “APU” (AI Processing Unit). For example, the Dimensity 9200+ (2023) has a sixth-gen APU with significant performance uplift over the previous chip, enabling features like on-device stable diffusion and AI noise reduction in videos. In 2024, MediaTek announced the Dimensity 9400, and in a partnership with Oppo, they utilized its advanced NPU architecture to introduce new AI features (as mentioned, Oppo Find X8’s AI photo remastering with reflection removal and unblurring is powered by MediaTek’s NPU) mediatek.com. MediaTek executives have explicitly positioned themselves at the forefront of on-device AI. As Will Chen of MediaTek put it, “the future of AI transcends the cloud; it is driven by edge computing right from the palm of your hand.” In their view, AI on phones must be fast, private, secure, and consistently accessible mediatek.com. MediaTek even formed an “APU-centric” collaboration with Meta to support Llama frameworks and with device makers like Oppo and Xiaomi focusing on AI camera and AI voice features. By 2025, MediaTek plans to roll out these NPUs not just in phones, but also in smart TVs (for AI upscaling and picture enhancement), IoT devices, and even autos (MediaTek has an automotive AI platform and has partnered with Nvidia to integrate Nvidia GPU IP for cars, while presumably providing its own NPU for sensor AI).
  • Intel: 2024 marked Intel’s entry into AI accelerators on mainstream PCs. Intel’s Meteor Lake processors (launched Dec 2023 under the new Core Ultra branding) are Intel’s first mainstream x86 PC chips with a built-in neural processing unit (NPU). Meteor Lake’s NPU (sometimes called the VPU – Vision Processing Unit – based on Intel’s Movidius tech) delivers about 8–12 TOPS of AI performance pcworld.com. It is used to accelerate Windows 11’s AI features like background blur and eye contact in video calls, and it can be used by apps for things like local transcription, noise suppression, or even small AI assistants. Microsoft and Intel together have been pushing the concept of the “AI PC.” Intel claims these NPUs will ship in tens of millions of laptops in 2024 pcworld.com. Following Meteor Lake, Intel’s roadmap mentions Arrow Lake (for desktops in 2024), which also includes an NPU (around 13 TOPS, slightly improved) pcworld.com. Interestingly, Intel’s first attempt at a desktop NPU was actually outdone by AMD (see below), and Intel chose to go with a modest NPU design to avoid sacrificing GPU/CPU area in enthusiast chips pcworld.com. But by late 2024, Intel signaled that its Lunar Lake chips will have a much beefier NPU (~45 TOPS) to meet Microsoft’s “Copilot” requirements pcworld.com. All of this indicates Intel sees AI as a must-have for PCs moving forward – not for training huge models, but for accelerating everyday AI-powered experiences (from office suite enhancements to creative tools using local AI). Intel also sells edge AI accelerators like the Intel Movidius Myriad chips (used in some drones, cameras) and the Habana accelerators for servers, but Meteor Lake’s integrated NPU is a milestone bringing AI to the average consumer device.
  • AMD: AMD jumped into on-device AI around the same time. Its Ryzen 7040 series laptop processors (Phoenix) released in 2023 featured the first Ryzen AI Engine – essentially an integrated XDNA NPU (tech from AMD’s Xilinx acquisition). This NPU delivered up to 10 TOPS on the mobile chip en.wikipedia.org. AMD touted use cases like AI-enhanced video calls, productivity apps, and so on, similar to Intel’s aims. Then AMD briefly launched a Ryzen 8000 desktop series (early 2024) with an NPU hitting 39 TOPS – a very high number for a general-purpose CPU’s AI unit, even surpassing Intel’s plans pcworld.com. However, AMD quickly changed course and skipped a generation, focusing on its next architecture (the subsequent Ryzen 9000 in late 2024 dropped the NPU to prioritize core upgrades) pcworld.com. Nonetheless, AMD is expected to bring NPUs back in future PC chips (it’s likely a temporary retreat as they work on integrating a strong AI engine without compromising other performance). On the product side, AMD’s NPUs could enable interesting things since AMD also has strong GPUs – a combination could handle AI workloads collaboratively (some parts on NPU, some on GPU). AMD has also been putting AI cores into its adaptive (FPGA-based) SoCs and automotive chips. In summary, by 2025 all x86 PC chipmakers have embraced NPUs, aligning with what smartphones did a few years prior, indicating AI acceleration is becoming a standard feature across the board.
  • Others: A variety of specialized chip companies and other tech firms are innovating in NPUs as well. NVIDIA, known for GPUs, now includes dedicated Tensor Cores in their GPUs and offers an open NVDLA (deep learning accelerator) design for integration into System-on-Chip products. In edge devices like the NVIDIA Jetson series (used in robots, drones, embedded systems), there are both the GPU and fixed-function “DLAs” – essentially NPUs – that offload some neural network inference from the GPU. NVIDIA’s Orin module for example has 2 DLAs in addition to its GPU, contributing to its 254 TOPS of AI performance for cars ts2.tech. Apple is rumored to be working on even more advanced AI coprocessors or larger neural engines for their AR glasses or future projects, though details are secret. Huawei (despite geopolitical challenges) continues to design Kirin mobile chips with NPUs (their “DaVinci” NPU architecture) and also server-class NPUs in their Ascend AI chips – their 2023 Kirin 9000S chip reportedly retains a strong NPU for image and language tasks on their phones. We also see startups like Hailo, Mythic, Graphcore, and others offering their own edge AI chips: e.g. Hailo-8 as mentioned (26 TOPS in a mini PCIe card for AI cameras), Graphcore’s IPU for datacenters (not exactly on-device, but a new architecture for neural nets), Mythic working on analog NPUs, etc. ARM, whose designs underlie most mobile chips, offers the Ethos NPU series (such as Ethos-U, Ethos-N78) that chipmakers can integrate to get a ready-made AI accelerator in IoT or mid-range SoCs. This has allowed even relatively smaller players to include NPUs in their chips by licensing ARM’s design.

The bottom line is that from big tech to startups, everyone is investing in on-device AI silicon. As a result, we’re seeing rapid improvements: new chips boasting higher TOPS, better efficiency (TOPS per watt), and support for new data types (like 4-bit quantization for larger models). For example, Qualcomm’s and MediaTek’s latest chips can run at INT4 precision, which is great for generative AI models where memory bandwidth is a limiter androidauthority.com. These innovations directly translate to user benefits – e.g. real-time mobile AI video editing (removing objects from 4K video on the fly, as Snapdragon 8 Gen 3 can do with its “Video Object Eraser” AI feature futurumgroup.com), or AI co-processors in cars enabling voice assistants that work with no network and respond as fast as a human conversation.

Key News from 2024–2025: Launches, Benchmarks, and Partnerships

To illustrate how fast things are moving, here are some headline events in the world of NPUs/TPUs and on-device AI from late 2024 into 2025:

  • Apple M3 and M4 unveilings (Oct 2023 & May 2024): Brought next-gen Neural Engines. M3’s Neural Engine does 18 TOPS (16-core), and the M4 jumped to 38 TOPS (still 16-core but with higher clocks/efficiency) apple.fandom.com. Apple demonstrated these chips handling intensive tasks like on-device Stable Diffusion image generation in macOS (with Core ML Stable Diffusion, developers showed ~15 seconds to generate an image on an M2 – even faster on M3/M4).
  • Google Pixel 8 launch (Oct 2023): Emphasized AI “everywhere” in the device. Google’s event demoed Pixel 8’s on-device summarization of web pages and live translation of articles using its Tensor G3 NPU. It also introduced the “Assistant with Bard” which will eventually run some interactions on-device. Google touted that Pixel 8 can run 2× as many models on-device as Pixel 6 could, and models that are far more sophisticated blog.google. In other words, a huge jump in just two years of Tensor chip development.
  • Qualcomm–Meta partnership (July 2023): Qualcomm and Meta announced they are optimizing Meta’s Llama 2 large language model to run fully on Snapdragon NPUs by 2024 iconnect007.com. The aim is to enable developers to deploy chatbots and generative AI apps on phones, VR headsets, PCs, etc., without cloud. This was a significant endorsement of on-device AI by a major AI model owner (Meta) and a major chipmaker. In late 2024, they followed up with plans for Llama 3 optimization as well qualcomm.com.
  • Microsoft Windows 11 “Copilot” PCs (2024): Microsoft set a benchmark calling PCs with >40 TOPS of local AI acceleration “AI PCs” eligible for enhanced AI features (like the Copilot digital assistant integration). This pushed OEMs – Lenovo, Dell, etc. – to adopt chips with NPUs (be it Intel, AMD, or Qualcomm) to meet the spec. The result is an expected wave of AI-capable laptops in 2024, with Microsoft claiming dozens of models on the way and forecasting over 40 million AI PC shipments in 2024 pcworld.com.
  • AMD’s brief Ryzen 8000 NPU (Jan 2024): AMD announced a desktop CPU with a whopping 39 TOPS NPU (a surprise since desktop chips usually lack such accelerators) pcworld.com. Although that particular product was superseded quickly, it showed that even desktop CPUs can have AI silicon rivaling mobile chips in TOPS. This was also the first desktop x86 CPU to carry an NPU (just beating Intel Arrow Lake to the punch).
  • Tesla FSD Beta v12 (late 2023) demos: Elon Musk showcased end-to-end AI driving (no radar, just vision nets) running on Tesla’s HW3/HW4 NPUs. Notably, the neural network drove the car using video feeds processed entirely on the car’s onboard computer in real time. Observers noted FSD v12 fully utilized the 2× 100 TOPS NPUs for vision, and Tesla hinted that a future upgrade (HW5) aiming for 2,000 TOPS may be in development to handle even larger models (there were rumors Tesla’s HW5 could target 2 petaFLOPS = 2,000 TOPS) notateslaapp.com.
  • NVIDIA Drive Thor revealed (2024 GTC): NVIDIA unveiled details of its next automotive chip, Drive Thor, which packs roughly 2× the AI compute of its predecessor Orin – up to 2,000 TOPS when two chips are linked ts2.tech. Significantly, Thor is pitched as handling not just driving tasks but also in-cabin AI (like voice and occupant monitoring) on one platform, showing how NPUs and GPUs together can consolidate many AI functions in cars ts2.tech. Several automakers (Xpeng, BYD, Volvo) announced they will use Thor from 2025 ts2.tech.
  • Oppo’s on-device MoE AI (Oct 2024): As mentioned, Oppo implemented a Mixture-of-Experts model on the Find X8 phone grandviewresearch.com. This is newsworthy because MoE models are usually large and were considered server-side due to their complexity. Running MoE on-device suggests new techniques in model compression and a very capable NPU (likely the MediaTek Dimensity 9400 in that device).
  • Meta’s Ray-Ban AI glasses (2025): (Expected) Meta showed off prototypes of smart glasses that can identify what you see and speak to you about it – likely using an on-board custom accelerator (Meta has been prototyping custom silicon for AR). While details are scant, it underscores the push to put AI into very constrained devices (glasses, battery earbuds) which would necessitate ultra-efficient NPUs.
  • MLPerf Mobile Inference Benchmarks (2023–24): MLCommons released results showing the latest smartphones’ AI prowess. For example, in MLPerf Inference v3.0 (Oct 2023), Apple’s A16, Google Tensor G2, and Qualcomm’s Snapdragon 8 Gen 2 were all benchmarked on tasks like image classification and object detection. The numbers showed Apple and Qualcomm trading wins, and more broadly that mobile NPUs are closing the gap with some laptop/desktop-class accelerators on those tasks – all while running on battery. The results also highlighted software differences (e.g. Qualcomm’s AI SDK vs. Apple Core ML). The continued improvements each year (double-digit % gains) in these benchmarks demonstrate the healthy competition and rapid progress in on-device AI.
  • Strategic partnerships: Many cross-industry partnerships formed. E.g., NVIDIA and MediaTek (May 2023) announced a tie-up to put Nvidia GPU IP and software ecosystem into MediaTek’s future smartphone and automotive chips, effectively marrying Nvidia’s AI strengths with MediaTek’s mobile SoC expertise. Also, companies like Qualcomm are partnering with carmakers (Mercedes, BMW) to put Snapdragon Cockpit and Ride platforms (with NPUs) into new vehicles for AI features. Arm has been partnering with Fujitsu and others for new AI chip designs (like the Fugaku supercomputer’s AI partition, though that’s high-end). Even IBM and Samsung teased new chip technologies (like neuromorphic computing and AI memory) that could one day revolutionize NPUs – not here yet, but showing that research pipelines are full.

All told, the past year has been packed with developments, underlining that on-device AI is one of the hottest areas in tech. As one industry analyst noted, “these on-device capabilities unlock entirely new horizons… running LLMs on mobile helps address scale and cost, keeps data private, and ensures AI works even with limited connectivity” futurumgroup.com. That pretty much sums up why every big tech firm is investing here.

Expert Insights: What Tech Leaders Say about On-Device AI

The momentum behind NPUs and TPUs is not just evident in products but also in the words of industry leaders. Here are a few choice quotes and perspectives that shed light on the significance of on-device AI:

  • Cristiano Amon (CEO of Qualcomm): “If AI is going to get scale, you’re going to see it running on devices… This marks a turning point in AI: no latency issues — just seamless, secure, cloud‐complementary on-device inference. The future of AI is personal, and it starts on your device.” (Bloomberg interview and X post, 2023) x.com. Amon envisions a hybrid AI world where your phone/PC handles a lot on its own NPUs, working with the cloud when needed. He emphasizes that running AI locally is key to making it ubiquitous (you can’t have everything rely on cloud GPUs – there aren’t enough of them in the world for billions of devices).
  • Durga Malladi (SVP, Qualcomm): “We applaud Meta’s approach to open and responsible AI… To effectively scale generative AI into the mainstream, AI will need to run on both the cloud and devices at the edge.” iconnect007.com Malladi said this in the context of the Meta partnership. It highlights a common view: scaling AI = cloud + edge working together. There’s an understanding now that purely cloud AI won’t be sufficient (for cost, privacy, and latency reasons), so edge AI must share the load.
  • Will Chen (Deputy GM, MediaTek): “The future of AI transcends the cloud; it is driven by edge computing right from the palm of your hand… OPPO and MediaTek are pioneering on-device AI, ensuring that intelligent capabilities are powerful, fast, private, secure, and consistently accessible.” (MediaTek Exec Talk, 2025) mediatek.com. This quote neatly encapsulates the value proposition of on-device AI – you get performance and accessibility plus privacy and security. It also shows that even companies traditionally less visible in the West (like MediaTek) are thinking on the cutting edge of AI deployment.
  • Dr. Norman Wang (AI hardware expert, CEO of a chip startup): “In AI hardware, the closer you can put compute to the data source, the better. It’s about reducing data movement. An NPU next to your image sensor means you’re not shuttling megapixels to the cloud – you’re distilling insights right at the edge. That’s a game changer for latency and power.” (Panel at HotChips 2024 – paraphrased). This technical insight explains why NPUs often sit on the same silicon as other components: e.g., on a phone’s SoC, the NPU can directly grab camera data from the ISP. Minimizing data movement is a huge part of efficient AI, and edge AI achieves that by doing processing at the source of the data.
  • Xinzhou Wu (VP of Automotive, NVIDIA): “Accelerated compute has led to transformative breakthroughs, including generative AI, which is redefining autonomy and the transportation industry.” (GTC 2024 Keynote) ts2.tech. He was discussing how powerful on-board computers (with NPUs/GPUs) enable cars not just to drive, but to potentially incorporate advanced AI like generative models for things like natural language interfaces in the car or better understanding of situations. It underscores that even sectors like automotive see on-device AI as not only for core functionality but also for improving user experience (e.g., voice assistants in cars that can hold conversations thanks to on-board LLMs).
  • Sundar Pichai (CEO of Google): “The future of AI is about making it helpful for everyone. That means bringing AI into all the devices we use – phones, appliances, cars – so it’s there when you need it. We want to meet users where they are, with AI that works in real-time, on-site, and preserves privacy.” (Paraphrased from multiple interviews/keynotes). Pichai often talks about “ambient AI” – the idea that AI will be all around us, embedded in things. Google’s push with Tensor chips in Pixels is a direct execution of that philosophy.
  • Industry Stats: Analysts have observed the trend in numbers. A report by Grand View Research in 2024 noted: “Recent advancements in specialized AI chips and NPUs have enabled complex AI algorithms to run directly on devices, significantly enhancing performance and energy efficiency… we are nearing a pivotal transition toward on-device AI.” grandviewresearch.com. The same report projects the on-device AI market to explode in the coming years, with the hardware segment (NPUs, etc.) making up over 60% of the revenue share in 2024 and growing as nearly every new IoT or mobile device adopts AI capabilities grandviewresearch.com. Another forecast by IDC and others suggests that by the mid-2020s, almost all high-end smartphones and the majority of mid-range ones will have AI accelerators, and that by 2030, billions of edge AI chips will be in use from consumer electronics to smart infrastructure.

The consensus among experts is that on-device AI isn’t just a nice-to-have – it’s essential for the next wave of technology. AI pioneer Andrew Ng has often mentioned that “tiny AI” and edge AI will allow intelligence to penetrate every object, analogous to how electricity or the internet did in earlier eras. By overcoming the limitations of cloud-only AI, NPUs and TPUs are enabling this penetration.

The Challenge of Many Standards (and Efforts to Simplify)

While the hardware has advanced quickly, the ecosystem of software and standards for on-device AI is still catching up. Developers face a jungle of tools and SDKs when trying to leverage NPUs across different devices nimbleedge.com. Key points:

  • Each platform has its own API or SDK: Apple has Core ML (with APIs to target the Neural Engine), Android has Neural Networks API (NNAPI) (though Google announced plans to evolve it beyond Android 14) threads.com, Qualcomm offers the SNPE (Snapdragon Neural Processing Engine) or more broadly the Qualcomm AI Stack, NVIDIA has TensorRT and CUDA for its devices, and so on. There’s also ONNX Runtime, TensorFlow Lite, PyTorch Mobile, MediaTek NeuroPilot, Huawei HiAI, and others. These varying SDKs often have different capabilities and require model tweaking to run optimally on each target. As a 2025 on-device AI report noted, “Multiple, incompatible SDKs (e.g., Core ML, LiteRT, ONNX Runtime) with varying operator support and performance” force developers to do extra work nimbleedge.com.
  • Fragmentation issues: A model that runs perfectly on a desktop GPU might not readily run on a phone’s NPU – operators (the math functions) might not be supported or need to be quantized differently. Developers sometimes have to maintain separate builds or manually optimize models for each hardware. This is the “low-level, fragmented ecosystem” complaint nimbleedge.com. Debugging tools are also sparse – profiling an NPU to see why a model is slow can be hard, especially compared to the rich tools for CPUs/GPUs nimbleedge.com.
  • Standardization efforts: To tackle this, there are a few things happening. ONNX (Open Neural Network Exchange) has emerged as a common format so you can train a model in PyTorch or TensorFlow and then export it to ONNX for deployment (see the sketch after this list). Many runtimes (including on-device ones like Qualcomm’s and MediaTek’s) support ingesting ONNX models and will attempt to compile them for the hardware. This helps avoid lock-in to a single framework. Android NNAPI was an attempt by Google to provide a universal interface – an app can request “run this neural net” through NNAPI and the OS will use whatever accelerator is present (GPU, DSP, or NPU) to execute it. NNAPI saw adoption in many Android devices, but it had limitations and not all vendors provided robust drivers, leading Google to indicate a new strategy (possibly leaning on WebNN or direct vendor integrations) beyond 2024 threads.com. On PCs, Microsoft introduced DirectML and Windows ML APIs to similarly abstract hardware differences (allowing a developer to use the same API across NVIDIA, Intel, and AMD accelerators).
  • Unified Toolchains: Companies are also building toolchains to streamline deployment. We saw Qualcomm’s AI Stack which combines their compiler (AI Model Efficiency Toolkit) and runtimes so developers can target their Hexagon NPU more easily iconnect007.com. NVIDIA’s TensorRT and related SDKs do something similar for Jetson devices, optimizing models for GPU+NVDLA. Intel OpenVINO is another – it lets you take a model and optimize it for Intel CPUs, iGPUs, and VPUs (NPUs) for edge deployments. These frameworks often include model optimizers that convert models (pruning, quantizing) to fit on smaller devices.
  • Interoperability: There is movement toward making different NPUs work with common frameworks. For example, Google’s TensorFlow Lite has hardware delegates – one for NNAPI (covers Android devices generically), one for Core ML (iOS devices), one for Edge TPU, etc. The idea is you write your TFLite model and it will execute using the best accelerator available via the delegate. Similarly, PyTorch has been adding support for mobile backends and even things like Apple’s Metal Performance Shaders (to use GPU/NPU on iOS). ONNX Runtime can also target different accelerators via plugins (e.g., one can plug in NVIDIA’s TensorRT or ARM’s Compute Library or others under the hood).
  • Emerging standards: The Khronos Group (behind OpenGL/Vulkan) worked on NNEF (Neural Network Exchange Format), and the WebNN API is being discussed as a way for browsers to access local AI acceleration. None are universally adopted yet. But one interesting development: in late 2024, several companies formed an alliance to push for “AI Hardware Common Layer” standards – basically, exploring whether a common low-level interface to NPUs could be created (analogous to what OpenCL did for GPU compute). It’s early, though.
  • Developer experience: It’s an acknowledged gap. As NimbleEdge’s blog said, “developing for on-device AI currently requires navigating a fragmented and low-level ecosystem… forcing developers to tailor implementations for each hardware target” nimbleedge.com. The industry knows this has to improve for on-device AI to truly go mainstream. We may see consolidation – for instance, if Google and Apple and Qualcomm could all agree on some core set of ops and API (wishful thinking, perhaps). Or more likely, frameworks like PyTorch and TensorFlow will hide the complexity by integrating all those vendor libraries and picking the right one at runtime.
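
To make the portability idea concrete, here is a minimal sketch of the ONNX route referenced above: define a model in PyTorch, export it once to an .onnx file, and execute it with ONNX Runtime. On an actual device, the same file would be handed to a vendor runtime or to an ONNX Runtime execution provider that targets the NPU; this sketch uses the default CPU provider, and the model and file names are placeholders.

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# A stand-in model; any trained PyTorch model could be exported the same way.
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))
model.eval()

# Export once to the framework-neutral ONNX format.
dummy_input = torch.randn(1, 64)
torch.onnx.export(model, dummy_input, "tiny_classifier.onnx",
                  input_names=["input"], output_names=["logits"])

# Run the exported model with ONNX Runtime. Targeting an NPU would mean
# choosing a different execution provider here, not rewriting the model.
session = ort.InferenceSession("tiny_classifier.onnx")
logits = session.run(None, {"input": np.random.randn(1, 64).astype(np.float32)})[0]
print(logits.shape)   # (1, 10)
```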

In essence, while NPUs/TPUs provide the muscle, the community is working on developer-friendly tools to harness that muscle. The good news is that compared to, say, five years ago, there are far more options to deploy a model on-device without being a chip expert. But there’s room to grow – especially in debugging, profiling, and multi-hardware support.

Market Trends and Future Outlook

The proliferation of NPUs and TPUs in devices is driving a larger trend: AI everywhere. Here are some high-level trends and what to expect looking forward:

  • Edge AI Market Growth: Market research indicates explosive growth in edge AI hardware. The on-device AI market (including chips and software) is projected to grow at ~29% CAGR through the decade nimbleedge.com. One report valued it at ~$233 billion in 2024, heading to over $1.7 trillion by 2032 nimbleedge.com – a lot of that growth riding on edge deployments. Another analysis by IDTechEx forecast the AI chip market for edge devices will hit $22 billion by 2034, with consumer electronics, automotive, and industrial being the biggest segments idtechex.com. This implies hundreds of millions of devices per year shipping with NPUs as a standard component.
  • Ubiquitous Adoption: Much like every smartphone today has a GPU (even if a small one), we're reaching the point where every new smartphone will have an AI accelerator. High-end phones have them now; mid-range phones are next. Indeed, mid-tier chips from Qualcomm (e.g. the Snapdragon 7 series) and MediaTek (Dimensity 700/800 series) now include scaled-down NPUs so that features like AI camera enhancements and voice assistants can work on cheaper devices too. Beyond phones, NPUs are spreading to PCs (standard in new Windows laptops from multiple vendors), cars (almost all new cars with Level 2+ ADAS have some kind of AI chip), and IoT. Even appliances like fridges and washing machines are starting to tout "AI" features (some cloud-based, but some local, like adaptive cycles based on sensor data). The trend is clear: if a device has a compute chip, it will have some ML acceleration on that chip.
  • Performance Trajectory: On-device AI performance is doubling roughly every 1–2 years, thanks to a combination of better architectures and moves to advanced semiconductor nodes like 5nm, 4nm, and 3nm. Apple's Neural Engine went from 600 billion ops/sec in 2017 to 35 trillion in 2023 – nearly a 60× increase in six years apple.fandom.com. Qualcomm's flagships similarly jumped from a few TOPS in 2018 to over 27 TOPS in 2023 (the Snapdragon 8 Gen 3's total AI compute, counting all cores). By 2025–2026 we can expect mobile NPUs delivering 100+ TOPS and PC accelerators even more, though raw figures may matter less as the focus shifts to usable performance on specific AI tasks (for example, how large an LLM you can run smoothly, or whether you can do 4K AI video in real time). The gap between cloud and edge will likely narrow for inference tasks, but the edge will still lag the cloud for the absolute cutting-edge large models due to power and memory constraints.
  • Energy Efficiency Gains: One underrated aspect is how efficient these NPUs are getting. Tesla's car NPU achieves ~4.9 TOPS/Watt fuse.wikichip.org, which was state-of-the-art a couple of years ago; now some mobile NPUs claim similar or better. Efficient NPUs mean longer battery life even as we use AI features more. They also make it feasible to put AI into tiny battery-powered devices (e.g. AI hearing aids, or smart sensors running anomaly detection on coin-cell batteries). The concept of TinyML – extremely small-scale machine learning on microcontrollers – extends this, using simplified "NPUs" or optimized instructions to run AI directly in sensors. ARM's Ethos-U NPU is aimed at that segment (e.g. always-on keyword spotting running on a few milliwatts). Expect more AI-specific tiny chips that can be embedded into sensors, wearables, and everyday objects (smart toothbrush? AI-powered smoke detector? It's coming).
  • Hybrid Cloud-Edge Solutions: Rather than the edge completely replacing the cloud, the future is collaboration. Devices will do what they can locally and only reach out for what they can't. For instance, your AR glasses might run local scene recognition to know what you're looking at, but if you ask a very complex question (say, for a thorough explanation), they might query a cloud AI for a more powerful analysis and then present it. This hybrid approach gives the best balance of responsiveness and capability (a hypothetical local-first sketch follows this list). Companies are actively designing experiences around this: Microsoft's Copilot on Windows might use the local NPU for quick voice-to-text and command parsing but call on the cloud for heavy lifting (unless, perhaps, you have a beefy PC NPU that can handle it). Ideally the user shouldn't know or care which is used, other than things being faster and privacy-respecting. We'll also see federated learning become more common – models are improved in the cloud using updates computed on devices, so raw data never has to leave them.
  • Emerging Use Cases: As NPUs become more powerful, new applications open up. Generative AI on-device is a big one – imagine AI image creation, AI video editing, and personal chatbots all on your phone or laptop. By 2025, we might see early versions of offline personal assistants that can summarize your emails or draft messages without the cloud. Real-time language translation in conversation (two people speaking different languages, with phones or earbuds translating in near real time) will be vastly improved by on-device processing – no lag, and it works anywhere. Health AI might live on wearables: your smartwatch detecting atrial fibrillation or analyzing sleep apnea patterns using its NPU. Security is another: devices might run AI locally to detect malware or phishing in real time (e.g. antivirus using an AI model on your device rather than cloud scans). And in vehicles, besides driving, AI could personalize the in-car experience (adjusting climate control based on your perceived mood via driver-facing camera AI, and so on). Many of these use cases require quick iteration and privacy, which suits on-device processing.
  • Competition and Democratization: The big players will keep competing, which is good for consumers – expect marketing of “our AI chip does X TOPS or enables Y feature that others can’t.” But also, the technology is democratizing – NPUs aren’t just in $1000 phones; they’re coming to $300 phones, $50 IoT boards (Coral, Arduino Portenta, etc.), and open-source communities are creating tiny AI models that hobbyists can run on a Raspberry Pi or microcontroller with a basic accelerator. This widespread availability means innovation can come from anywhere. A lone developer can now build an app that uses on-device AI to do something clever without needing a server farm – lowering the barrier to entry for AI-driven software.
  • Future Tech: Looking further out, research into neuromorphic computing (brain-inspired chips like Intel Loihi) and analog AI chips could one day revolutionize NPUs, offering orders-of-magnitude efficiency gains. Companies like IBM and BrainChip are working on these. If successful, a neuromorphic chip might allow complex AI to run on tiny battery devices continuously. We might also see 3D stacking and new memory tech integrated into NPUs to overcome memory bottlenecks (some 2025+ chips might use HBM memory or new on-chip non-volatile memory to feed AI cores faster). Also, expect more specialization within AI chips: e.g., separate accelerators for vision, for speech, for recommendation models, etc., each tuned to their domain. Some SoCs already have dual NPUs (one “big” NPU for heavy tasks, one micro NPU in sensor hub for always-on light tasks).
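
The hybrid cloud-edge pattern described above is mostly a product-design decision, but a purely hypothetical sketch shows its shape: try the on-device model first and reach out to a cloud endpoint only when the local result is unavailable or low-confidence. Every name here (run_local_model, CLOUD_URL, the confidence threshold) is an illustrative assumption, not an existing API.

```python
# Hypothetical local-first routing; names, threshold, and endpoint are assumptions.
import json
import urllib.request

CLOUD_URL = "https://example.com/api/answer"   # placeholder cloud endpoint
CONFIDENCE_THRESHOLD = 0.7                     # arbitrary cutoff for this sketch

def run_local_model(query: str) -> tuple[str, float]:
    """Stand-in for an on-device (NPU-accelerated) model returning (answer, confidence)."""
    return "local answer", 0.4                 # pretend the small local model is unsure

def ask_cloud(query: str) -> str:
    """Send the query to a larger cloud model; used only when the local answer is weak."""
    payload = json.dumps({"query": query}).encode()
    req = urllib.request.Request(CLOUD_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["answer"]

def answer(query: str) -> str:
    local_answer, confidence = run_local_model(query)
    if confidence >= CONFIDENCE_THRESHOLD:
        return local_answer                    # fast, private, works offline
    try:
        return ask_cloud(query)                # heavier model for hard queries
    except OSError:
        return local_answer                    # offline fallback: best local effort
```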

In conclusion, the trajectory is clear: NPUs and TPUs are becoming as standard and as indispensable as CPUs in modern computing. They empower devices to be smarter, more responsive, and more considerate of our privacy. As one report stated, “high-performance processing units on devices are largely responsible for executing complex AI functions like image recognition, NLP, and real-time decision-making”, and this is driving more intelligent, responsive tech across sectors grandviewresearch.com.

We are entering an era where you’ll simply expect your device to understand and anticipate your needs – your phone edits photos and writes messages in your style, your car avoids accidents and entertains you with AI, your home gadgets learn your preferences – all made possible by the quiet neural processors inside them. On-device AI isn’t science fiction; it’s here now and rapidly improving. The marriage of NPUs and TPUs with our everyday gadgets is making AI personal, pervasive, and private – truly bringing the power of cloud intelligence down to earth (or at least, down to your pocket).

Sources:

  • Bigelow, Stephen. “GPUs vs. TPUs vs. NPUs: Comparing AI hardware options.” TechTarget, Aug. 27, 2024 techtarget.com. Describes the roles and differences of CPUs, GPUs, TPUs, and NPUs in AI workloads.
  • Backblaze Blog. “AI 101: GPU vs. TPU vs. NPU.” Backblaze, 2023 backblaze.com. Explanation of Google’s TPU design (systolic arrays, low precision) and NPU usage in mobile devices.
  • TechTarget WhatIs. “Tensor processing unit (TPU).” whatis.techtarget.com, 2023 techtarget.com. Notes that TPUs specialize in matrix math tasks and NPUs mimic brain neural networks for acceleration techtarget.com.
  • NimbleEdge Blog (Neeraj Poddar). “The State of On-Device AI: What’s Missing in Today’s Landscape.” June 26, 2025 nimbleedge.com. Outlines advantages of on-device AI (latency, offline, privacy, cost) and challenges like fragmented SDKs.
  • Qualcomm (OnQ Blog). “Bloomberg and Cristiano Amon talk on-device AI.” July 2023 x.com. CEO of Qualcomm on the importance of on-device inference for future AI (tweet quote about turning point in AI).
  • MediaTek Blog (Exec Talk by Will Chen). “Shaping the future of AI mobile experiences.” Mar. 3, 2025 mediatek.com. MediaTek and Oppo collaboration on NPUs; quote about edge computing in your hand and example of AI photo remastering using the NPU.
  • I-Connect007 / Qualcomm Press. “Qualcomm works with Meta to enable on-device AI (Llama 2).” Jul. 24, 2023 iconnect007.com. Press release with quote from Qualcomm SVP Durga Malladi about scaling generative AI via edge devices and cloud.
  • PCWorld (Mark Hachman). “Intel’s Core Ultra CPUs keep AI simple….” Oct. 24, 2024 pcworld.com. Discusses Intel Arrow Lake using Meteor Lake’s NPU (13 TOPS) and notes AMD’s Ryzen 8000 39 TOPS NPU and Microsoft’s 40 TOPS “Copilot” requirement.
  • Ts2 (Tech Empowerment). “Self-Driving Supercomputer Showdown: NVIDIA Thor vs Tesla HW4 vs Qualcomm Ride.” Sep. 2023 ts2.tech. Provides TOPS estimates: Tesla HW3 vs HW4 (72→100 TOPS per chip) ts2.tech, NVIDIA Thor ~1000 TOPS (or 2000 with dual) ts2.tech and quotes NVIDIA VP on generative AI in vehicles ts2.tech.
  • Grand View Research. “On-Device AI Market Report, 2030.” 2024 grandviewresearch.com. Notes the rise of specialized AI chips (NPUs) enabling complex AI on devices, and that hardware accounted for 60.4% of on-device AI market in 2024, driven by smartphones, IoT, NPUs etc.
  • Google Blog. “Google Tensor G3: Pixel 8’s AI-first processor.” Oct. 2023 blog.google. Describes Tensor G3’s upgrades for on-device generative AI, new TPU design, and on-device TTS model equal to data center quality.
  • Futurum Group. “Snapdragon 8 Gen 3 brings generative AI to smartphones.” Oct. 2023 futurumgroup.com. Futurum Group analysis detailing SD8Gen3’s AI engine: 10B-parameter LLM on-device, 98% faster NPU, world’s fastest Stable Diffusion on a phone, etc., plus benefits of on-device LLMs for cost/privacy/offline futurumgroup.com.
  • Apple Wiki (Fandom). “Neural Engine.” Updated 2025 apple.fandom.com. Neural Engine version history with A17 Pro 35 TOPS in 2023, etc. Shows evolution from 0.6 TOPS (A11) to 35 TOPS (A17) apple.fandom.com and M4 at 38 TOPS apple.fandom.com.
  • EnGenius Tech. “Cloud Edge Camera AI Surveillance.” 2023 engeniustech.com. Example of security camera with built-in NPU enabling on-camera AI processing and local storage (no NVR needed).
  • EmbedL. “Amazon releases AZ1 Neural Edge Processor.” Oct. 2020 embedl.com. Discusses Amazon’s AZ1 edge NPU for Echo devices, built with MediaTek, designed for on-device speech inference to cut latency and cloud dependence embedl.com.