Patrick Kennedy's Axautik Group LLC and ServeTheHome Stack

The NVIDIA BlueField DPU: From Acquisition Hodgepodge to Success in AI

And what it means for other DPUs

Patrick Kennedy
Sep 17, 2025

If you examine the NVIDIA BlueField DPU, the Intel E2100/E2200 IPU DPU, the AMD Pensando DPU, FPGA-based DPUs, and many others on the market, they all share a common high-level architectural characteristic. If you read today’s marketing materials, you might think they are completely different, because each vendor likes to focus on the significant differences at lower levels. Perhaps they could be called conceptually similar, but very different in their implementations. Today, we are going to tell the story of the origins of this architecture and some of its implications for today’s DPUs as they have become a hot technology in the world of AI infrastructure. This high-level architectural feature, which has been in the making for over a decade, has major implications for how hyper-scalers and enterprises adopt and maintain their fleets of DPUs and servers. It also might explain why it took NVIDIA’s full-stack AI solutions to push DPUs into the enterprise.

NVIDIA BlueField 3 DPU Optical Cages

Let us start from the beginning…

Birthing the High-Level Architectural NVIDIA BlueField DPU Archetype

In 2010, the server market was deploying the Intel Xeon 5600 series (Westmere-EP), which scaled to 6 cores/12 threads, a significant number in 2010. At the same time, a company called Tilera, out of San Jose, California, was focused on a TILE CPU architecture with 64 cores, getting it into the mainline Linux kernel in October 2010.

Dell C6100 XS23-TY3 Motherboard Tray Hot Swap

Tilera’s parts were high-core-count, MIPS-like CPUs, mainly focused on integer performance. In addition, they presented a challenge that Intel had yet to face: a lot of cores. While Intel was still focused on ring connections between multiple cores, Tilera needed to use a multi-mesh architecture, with each tile containing a core, cache, and a mesh router. If you are not familiar with this, we covered it quite a bit when Intel transitioned from its Xeon E5 line with the Broadwell-based Xeon E5 V4 family to the 1st generation Intel Xeon Scalable (Skylake) parts.

You can read more about this in the 2017 piece: Things are getting Meshy: Next-Generation Intel Skylake-SP CPUs Mesh Architecture. For those who just want a quick version, in Broadwell, there were rings that the CPU cores had stops on, meaning each would be connected to its neighbor. In contrast, Skylake introduced a new (for Intel) mesh architecture, allowing traversal to occur in multiple directions. For higher-core-count CPUs, the mesh makes a lot of sense. Today’s 128-core and higher server CPUs would choke on the pre-Skylake rings.
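
To put a rough number on that claim, here is a quick back-of-envelope sketch in Python. It is a simplified toy model, not a representation of any actual Intel or Tilera interconnect, and the 128-core count and 8x16 mesh shape are just illustrative assumptions.

```python
# Toy model: average shortest-path hop count between cores on a single
# bidirectional ring vs. a 2D mesh. Illustrative only -- real interconnects
# (multiple rings, actual mesh dimensions, buffering) are more complicated.

def ring_avg_hops(n: int) -> float:
    """Average hops between stops on a bidirectional ring with n stops."""
    return sum(min(d, n - d) for d in range(n)) / n

def mesh_avg_hops(rows: int, cols: int) -> float:
    """Average Manhattan-distance hops between tiles on a rows x cols mesh."""
    n = rows * cols
    total = sum(
        abs(a // cols - b // cols) + abs(a % cols - b % cols)
        for a in range(n)
        for b in range(n)
    )
    return total / (n * n)

print(f"128-stop ring: {ring_avg_hops(128):.1f} average hops")
print(f"8x16 mesh:     {mesh_avg_hops(8, 16):.1f} average hops")
```

On this toy model, a 128-stop ring averages around 32 hops between cores, while an 8x16 mesh averages under 8. That gap is the intuition behind why high-core-count designs, from Tilera onward, moved to meshes.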

While Tilera was spot-on regarding higher core counts and a mesh fabric to connect those cores, it missed perhaps the most crucial aspect: the instruction set architecture.

2014 marked the next step for Tilera, and it was a big shift. The company was acquired by EZChip, an Israeli company focused on making network chips. EZChip’s big insight was to move from Tilera’s proprietary architecture to Arm.

An important bit of context at this stage is the overall market dynamic. Intel in 2017 was in its 14nm prime time, but TSMC was catching up rapidly. The next 10nm node was where Intel fell behind, but the 2015 Broadwell and 2017 Skylake architectures were on 14nm (Intel’s big process technology until at least Q1 2021). More importantly for this segment, the Intel Xeon D-1500 “Broadwell-DE” was built largely for Facebook (now Meta), and the 2018 Skylake-D series was also built on 14nm. While Ice Lake server processors (3rd Gen Intel Xeon Scalable) launched in 2021, for the networking and edge side, Ice Lake-D launched in early 2022. If you were building a networking and edge box with Intel, you were still on the same 14nm process entering 2021 that you were on in 2015. Meanwhile, Arm designs were getting better as TSMC was innovating.

In 2014-2015, Arm was emerging as a winner, especially as areas like the smartphone market were taking off. While today Arm is a very reasonable and popular architecture to build around, one has to remember that it was not until Ubuntu 16.04 LTS in April 2016 that you could (somewhat) buy a Cavium ThunderX server and install from an Ubuntu Arm ISO image, assuming you had the right firmware on your ThunderX server. A decade ago, using Arm servers was a chore. Many simple tasks that the x86 side took for granted required developer effort. For those who lived it, before Ubuntu 16.04 LTS in April 2016, you needed dedicated engineers to get Arm server platforms running, and perhaps a spiritual ritual.

Even with all of that, in 2014 Arm’s core counts and SoC volumes were growing rapidly. It was emerging as the next architecture, especially for lower-cost devices. Intel started ratcheting up prices before AMD re-entered the market. Arm became the default DPU choice because Intel was increasing pricing, missing its process innovation targets, and did not offer the same type of integration of x86 into custom designs.

In Israel, Annapurna Labs was making chips based on Arm cores that were being used by Amazon. Early deployments were probably in the 2013 era, and went well enough that Amazon bought Annapurna Labs in 2015. By 2017-2018, AWS Nitro was presented more openly and shocked many in the industry with just how far ahead it was. That Annapurna Labs AWS Nitro project was getting folks in the industry excited for the DPU segment.

In the middle of this early Nitro rollout, EZChip bought Tilera and transitioned to Arm. Mellanox then purchased EZChip, with this idea of an Arm-based network processor, in a deal announced in late 2015 and closed in early 2016. It would be a bit naive to think that the work between Annapurna Labs and AWS did not influence the Tilera-EZChip-Mellanox acquisitions and the transition to Arm.

Mellanox Innova 2 Architecture

Mellanox had a great NIC, but it was looking for a way to add more programmability and features for hyper-scale customers. An example of that might have been something like the Innova 2 line, where a Xilinx FPGA was attached to a ConnectX-5 NIC on the same board. Another example was the Mellanox BlueField line. I remember seeing this when it was two guys at the Mellanox booth at RSA in San Francisco showing off a card nobody understood. It combined a ConnectX NIC with Arm cores from the IP Mellanox bought from EZChip, which EZChip had in turn morphed from Tilera.

The bigger point is that Mellanox did not wait. Instead of doing a clean-sheet design, it effectively took the EZChip Arm packet processing side and married it to a ConnectX NIC, with a PCIe switch connecting the two and providing a link to the host. Put that model onto a PCIe card, and it was ready to go, albeit without a great sense of what to do with it.

Mellanox Bluefield NVMeoF Solution What Is Inside

One of the early use cases was storage. From a 2018 deck, here is a Mellanox BlueField target application.

Mellanox Bluefield NVMeoF Classic SAN

If you squint, you can probably see how the VAST Data folks looked at this, had AIC make a system, and built a successful storage platform business.
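
To make the storage use case a bit more concrete: the idea is that the Arm cores on the card can run a standard NVMe-oF target stack and export drives over the ConnectX fabric ports, replacing a classic SAN head. The sketch below is a generic illustration only, using the stock Linux nvmet configfs interface over RDMA rather than NVIDIA’s or VAST’s actual software; the device path, IP address, and NQN are placeholders.

```python
# Minimal sketch: export a local NVMe drive as an NVMe-oF (RDMA) target using
# the stock Linux nvmet configfs interface. Requires root and the nvmet and
# nvmet-rdma kernel modules. Generic illustration; paths/NQN/IP are made up.
from pathlib import Path

NVMET = Path("/sys/kernel/config/nvmet")
NQN = "nqn.2018-01.example:bluefield-target"   # hypothetical subsystem name

# 1. Create the subsystem and allow any host to connect (lab setting only).
subsys = NVMET / "subsystems" / NQN
subsys.mkdir(parents=True)
(subsys / "attr_allow_any_host").write_text("1")

# 2. Attach a local block device as namespace 1.
ns = subsys / "namespaces" / "1"
ns.mkdir(parents=True)
(ns / "device_path").write_text("/dev/nvme0n1")  # placeholder device
(ns / "enable").write_text("1")

# 3. Create an RDMA port on the fabric-facing interface.
port = NVMET / "ports" / "1"
port.mkdir(parents=True)
(port / "addr_trtype").write_text("rdma")
(port / "addr_adrfam").write_text("ipv4")
(port / "addr_traddr").write_text("192.0.2.10")  # placeholder IP
(port / "addr_trsvcid").write_text("4420")

# 4. Expose the subsystem on that port.
(port / "subsystems" / NQN).symlink_to(subsys)
```

In the BlueField target-appliance picture, a flow like this runs on the card’s Arm cores while the ConnectX side moves the data, which is what made a DPU plus a shelf of flash a credible replacement for a classic SAN controller.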

NVIDIA acquired Mellanox, which had previously acquired EZChip, which had acquired Tilera. Eventually, NVIDIA launched the BlueField-2 DPU, sporting Arm cores and a ConnectX-6 NIC, making it a 200Gbps-generation DPU. The BlueField-2 DPU was easy to use, but it had a fatal flaw: low memory bandwidth on the Arm side. If you want to do high-speed packet processing, or even build a storage controller, you want a lot more memory bandwidth than a single memory channel can provide. That led some customers, such as VAST (which was still on BlueField-1), and others to wait for BlueField-3, where this issue was remedied.
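
A quick sanity check shows why a single memory channel is a problem at this speed. The numbers below are illustrative assumptions (a 200Gbps line rate and one DDR4-3200 channel as the baseline), not a published spec sheet.

```python
# Rough check: can a single DRAM channel keep up with 200Gbps of packets?
# Assumes a store-and-forward path where each packet is written to DRAM on
# receive and read back on transmit. Numbers are illustrative.

line_rate_gbps = 200
line_rate_gbytes = line_rate_gbps / 8               # 25 GB/s of packet data

# One DDR4-3200 channel: 3200 MT/s * 8 bytes per transfer ~= 25.6 GB/s peak.
ddr4_channel_gbytes = 3200e6 * 8 / 1e9

dram_traffic_gbytes = 2 * line_rate_gbytes          # write + read per packet

print(f"Packet data rate:            {line_rate_gbytes:.1f} GB/s")
print(f"DRAM traffic (write + read): {dram_traffic_gbytes:.1f} GB/s")
print(f"One DDR4-3200 channel peak:  {ddr4_channel_gbytes:.1f} GB/s")
```

Even before accounting for real-world DRAM efficiency or the Arm cores doing anything else with memory, the traffic is roughly double what one channel can theoretically deliver, which lines up with why customers waited for BlueField-3.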

NVIDIA BlueField-2 DPU 2x 100GbE

By the post-pandemic era, AWS Nitro and Annapurna Labs were still the gold standard for DPUs. NVIDIA still had great NIC IP from Mellanox, but big customers wanted something like the AWS Nitro. That leads us to a unique setup where DPUs from NVIDIA, AMD, and Intel appear similar at a high level, despite their claims of uniqueness.

The Modern DPU Trio

While the previous section focused on history, it provides important context for why NVIDIA, AMD, and Intel have similar architectures.
