NVIDIA is Working Around Limitations with its Grace CPUs to Prevent a Broadcom Cash Cow in Systems
NVIDIA Grace is missing some common I/O features
To say that NVIDIA Grace, the company’s Arm-based data center CPU, is integral to its future would be an understatement. Moving customers off x86 and onto Arm means that NVIDIA can set its own pace of I/O innovation instead of being beholden to Intel and AMD x86 CPUs. While NVIDIA focuses on Grace’s chip-to-chip connectivity and memory bandwidth, there are two challenges with Grace’s PCIe lanes that it downplays. For one of those challenges, NVIDIA is getting creative with OEMs to keep Broadcom from inserting its cash cow chips into systems.
This week on STH, we discussed how we are finally seeing CXL gain traction, albeit slowly, in the data center. Some may have noted that we did not discuss NVIDIA’s CXL plans in that piece. That might be the higher-level challenge. Even though the ability to iterate on PCIe connectivity faster than Intel and AMD is a major driver for NVIDIA with Grace, those PCIe root complexes are perhaps not as robust as one might find on Intel Xeon and AMD EPYC CPUs.
2022 was weird. We were exiting the pandemic, and many companies were still in heavy work-from-home mindsets. I had one of the first, if not the first, analyst briefings in NVIDIA’s new headquarters building, a ghost town where the cafeteria was not even serving food. That is where I first got to hold the NVIDIA H100.
During that visit, I had the opportunity to sit with several folks who have gone on to lead the data center AI revolution. Over a Guinness with Ian Buck at the top of the building, I distinctly remember asking him whether Grace was a way for NVIDIA to control its own destiny without Intel and AMD.
The NVIDIA H100 is a PCIe Gen5 GPU. That PCIe Gen5 link seems commonplace now that we are on our third generation of x86 PCIe Gen5 server CPUs. In 2022, NVIDIA selected the forthcoming 4th Generation Intel Xeon “Sapphire Rapids,” which was set to launch around the middle of 2022. Instead, it launched early in 2023. The reason for the delay was an extremely serious security issue that would have been a no-go for Intel in production. It was serious enough to require a new stepping of the CPU. The exact issue is under NDA due to its seriousness, but when we first heard the reason in June 2023, it made a lot of sense why Intel had to delay the launch. For NVIDIA, that meant its flagship GPU had to wait months for the new Intel Xeon CPU steppings before it had a PCIe Gen5 host CPU.
AMD was a very valid alternative, especially after winning the reference Ampere A100 generation HGX implementation, but the AMD EPYC 9004 “Genoa” was launched in late 2022 and hit availability closer to the delayed Sapphire Rapids timeline.
NVIDIA does not like to admit this publicly, but if Sapphire Rapids had launched in mid-2022, NVIDIA would have been selling more Hopper GPUs earlier. As the big chip company on the block, NVIDIA could not afford to let its GPU innovation be held back by Intel and AMD, particularly when both of those companies had competing efforts of their own (albeit with AMD Instinct being a much stronger competitor).
Given that NVIDIA needs Grace to direct its own future so that it does not have to wait for Intel and AMD PCIe controllers, some may find it fun to hear about the trade-offs, and how one of them is leading to creative ways to keep Broadcom out of NVIDIA systems.