Last week, I was in Taiwan taking apart many servers, almost all of them GPU-based. As I went vendor to vendor, one thing stood out about the AMD Instinct MI300X and MI325X platforms that not enough people talk about: AMD has made it easy to replace NVIDIA HGX H200 (and H100) platforms with its Instinct GPUs. I have taken apart many GPU servers, and this week I finished our NVIDIA HGX H200 8-GPU server mini-series with the Supermicro SYS-821GE-TNHR review (video here). Even so, I was slightly surprised by what I saw, and I think others might be too.
When AMD shows off its Instinct MI325X GPUs, it often shows the bare package or, as seen above, the top of its OAM form factor modules with heatsinks attached. (Yes, these are working MI325X GPUs.) While they may look impressive, the fact that the GPUs themselves fit into a standard form factor is not what matters. Intel's Gaudi 2 and Gaudi 3 fit that form factor as well.
The AMD Instinct MI325X UBB Design
Key to this entire story is the AMD Instinct MI325X UBB, or Universal Baseboard. If the UBB form factor sounds familiar, we started covering it on STH in 2019, though it has gone through several revisions since then. With the help of the OCP, or should we say the major hyper-scalers behind it, the UBB has become the de facto industry standard for eight-accelerator AI platforms today.
When we see the UBB assembled, it often looks like the above. We can see the GPUs, the board, and often a manufacturer’s sub-chassis around it. Very rarely do folks explain why what you see above is absolutely critical to AMD’s goal of making inroads into the AI server market, currently dominated by NVIDIA.
Perhaps the biggest reason is that AMD is not trying to sell individual OAM GPUs. Instead, it sells the entire assembly to OEMs. Since we had to replace one of the UBBs while we were in Taiwan last week, one of these assemblies was carted into a room in its roughly 85 lb box from Wistron. AMD makes the GPUs, but it sells the assembled UBBs.
While the front of the NVIDIA HGX 8-GPU baseboard has transitioned from the DGX-2's front connectors to today's large NVLink switches and heatsinks, AMD simply has an FPGA (under the small heatsink above) and an SMC on the front.
Here is the NVIDIA HGX H200 for reference with the NVLink switch heatsinks to the left.
In this front section, since AMD uses point-to-point communication rather than a switched architecture, we get an SMC management card with an ASPEED AST2600 management controller onboard.
With AMD being point-to-point and NVIDIA being switched, the fronts of these boards are quite different. The similarity, though, is that in this front section, we do not have the primary interfaces between the GPUs and the rest of the system. The other side is where the real impact happens.
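To make the point-to-point versus switched distinction concrete, here is a minimal sketch. The eight-GPU count comes from the article; the link counting is plain graph math, not an AMD or NVIDIA specification, and the "one aggregate connection per GPU" simplification for the switched case is an illustrative assumption:

```python
from math import comb

gpus = 8

# Point-to-point (AMD's approach): every GPU has a direct link to each of
# the other seven, so the baseboard carries C(8, 2) GPU-to-GPU links.
p2p_links = comb(gpus, 2)  # C(8, 2) = 28

# Switched (NVIDIA NVLink switch approach): each GPU only connects to the
# switch fabric, so GPU-side connections scale linearly with GPU count.
# (Simplified here to one aggregate GPU-to-switch connection per GPU.)
switched_connections = gpus  # 8

print(f"Point-to-point GPU-to-GPU links: {p2p_links}")
print(f"Switched GPU-to-switch connections: {switched_connections}")
```

This is one reason the front of the AMD board can get by with just an FPGA and an SMC: the GPU-to-GPU mesh is routed directly in the baseboard, with no switch silicon needing heatsinks up front.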