Today, AMD announced that it completed the acquisition of ZT Systems. I had a chance to talk to Forrest Norrod, executive vice president and general manager of the Data Center Solutions business unit at AMD, about the deal closing, and I wanted to share some additional perspective based on what I have heard in the market.
AMD is Building the Customized AI Future
In my chat with Forrest, he was very focused on the engineering talent that came along with the acquisition. He also had a very different take from NVIDIA on what will be required to support hyper-scale customers in the future. Taking a step back, the NVIDIA GB200 NVL72 is both a trend-setting marvel and a product that went through a number of teething challenges to get working. In a number of the installations shown in mid-Q4 2024, folks had GB200 NVL72 hardware on site but were still working on getting it fully operational. It is a big step from the HGX platforms that have been powering AI clusters for years to the new NVL72 racks.
In our discussion, Forrest said that he believed that, from a manufacturability standpoint, AMD’s GPUs should not run into the same challenges that NVIDIA had with the GB200 platforms. The ZT Systems acquisition was not so much about manufacturing the silicon as it was about the systems the silicon goes into.
ZT Systems has been building hyper-scale clusters for many years. What many folks miss in the AI discussion is that higher-power and often liquid-cooled AI clusters present far more challenges than CPU-based clusters. The relatively straightforward challenges are things like adding liquid cooling to racks. A significantly more difficult one is managing the scale and complexity.
Just for some sense, a typical CPU-based server might have a low-speed management NIC, then one or two high-speed links per 1U of rack space that go to a standard Ethernet network. Call it 2-3 connections per node. A typical 8-GPU server has one NIC per GPU, one or two NICs per CPU, a management NIC, and then sometimes management NICs for the high-speed NICs. In addition, CDUs, rear door heat exchangers, each rack PDU, and so forth will also have network connections for the telemetry and management networks.
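To make that difference concrete, here is a minimal back-of-the-envelope tally. All of the per-node counts and the rack infrastructure list are assumptions for a hypothetical configuration, not any specific vendor design.

```python
# Rough, illustrative connection counts per server; every number here is an assumption.

def ai_server_connections(gpus=8, cpus=2, nic_mgmt_ports=True):
    """Rough connection count for a hypothetical 8-GPU server."""
    backend = gpus                        # one high-speed NIC per GPU for the scale-out fabric
    frontend = 2 * cpus                   # one or two NICs per CPU (two assumed here)
    bmc = 1                               # low-speed management NIC
    mgmt = gpus if nic_mgmt_ports else 0  # optional management ports on the high-speed NICs
    return backend + frontend + bmc + mgmt

def cpu_server_connections():
    """Rough connection count for a typical 1U CPU server."""
    return 1 + 2                          # management NIC plus two high-speed links

# Per-rack infrastructure gear also lands on the telemetry/management network.
rack_infrastructure = 1 + 1 + 2           # assumed: CDU, rear door heat exchanger, two PDUs

print(f"CPU server: {cpu_server_connections()} connections")
print(f"8-GPU server: {ai_server_connections()} connections")
print(f"Rack infrastructure adds another {rack_infrastructure}")
```

Even with conservative assumptions, the per-server count roughly triples or more, before counting the rack-level gear.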
The net result is that each server uses many network connections. Those connections terminate in different locations in the data center, so the optical cabling must be the correct type (single-mode or multi-mode) and must be cut and bundled to the correct lengths. Labels need to be created so that the right cable is plugged into the right port. There are plenty of stories in the industry where folks found that the reason their short-range optics were unreliable over multi-mode fiber was that the optical cables were cut to 120m, not 100m, lengths.
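A minimal sketch of the kind of pre-cabling sanity check this implies: validate each planned run against the reach of its optic and media type before anything is cut, bundled, and labeled. The reach figures and the plan entries below are illustrative, not a real bill of materials.

```python
# Check planned fiber runs against optic reach limits (illustrative numbers only).
REACH_LIMITS_M = {
    ("SR", "multi-mode"): 100,      # short-range optics, ~100m as in the example above
    ("LR", "single-mode"): 10_000,  # long-range optics over single-mode
}

cable_plan = [
    # (label, optic, media, run length in meters) - hypothetical entries
    ("rack12-gpu03-port1", "SR", "multi-mode", 85),
    ("rack12-gpu03-port2", "SR", "multi-mode", 120),  # the failure story above
    ("rack12-gpu03-mgmt",  "LR", "single-mode", 450),
]

for label, optic, media, length in cable_plan:
    limit = REACH_LIMITS_M.get((optic, media))
    if limit is None:
        print(f"{label}: unknown optic/media combination {optic}/{media}")
    elif length > limit:
        print(f"{label}: {length}m exceeds {limit}m reach for {optic} over {media}")
    else:
        print(f"{label}: OK ({length}m of {limit}m)")
```

At hyper-scale, that check happens across tens of thousands of runs, which is exactly the kind of process engineering a systems builder has to have figured out.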
Network cabling is just one example. Powering the systems takes an enormous amount of heavy copper cabling, and one has to deal with rapid swings between peaks and valleys of power usage that can break power distribution components. Eventually (by 2027 for high-density AI), coolant distribution units (CDUs) will need to move out of the racks, so the CDUs and the liquid loops need to be properly sized.
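To give a sense of why those loops have to be sized deliberately, here is a minimal sketch of the heat-balance arithmetic behind CDU flow rates. The rack power and allowable temperature rise are assumed, illustrative numbers, not figures from AMD or NVIDIA.

```python
# Rough loop-sizing arithmetic for a simple water-based loop (assumed figures).
RACK_POWER_W = 120_000      # assumed liquid-cooled heat load per rack
DELTA_T_C = 10.0            # assumed coolant temperature rise across the rack
CP_WATER = 4186.0           # specific heat of water, J/(kg*K)
DENSITY_WATER = 1.0         # kg per liter, approximately

mass_flow_kg_s = RACK_POWER_W / (CP_WATER * DELTA_T_C)
volume_flow_lpm = mass_flow_kg_s / DENSITY_WATER * 60

print(f"Required flow: {mass_flow_kg_s:.2f} kg/s (~{volume_flow_lpm:.0f} L/min) per rack")
# Roughly 2.9 kg/s or ~172 L/min at these assumptions; multiply by racks per CDU
# to size the distribution loop, pumps, and piping.
```

Shrink the allowable temperature rise or raise the rack power and the required flow, pipe diameters, and pump capacity all grow with it.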
This all becomes a giant systems engineering challenge just to design the systems. Another big challenge is that AI data centers vary widely, from the power sources to the weight ratings on floors to the ambient temperature, humidity, and dust profiles.
NVIDIA’s approach to this has been the GB200 NVL72. Customers can choose racks built by one of about 14 different OEMs/ODMs, but the NVL72 is a well-defined solution. If your data center only has room for 20 GPUs per rack, then you are out of luck. Also, if your data center only supports 42U racks, then your options are to find somewhere else, raise doors and roofs, and so forth.
In our discussion, Forrest focused on how AMD’s approach would differ. Instead of a one-size-fits-all approach, AMD would focus on co-engineering solutions with its customers. That is why AMD needs the larger engineering team that buying ZT Systems brings. Customers may want different numbers of GPUs in a rack. The liquid cooling loops are often different. Some customers are looking at not just 52U racks, but 52U racks with other components on top.
ZT Systems not only has relationships with the teams at hyper-scalers building these large clusters, it also has years of accumulated little tricks. The importance of the “little tricks” may not seem huge, but even things like designing ways to get heavy racks onto raised floors quickly, or shipping systems filled with fluid with expansion hoses correctly sized for the temperature and pressure changes of air freight, become areas of engineering IP that accelerate system deployment.
On that note, AMD now has another challenge. It is competing with many of its customers, and in closing the acquisition, it also picked up a manufacturing business. AMD said that it would look for a buyer and sell that business so that it does not remain in that position for long. Prior to the deal closing, AMD had already started discussions to divest this manufacturing business.